Directing ML toward natural hazard mitigation through collaboration

Floods are the most common type of natural disaster, affecting more than 250 million people globally each year. As part of Google’s Crisis Response and our efforts to address the climate crisis, we are using machine learning (ML) models for Flood Forecasting to alert people in areas that are impacted before disaster strikes.

Collaboration between researchers in industry and academia is essential for accelerating progress toward mutual goals in ML-related research. Indeed, Google’s current ML-based flood forecasting approach was developed in collaboration with researchers (1, 2) at the Johannes Kepler University in Vienna, Austria, the University of Alabama, and the Hebrew University of Jerusalem, among others.

Today we discuss our recent Machine Learning Meets Flood Forecasting Workshop, which highlights efforts to bring together researchers from Google and other universities and organizations to advance our understanding of flood behavior and prediction, and build more robust solutions for early detection and warning. We also discuss the Caravan project, which is helping to create an open-source repository for global streamflow data, and is itself an example of a collaboration that developed from the previous Flood Forecasting Meets Machine Learning Workshop.

2023 Machine Learning Meets Flood Forecasting Workshop

The fourth annual Google Machine Learning Meets Flood Forecasting Workshop was held in January. This 2-day virtual workshop hosted over 100 participants from 32 universities, 20 governmental and non-governmental agencies, and 11 private companies. This forum provided an opportunity for hydrologists, computer scientists, and aid workers to discuss challenges and efforts toward improving global flood forecasts, to keep up with state-of-the-art technology advances, and to integrate domain knowledge into ML-based forecasting approaches.

The event included talks from six invited speakers, a series of small-group discussion sessions focused on hydrological modeling, inundation mapping, and hazard alerting–related topics, as well as a presentation by Google on the FloodHub, which provides free, public access to Google’s flood forecasts, up to 7 days in advance.

Invited speakers at the workshop included:

The presentations can be viewed on YouTube:

2023 Flood Forecasting Meets Machine Learning Talks Day 1

2023 Flood Forecasting Meets Machine Learning Talks Day 2

Some of the top challenges highlighted during the workshop were related to the integration of physical and hydrological science with ML to help build trust and reliability; filling gaps in observations of inundated areas with models and satellite data; measuring the skill and reliability of flood warning systems; and improving the communication of flood warnings to diverse, global populations. In addition, participants stressed that addressing these and other challenges will require collaboration between a number of different organizations and scientific disciplines.

The Caravan project

One of the main challenges in conducting successful ML research and creating advanced tools for flood forecasting is the need for large amounts of data for computationally expensive training and evaluation. Today, many countries and organizations collect streamflow data (typically either water levels or flow rates), but it is not standardized or held in a central repository, which makes it difficult for researchers to access.

During the 2019 Machine Learning Meets Flood Forecasting Workshop, a group of researchers identified the need for an open source, global streamflow data repository, and developed ideas around leveraging free computational resources from Google Earth Engine to address the flood forecasting community’s challenge of data collection and accessibility. Following two years of collaborative work between researchers from Google, the school of Geography at the University of Exeter, the Institute for Machine Learning at Johannes Kepler University, and the Institute for Atmospheric and Climate Science at ETH Zurich, the Caravan project was created.

In “Caravan – A global community dataset for large-sample hydrology”, published in Nature Scientific Data, we describe the project in more detail. Caravan is a global dataset for the development and training of hydrological models (see figure below). It provides open-source Python scripts that match streamflow data uploaded by users with essential weather and geographical data previously made public on Google Earth Engine. The repository originally contained data from more than 13,000 watersheds in Central Europe, Brazil, Chile, Australia, the United States, Canada, and Mexico. It has further benefited from community contributions from the Geological Survey of Denmark and Greenland that include streamflow data from most of the watersheds in Denmark. The goal is to continue to develop and grow this repository to enable researchers to access most of the world’s streamflow data. For more information regarding contributing to the Caravan dataset, reach out to caravan@google.com.

Locations of the 13,000 streamflow gauges in the Caravan dataset and the distribution of those gauges in GEnS global climate zones.
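As a rough illustration of the kind of matching the Caravan scripts perform, the following minimal pandas sketch joins user-supplied streamflow records with meteorological forcing data on gauge ID and date. The file names and column layout here are hypothetical placeholders, not the actual Caravan schema.

import pandas as pd

# Hypothetical inputs: user-supplied streamflow records and forcing data
# exported from Google Earth Engine (file names and columns are placeholders).
streamflow = pd.read_csv("my_streamflow.csv", parse_dates=["date"])   # gauge_id, date, streamflow
forcing = pd.read_csv("forcing_data.csv", parse_dates=["date"])       # gauge_id, date, precipitation, temperature, ...

# Align the two sources on gauge ID and date, keeping only days present in both.
matched = streamflow.merge(forcing, on=["gauge_id", "date"], how="inner")
matched.to_csv("matched_timeseries.csv", index=False)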

The path forward

Google plans to continue to host these workshops to help broaden and deepen collaboration between industry and academia in the development of environmental AI models. We are looking forward to seeing what advances might come out of the most recent workshop. Hydrologists and researchers interested in participating in future workshops are encouraged to contact flood-forecasting-meets-ml@google.com.

Read More

Celebrate PyTorch* 2.0 with New Performance Features for AI Developers

Congratulations to the PyTorch Foundation for its release of PyTorch* 2.0! In this blog, I discuss the four features for which Intel made significant contributions to PyTorch 2.0:

  1. TorchInductor
  2. GNN
  3. INT8 Inference Optimization
  4. oneDNN Graph API

We at Intel are delighted to be part of the PyTorch community and appreciate the collaboration with and feedback from our colleagues at Meta as we co-developed these features.

Let’s get started.

1. TorchInductor CPU FP32 Inference Optimized

As part of the PyTorch 2.0 compilation stack, TorchInductor CPU backend optimization brings notable performance improvements via graph compilation over the PyTorch eager mode.

The TorchInductor CPU backend is sped up by leveraging the technologies from the Intel® Extension for PyTorch for Conv/GEMM ops with post-op fusion and weight prepacking, and PyTorch ATen CPU kernels for memory-bound ops with explicit vectorization on top of OpenMP*-based thread parallelization.

With these optimizations on top of the powerful loop fusions in TorchInductor codegen, we achieved up to a 1.7x FP32 inference performance boost over three representative deep learning benchmark suites: TorchBench, HuggingFace, and timm. Training and low-precision support are under development.
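To give a sense of how this compilation path is exercised in user code, here is a minimal sketch of compiling an FP32 model for CPU inference with the default TorchInductor backend; the model and input shape are arbitrary placeholders.

import torch
import torchvision.models as models

model = models.resnet50().eval()       # placeholder FP32 model
compiled = torch.compile(model)        # TorchInductor is the default torch.compile backend in PyTorch 2.0

x = torch.randn(16, 3, 224, 224)
with torch.no_grad():
    y = compiled(x)                    # the first call triggers graph compilation; later calls reuse the compiled code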

See the Improvements

The performance improvements on various backends are tracked on this TorchInductor CPU Performance Dashboard.

Improve Graph Neural Network (GNN) in PyG for Inference and Training Performance on CPU

GNN is a powerful tool to analyze graph structure data. This feature is designed to improve GNN inference and training performance on Intel® CPUs, including the new 4th Gen Intel® Xeon® Scalable processors.

PyTorch Geometric (PyG) is a very popular library built upon PyTorch to perform GNN workflows. Currently on CPU, GNN models of PyG run slowly due to the lack of GNN-related sparse matrix multiplication operations (i.e., SpMM_reduce) and the lack of several critical kernel-level optimizations (scatter/gather, etc.) tuned for GNN compute.

To address this, optimizations are provided for message passing between adjacent neural network nodes:

  • scatter_reduce: performance hotspot in message-passing when the edge index is stored in coordinate format (COO).
  • gather: backward computation of scatter_reduce, specially tuned for the GNN compute when the index is an expanded tensor.
  • torch.sparse.mm with reduce flag: performance hotspot in message-passing when the edge index is stored in compressed sparse row (CSR) format. Supported reduce flags: sum, mean, amax, amin (see the sketch below).
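The toy sketch below illustrates how these kernels are exercised for a simple aggregation step; the graph and feature values are made up, and the reduce overload of torch.sparse.mm is assumed to be the CPU path added in PyTorch 2.0.

import torch

# Toy graph: 4 nodes, 5 directed edges (values are arbitrary).
num_nodes = 4
edge_index = torch.tensor([[0, 1, 1, 2, 3],   # source nodes
                           [1, 0, 2, 3, 2]])  # destination nodes
x = torch.randn(num_nodes, 8)                 # node features

# COO-style aggregation with scatter_reduce: sum incoming messages per destination node.
messages = x[edge_index[0]]                                        # gather source features per edge
out = torch.zeros(num_nodes, 8)
index = edge_index[1].unsqueeze(-1).expand_as(messages)
out.scatter_reduce_(0, index, messages, reduce="sum", include_self=False)

# CSR-style aggregation with the reduce flag of torch.sparse.mm.
adj = torch.sparse_coo_tensor(edge_index, torch.ones(edge_index.size(1)),
                              (num_nodes, num_nodes)).to_sparse_csr()
out_csr = torch.sparse.mm(adj, x, "mean")                          # also accepts "sum", "amax", "amin"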

End-to-end performance benchmark results for both inference and training on 3rd Gen Intel® Xeon® Scalable processors 8380 platform and on 4th Gen 8480+ platform are discussed in Accelerating PyG on Intel CPUs.

Optimize int8 Inference with Unified Quantization Backend for x86 CPU Platforms

The new x86 quantization backend combines the FBGEMM (Facebook General Matrix-Matrix Multiplication) and oneAPI Deep Neural Network Library (oneDNN) backends and replaces FBGEMM as the default quantization backend for x86 platforms. The result: better end-to-end int8 inference performance than FBGEMM alone.

Users access the x86 quantization backend by default for x86 platforms, and the selection between different kernels is automatically done behind the scenes. The rules of selection are based on prior performance testing data done by Intel during feature development. Thus, the x86 backend replaces FBGEMM and may offer better performance, depending on the use case.

The selection rules are:

  • On platforms without VNNI (e.g., Intel® Core™ i7 processors), FBGEMM is always used.
  • On platforms with VNNI (e.g., 2nd-4th Gen Intel® Xeon® Scalable processors and future platforms):
    • For linear ops, FBGEMM is always used.
    • For convolution layers, FBGEMM is used for depth-wise convolution whose layers > 100; otherwise, oneDNN is used.

Note that, as the kernels continue to evolve, the selection rules above are subject to change to achieve better performance. Performance metrics for throughput speedup ratios of the unified x86 backend vs. pure FBGEMM are discussed in [RFC] Unified quantization backend for x86 CPU platforms #83888.
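As a minimal sketch of how the backend is selected in user code, FX graph mode quantization can be pointed at the x86 backend as follows; the tiny model below is only a placeholder.

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.backends.quantized.engine = "x86"   # the unified x86 backend (default on x86 in PyTorch 2.0)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

qconfig_mapping = get_default_qconfig_mapping("x86")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)   # insert observers
prepared(*example_inputs)                                       # calibrate on representative data
quantized = convert_fx(prepared)                                # int8 model; FBGEMM/oneDNN kernels are picked internally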

Leverage oneDNN Graph API to Accelerate Inference on CPU

oneDNN Graph API extends oneDNN with a flexible graph API to maximize the optimization opportunity for generating efficient code on Intel® AI hardware. It automatically identifies the graph partitions to be accelerated via fusion. The fusion patterns focus on fusing compute-intensive operations such as convolution, matmul, and their neighbor operations for both inference and training use cases.

Currently, BFloat16 and Float32 datatypes are supported and only inference workloads can be optimized. BF16 is only optimized on machines with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 support.

Few or no modifications are needed in PyTorch to support newer oneDNN Graph fusions/optimized kernels. To use oneDNN Graph, users can:

  • Use the API torch.jit.enable_onednn_fusion(True) before JIT tracing a model, or
  • Use its context manager, with torch.jit.fuser("fuser3") (a minimal sketch follows this list).
  • For accelerating BFloat16 inference, we rely on eager-mode AMP (Automatic Mixed Precision) support in PyTorch and disable JIT mode’s AMP.
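For example, a minimal FP32 inference sketch using these APIs could look like the following; the model is just a placeholder.

import torch
import torchvision.models as models

torch.jit.enable_onednn_fusion(True)       # enable oneDNN Graph fusion for TorchScript

model = models.resnet50().eval()           # placeholder FP32 model
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    traced = torch.jit.freeze(torch.jit.trace(model, x))
    for _ in range(3):                     # the first couple of runs perform fusion; later runs use the optimized kernels
        y = traced(x)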

See the PyTorch performance tuning guide.

Next Steps

Get the Software

Try out PyTorch 2.0 and realize the performance benefits for yourself from these Intel-contributed features.

We encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the open, standards-based oneAPI multiarchitecture, multivendor programming model that forms the foundation of Intel’s AI software portfolio.

For more details about 4th Gen Intel Xeon Scalable processors, visit AI Platform, where you can learn how Intel is empowering developers to run high-performance, efficient end-to-end AI pipelines.

PyTorch Resources

Read More

Straggler Mitigation On PyTorch DDP By Hierarchical SGD

PyTorch DDP has been widely adopted across the industry for distributed training, and by default it runs synchronous SGD to synchronize gradients across model replicas at every step. The performance of this technique is critical for fast iteration during model exploration as well as for resource and cost savings. To resolve a ubiquitous performance bottleneck introduced by slow nodes in large-scale training, Cruise and Meta co-developed a solution based on the Hierarchical SGD algorithm to significantly accelerate training in the presence of these stragglers.

The Need For Straggler Mitigation

In a DDP setup, a straggler problem can occur when one or more processes run much slower (the “stragglers”) than the other processes. When this happens, all the processes have to wait for the stragglers before synchronizing gradients and completing the communication, which essentially bottlenecks distributed performance to the slowest worker. As a result, even when training relatively small models, the communication cost can still be a major performance bottleneck.

Potential Causes of Stragglers

Severe straggler issues are usually caused by workload imbalance before synchronization, and many factors can contribute to this imbalance. For instance, some data loader workers in the distributed environment can become stragglers, because some input examples can be outliers in terms of the data size, or the data transfer of some examples can be drastically slowed down due to unstable network I/O, or the on-the-fly data transformation costs can have a high variance.

Besides data loading, other phases before gradient synchronization can also cause stragglers, such as unbalanced workloads of embedding table lookup during the forward pass in recommendation systems.

The Appearance of Stragglers

If we profile DDP training jobs that have stragglers, we find that some processes can have much higher gradient synchronization (allreduce) costs than other processes at a certain step. As a result, the distributed performance can be dominated by the communication cost even if the model size is very small. In this case, some processes run faster than the straggler(s) at a step, and hence they have to wait for the stragglers and spend a much longer time on allreduce.
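A minimal sketch of collecting such a trace with the PyTorch profiler is shown below; it assumes ddp_model, optimizer, loss_fn, and data_loader have already been set up for DDP, and the step counts are arbitrary.

from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Assumes ddp_model, optimizer, loss_fn, and data_loader are already constructed for DDP.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),             # record 3 steps after a short warmup
    on_trace_ready=tensorboard_trace_handler("./ddp_trace"),
) as prof:
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()                                        # gradient allreduce (and straggler waiting) happens here
        optimizer.step()
        prof.step()
        if step >= 5:
            break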

The following screenshots show two trace files output by the PyTorch profiler for one use case. Each screenshot profiles 3 steps.

  • The first screenshot shows that a process has a very high allreduce cost in both the first and the third steps, because this process reaches the synchronization phase earlier than the straggler(s) and spends more time waiting. On the other hand, the allreduce cost is relatively small in the second step; this suggests that either 1) there is no straggler at this step, or 2) this process is the straggler among all the processes, so it does not need to wait for any other process.

chart showing allreduce cost

Both the 1st and the 3rd Steps Are Slowed Down by Stragglers

  • The second screenshot shows a normal case without stragglers. In this case, all the gradient synchronizations are relatively short.

chart showing normal case without stragglers

Normal Case Without Stragglers

Hierarchical SGD in PyTorch

Recently, hierarchical SGD has been proposed to optimize communication costs, mainly by reducing the total amount of data transferred in large-scale distributed training, and multiple convergence analyses have been provided (example). As the main novelty of this post, at Cruise we leveraged hierarchical SGD to mitigate stragglers, which may also occur when training relatively small models. Our implementation was upstreamed by Cruise to PyTorch in early 2022.

How Does Hierarchical SGD Work?

As the name implies, hierarchical SGD organizes all the processes into groups at different levels as a hierarchy, and runs synchronization by following the rules below:

  • All the groups at the same level have the same number of processes, and the processes in these groups synchronize at the same frequency concurrently, where the synchronization period is pre-defined by the user.
  • The higher the level of a group, the larger the synchronization period it uses, because synchronization becomes more expensive.
  • When multiple overlapping groups are supposed to synchronize according to their periods, to reduce redundant synchronization and avoid data race across groups, only the highest-level group runs synchronization.

The following figure illustrates an example of 4-level hierarchical SGD among 16 processes on 8 machines, each of which has 2 GPUs:

  1. Level 1: Each process runs mini-batch SGD locally;
  2. Level 2: Each 4-process group across 2 machines runs synchronization every 2 steps;
  3. Level 3: Each 8-process group across 4 machines runs synchronization every 4 steps;
  4. Level 4: The global process group of all 16 processes over 8 machines runs synchronization every 8 steps.

Particularly, when the step number is divisible by 8, only the synchronization at 4) is executed, and when the step number is divisible by 4 but not by 8, only the synchronization at 3) is executed.

An example of 4-level hierarchical SGD among 16 processes on 8 machines, each of which has 2 GPUs
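To make the schedule above concrete, the small helper below (illustrative only, not part of PyTorch) returns the size of the single group that synchronizes at a given step under the highest-level-wins rule:

from collections import OrderedDict

def sync_group_size(step, period_group_size_dict):
    """Return the size of the group that synchronizes at this step, or None for local SGD only."""
    chosen = None
    for period, group_size in period_group_size_dict.items():
        if step % period == 0:
            chosen = group_size   # larger periods come later, so the highest level wins
    return chosen

# The 16-process example above: 4-process groups every 2 steps,
# 8-process groups every 4 steps, all 16 processes every 8 steps.
periods = OrderedDict([(2, 4), (4, 8), (8, 16)])
print([sync_group_size(s, periods) for s in range(1, 9)])
# -> [None, 4, None, 8, None, 4, None, 16]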

Intuitively, hierarchical SGD can be viewed as an extension of local SGD, which only has a two-level hierarchy: every process runs mini-batch SGD locally and then synchronizes globally at a certain frequency. This also helps explain why, just like local SGD, hierarchical SGD synchronizes model parameters instead of gradients; otherwise, gradient descent would be mathematically incorrect when the synchronization period is greater than 1.

Why Can Hierarchical SGD Mitigate Stragglers?

The key insight here is that, when there is a random straggler, it only directly slows down a relatively small group of processes instead of all the processes. Next time another random straggler is very likely to slow down a different small group, and hence a hierarchy can help smooth out the straggler effect.

The example below assumes that there is one random straggler among a total of 8 processes at every step. After 4 steps, vanilla DDP running synchronous SGD is slowed down by a straggler 4 times, because it runs global synchronization at every step. In contrast, hierarchical SGD runs synchronization within groups of 4 processes after the first two steps and then a global synchronization after another two steps. We can see that both the first two and the last two stragglers have a large overlap, and hence the performance loss can be mitigated.

flow diagram

Essentially, the mitigation effect of this hierarchical SGD example lies between that of local SGD with a synchronization period of 2 steps and that of 4 steps. The main advantage of hierarchical SGD over local SGD is better convergence efficiency at the same global synchronization frequency, because hierarchical SGD allows more low-level synchronization. Moreover, hierarchical SGD may achieve a lower global synchronization frequency than local SGD while preserving model parity, leading to higher training performance, especially in large-scale distributed training.

Ease of Use

Straggler mitigation is not a novel topic in distributed training. Multiple approaches have been proposed, such as gossip SGD, data encoding, and gradient coding, as well as some particularly designed for the parameter-server architecture, including backup workers and stale synchronous parallel. However, to the best of our knowledge, before this effort we had not found a good open-source PyTorch implementation of straggler mitigation that can work like a plugin in our training system at Cruise. In contrast, our implementation requires only minimal changes: no need to modify existing code or tune any existing hyperparameters. This is a very appealing advantage for industry users.

As the code example below shows, only a few lines need to be added to the DDP model setup, and the training loop code can remain untouched. As explained previously, hierarchical SGD is an extended form of local SGD, so enabling it is quite similar to enabling local SGD (see the PyTorch docs of PostLocalSGDOptimizer):

  1. Register a post-local SGD communication hook to run a warmup stage of fully synchronous SGD and defer hierarchical SGD.
  2. Create a post-local SGD optimizer that wraps an existing local optimizer and a hierarchical SGD configuration.
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hierarchicalSGD
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)
from torch.distributed.optim import PostLocalSGDOptimizer

ddp_model = nn.parallel.DistributedDataParallel(
    module=model,
    device_ids=[rank],
)

# Register a post-local SGD communication hook for the warmup.
subgroup, _ = torch.distributed.new_subgroups()
state = PostLocalSGDState(subgroup=subgroup, start_localSGD_iter=1_000)
ddp_model.register_comm_hook(state, post_localSGD_hook)

# Wraps the existing (local) optimizer to run hierarchical model averaging.
optim = PostLocalSGDOptimizer(
  optim=optim,
  averager=hierarchicalSGD.HierarchicalModelAverager(
    # The config runs a 4-level hierarchy SGD among 128 processes:
    # 1) Each process runs mini-batch SGD locally;
    # 2) Each 8-process group synchronizes every 2 steps;
    # 3) Each 32-process group synchronizes every 4 steps;
    # 4) All 128 processes synchronize every 8 steps.
    period_group_size_dict=OrderedDict([(2, 8), (4, 32), (8, 128)]),
    # Do not run hierarchical SGD until 1K steps for model parity.
    warmup_steps=1_000)
)

Algorithm Hyperparameters

Hierarchical SGD has two major hyperparameters: period_group_size_dict and warmup_steps.

  • period_group_size_dict is an ordered dictionary mapping from synchronization period to process group size, used for initializing process groups of different sizes in a hierarchy to synchronize parameters concurrently. A larger group is expected to use a larger synchronization period.
  • warmup_steps specifies the number of steps of the warmup stage, during which synchronous SGD runs before hierarchical SGD. Similar to the post-local SGD algorithm, a warmup stage is usually recommended to achieve higher accuracy. The value should be the same as the start_localSGD_iter arg used in PostLocalSGDState when post_localSGD_hook is registered. Typically, the warmup stage should at least cover the beginning of training, when the loss decreases drastically.

A subtle difference between the PyTorch implementation and the initial design proposed by the relevant papers is that, after the warmup stage, by default the processes within each host still run intra-host gradient synchronization at every step. This is because:

  1. The intra-host communication is relatively cheap, and it can usually significantly accelerate the convergence;
  2. The intra-host group (of size 4 or 8 for most industry users) can usually be a good choice of the smallest group of processes that synchronize most frequently in hierarchical SGD. If the synchronization period is 1, then gradient synchronization is faster than model parameter synchronization (a.k.a., model averaging), because DDP automatically overlaps gradient synchronization and the backward pass.

Such intra-host gradient synchronization can be disabled by unsetting post_local_gradient_allreduce arg in PostLocalSGDState.

Demonstration

Now we demonstrate that hierarchical SGD can accelerate distributed training by mitigating stragglers.

Experimental Setup

We compared the performance of hierarchical SGD against local SGD and synchronous SGD on ResNet18 (model size: 45 MB). Since the model is so small, the training is not bottlenecked by data transfer cost during synchronization. To avoid noise incurred by loading data from remote storage, the input data was randomly generated in memory. We varied the number of GPUs used for training from 64 to 256. The batch size per worker was 32, and the number of training iterations was 1,000. Since we don’t evaluate convergence efficiency in this set of experiments, warmup was not enabled.

We also emulated stragglers at a rate of 1% on 128 and 256 GPUs, and 2% on 64 GPUs, to make sure there was at least one straggler at every step on average. These stragglers randomly appear on different CUDA devices. Each straggler stalls for 1 second in addition to the normal per-step training time (~55 ms in our setup). This can be perceived as a practical scenario where 1% or 2% of the input data are outliers in terms of data pre-processing cost (I/O and/or on-the-fly data transformation) during training, and such cost is 20x+ larger than the average.

The code snippet below shows how a straggler can be emulated in the training loop. We applied it to a ResNet model, and it can easily be applied to other models as well.

import random
import time

# Inside the training loop:
loss = loss_fn(y_pred, y)
# Emulate a straggler that lags for 1 second at a rate of 1%.
if random.randint(1, 100) == 1:
    time.sleep(1)
loss.backward()
optimizer.step()

The experiments were conducted on a us-central1 GCP cluster. Each machine has 4 NVIDIA Tesla T4 GPUs with 16 GB of memory per GPU, connected through a 32 Gbit/s Ethernet network. Each instance also has 96 vCPUs and 360 GB of RAM.

Architecture ResNet18 (45MB)
Workers 64, 128, 256
Backend NCCL
GPU Tesla T4, 16 GB memory
Batch size 32 x number of workers
Straggler Duration 1 sec
Straggler Rate 1% on 128 and 256 GPUs, 2% on 64 GPUs

We used multiple configurations for both local SGD and hierarchical SGD. Local SGD runs global synchronization every 2, 4, and 8 steps, respectively.

We ran hierarchical SGD with the following configurations:

  1. On 64 GPUs:
    1. Each 8-process group, each 32-process group, and the global 64-process group synchronizes every 2, 4, and 8 steps, respectively. Denoted as “HSGD 2-8,4-32,8-64”.
    2. Each 32-process group and the global 64-process group synchronizes every 4 and 8 steps, respectively. Denoted as “HSGD 4-32,8-64”.
  2. On 128 GPUs:
    1. Each 8-process group, 32-process group, and the global 128-process group synchronizes every 2, 4, and 8 steps, respectively. Denoted as “HSGD 2-8,4-32,8-128”.
    2. Each 32-process group and the global 128-process group synchronizes every 4 and 8 steps, respectively. Denoted as “HSGD 4-32,8-128”.
  3. On 256 GPUs:
    1. Each 4-process group, 16-process group, 64-process group, and the global 256-process group synchronizes every 1, 2, 4, and 8 steps, respectively. Denoted as “HSGD 1-4,2-16,4-64,8-256”.
    2. Each 8-process group, 64-process group, and the global 256-process group synchronizes every 2, 4, and 8 steps. Denoted as “HSGD 2-8,4-64,8-256”.
    3. Each 16-process group and the global 256-process group synchronizes every 4 and 8 steps, respectively. Denoted as “HSGD 4-16,8-256”.

Experimental Results

The figures below show the speedups of different communication schemes against the baseline of synchronous SGD, with the emulated stragglers. We can make the following observations:

  1. As expected, we can see that both hierarchical SGD and local SGD can achieve a higher speedup with a lower synchronization frequency.
  2. The speedups of the hierarchical SGD schemes are 2.08X-2.45X on 64 GPUs, 2.57X-2.68X on 128 GPUs, and 2.63X-3.25X on 256 GPUs, respectively. This shows that hierarchical SGD can significantly mitigate stragglers, and such mitigation can be more effective at a larger scale.
  3. The performance of local SGD with synchronization periods of 2 steps and 8 steps can be perceived as the lower bound and upper bound, respectively, of the experimented hierarchical SGD schemes. This is because the hierarchical SGD schemes synchronize globally less frequently than every 2 steps, while their low-level synchronization within small groups adds extra overhead compared with global synchronization every 8 steps.

Overall, hierarchical SGD provides a finer-grained trade-off between communication cost and model quality than local SGD. Therefore, when local SGD at a relatively large synchronization period like 8 or 4 cannot give satisfactory convergence efficiency, hierarchical SGD has a much better chance of achieving both a good speedup and model parity.

Since only simulated data was used in the experiments, we did not demonstrate model parity here, which in practice can be achieved in two ways:

  1. Tuning the hyperparameters, including both the hierarchy and the warmup steps;
  2. For some cases, hierarchical SGD could lead to slightly lower quality than the original model for the same number of training steps (i.e., a lower convergence rate), but with a speedup of 2x+ per training step, it is still possible to achieve model parity with more steps yet less total training time.

Speedups on 64 GPUs

Speedups on 128 GPUs

Speedups on 256 GPUs

Limitations

Before applying hierarchical SGD to straggler mitigation, the user should be aware of a few limitations of this approach:

  1. This approach can only mitigate non-persistent stragglers, which occur on different workers at different times. For persistent stragglers, which can be caused by hardware degradation or a network issue on a specific host, the same low-level subgroup is slowed down every time, leading to nearly no straggler mitigation.
  2. This approach can only mitigate low-frequency stragglers. For example, if 30% of the workers can randomly become stragglers at every step, then most low-level synchronizations will still be slowed down by stragglers. As a result, hierarchical SGD may not show an obvious performance advantage over synchronous SGD.
  3. Since hierarchical SGD applies model averaging that does not overlap with the backward pass (unlike the gradient averaging used by vanilla DDP), its performance gain from straggler mitigation must outweigh the performance loss of losing that overlap between communication and the backward pass. Therefore, if stragglers slow down training by less than 10%, hierarchical SGD may not bring much speedup. This limitation can be addressed by overlapping the optimizer step and the backward pass in the future.
  4. Since hierarchical SGD is less well-studied than local SGD, there is no guarantee that hierarchical SGD with a finer-grained synchronization granularity can converge faster than certain advanced forms of local SGD, such as SlowMo, which can improve convergence efficiency with slow momentum. However, to the best of our knowledge, these advanced algorithms cannot be natively supported as a PyTorch DDP plugin like hierarchical SGD yet.

Acknowledgements

We would like to thank Cruise teammates Bo Tian, Sergei Vorobev, Eugene Selivonchyk, Tsugn-Hsien Lee, Dan Ring, Ian Ackerman, Lei Chen, Maegan Chew, Viet Anh To, Xiaohui Long, Zeyu Chen, Alexander Sidorov, Igor Tsvetkov, Xin Hu, Manav Kataria, Marina Rubtsova, and Mohamed Fawzy, as well as Meta teammates Shen Li, Yanli Zhao, Suraj Subramanian, Hamid Shojanzeri, Anjali Sridhar and Bernard Nguyen for the support.

Read More

How Project Starline improves remote communication

As companies settle into a new normal of hybrid and distributed work, remote communication technology remains critical for connecting and collaborating with colleagues. While this technology has improved, the core user experience often falls short: conversation can feel stilted, attention can be difficult to maintain, and usage can be fatiguing.

Project Starline renders people at natural scale on a 3D display and enables natural eye contact.

At Google I/O 2021 we announced Project Starline, a technology project that combines advances in hardware and software to create a remote communication experience that feels like you’re together, even when you’re thousands of miles apart. This perception of co-presence is created by representing users in 3D at natural scale, enabling eye contact, and providing spatially accurate audio. But to what extent do these technological innovations translate to meaningful, observable improvement in user value compared to traditional video conferencing?

In this blog we share results from a number of studies across a variety of methodologies, finding converging evidence that Project Starline outperforms traditional video conferencing in terms of conversation dynamics, video meeting fatigue, and attentiveness. Some of these results were previously published while others we are sharing for the first time as preliminary findings.

Improved conversation dynamics

In our qualitative studies, users often describe conversations in Project Starline as “more natural.” However, when asked to elaborate, many have difficulty articulating this concept in a way that fully captures their experience. Because human communication relies partly on unconscious processes like nonverbal behavior, people might have a hard time reflecting on these processes that are potentially impacted by experiencing a novel technology. To address this challenge, we conducted a series of behavioral lab experiments to shed light on what “more natural” might mean for Project Starline. These experiments employed within-subjects designs in which participants experienced multiple conditions (e.g., meeting in Project Starline vs. traditional videoconferencing) in randomized order. This allowed us to control for between-subject differences by comparing how the same individual responded to a variety of conditions, thus increasing statistical power and reducing the sample size necessary to detect statistical differences (sample sizes in our behavioral experiments range from ~ 20 to 30).

In one study, preliminary data suggest Project Starline improves conversation dynamics by increasing rates of turn-taking. We recruited pairs of participants who had never met each other to have unstructured conversations in both Project Starline and traditional video conferencing. We analyzed the audio from each conversation and found that Project Starline facilitated significantly more dynamic “back and forth” conversations compared to traditional video conferencing. Specifically, participants averaged about 2-3 more speaker hand-offs in Project Starline conversations compared to those in traditional video conferencing across a two minute subsample of their conversation (a uniform selection at the end of each conversation to help standardize for interpersonal rapport). Participants also rated their Starline conversations as significantly more natural (“smooth,” “easy,” “not awkward”), higher in quality, and easier to recognize when it was their turn to speak compared to conversations using traditional video conferencing.

In another study, participants had conversations with a confederate in both Project Starline and traditional video conferencing. We recorded these conversations to analyze select nonverbal behaviors. In Project Starline, participants were more animated, using significantly more hand gestures (+43%), head nods (+26%), and eyebrow movements (+49%). Participants also reported a significantly better ability to perceive and convey nonverbal cues in Project Starline than in traditional video conferencing. Together with the turn-taking results, these data help explain why conversations in Project Starline may feel more natural.

We recorded participants to quantify their nonverbal behaviors and found that they were more animated in Project Starline (left) compared to traditional video conferencing (right).

Reduced video meeting fatigue

A well-documented challenge of video conferencing, especially within the workplace, is video meeting fatigue. The causes of video meeting fatigue are complex, but one possibility is that video communication is cognitively taxing because it becomes more difficult to convey and interpret nonverbal behavior. Considering previous findings that suggested Project Starline might improve nonverbal communication, we examined whether video meeting fatigue might also be improved (i.e., reduced) compared to traditional video conferencing.

Our study found preliminary evidence that Project Starline indeed reduces video meeting fatigue. Participants held 30-minute mock meetings in Project Starline and traditional video conferencing. Meeting content was standardized across participants using an exercise adapted from academic literature that emulates key elements of a work meeting, such as brainstorming and persuasion. We then measured video meeting fatigue via the Zoom Exhaustion and Fatigue (ZEF) Scale. Additionally, we measured participants’ reaction times on a complex cognitive task originally used in cognitive psychology. We repurposed this task as a proxy for video meeting fatigue based on the assumption that more fatigue would lead to slower reaction times. Participants reported significantly less video meeting fatigue on the ZEF Scale (-31%) and had faster reaction times (-12%) on the cognitive task after using Project Starline compared to traditional video conferencing.

Increased attentiveness

Another challenge with video conferencing is focusing attention on the meeting at hand, rather than on other browser windows or secondary devices.

In our earlier study on nonverbal behavior, we included an exploratory information-retention task. We asked participants to write as much as they could remember about each conversation (one in Project Starline, and one in traditional video conferencing). We found that participants wrote 28% more in this task (by character count) after their conversation in Project Starline. This could be because they paid closer attention when in Project Starline, or possibly that they found conversations in Project Starline to be more engaging.

To explore the concept of attentiveness further, we conducted a study in which participants wore eye-tracking glasses. This allowed us to calculate the percentage of time participants spent focusing on their conversation partner’s face, an important source of social information in human interaction. Participants had a conversation with a confederate in Project Starline, traditional video conferencing, and in person. We found that participants spent a significantly higher proportion of time looking at their conversation partner’s face in Project Starline (+14%) than they did in traditional video conferencing. In fact, visual attentiveness in Project Starline mirrored that of the in-person condition: participants spent roughly the same proportion of time focusing on their meeting partner’s face in the Project Starline and in-person conditions.

The use of eye-tracking glasses and facial detection software allowed us to quantify participants’ gaze patterns. The video above illustrates how a hypothetical participant’s eye tracking data (red dot) correspond to their meeting partner’s face (white box).

User value in real meetings

The lab-based, experimental approach used in the studies above allows for causal inference while minimizing confounding variables. However, one limitation of these studies is that they are low in external validity — that is, they took place in a lab environment, and the extent to which their results extend to the real world is unclear. Thus, we studied actual users within Google who used Project Starline for their day-to-day work meetings and collected their feedback.

An internal pilot revealed that users derive meaningful value from using Project Starline. We used post-meeting surveys to capture immediate feedback on individual meetings, longer monthly surveys to capture holistic feedback on the experience, and conducted in-depth qualitative interviews with a subset of users. We evaluated Project Starline on concepts such as presence, nonverbal behavior, attentiveness, and personal connection. We found strong evidence that Project Starline delivered across these four metrics, with over 87% of participants expressing that their meetings in Project Starline were better than their previous experiences with traditional video conferencing.

Conclusion

Together, these findings offer a compelling case for Project Starline’s value to users: improved conversation dynamics, reduced video meeting fatigue, and increased attentiveness. Participants expressed that Project Starline was a significant improvement over traditional video conferencing in highly controlled lab experiments, as well as when they used Project Starline for their actual work meetings. We’re excited to see these findings converge across multiple methodologies (surveys, qualitative interviews, experiments) and measurements (self-report, behavioral, qualitative), and we’re eager to continue exploring the implications of Project Starline on human interaction.

Acknowledgments

We’d like to thank Melba Tellez, Eric Baczuk, Jinghua Zhang, Matthew DuVall, and Travis Miller for contributing to visual assets and illustrations.

Read More

Import data from over 40 data sources for no-code machine learning with Amazon SageMaker Canvas

Data is at the heart of machine learning (ML). Including relevant data to comprehensively represent your business problem ensures that you effectively capture trends and relationships so that you can derive the insights needed to drive business decisions. With Amazon SageMaker Canvas, you can now import data from over 40 data sources to be used for no-code ML. Canvas expands access to ML by providing business analysts with a visual interface that allows them to generate accurate ML predictions on their own—without requiring any ML experience or having to write a single line of code. Now, you can import data in-app from popular relational data stores such as Amazon Athena as well as third-party software as a service (SaaS) platforms supported by Amazon AppFlow such as Salesforce, SAP OData, and Google Analytics.

The process of gathering high-quality data for ML can be complex and time-consuming, because the proliferation of SaaS applications and data storage services has created a spread of data across a multitude of systems. For example, you may need to conduct a customer churn analysis using customer data from Salesforce, financial data from SAP, and logistics data from Snowflake. To create a dataset across these sources, you need to log into each application individually, select the desired data, and export it locally, where it can then be aggregated using a different tool. This dataset then needs to be imported into a separate application for ML.

With this launch, Canvas empowers you to capitalize on data stored in disparate sources by supporting in-app data import and aggregation from over 40 data sources. This feature is made possible through new native connectors to Athena and to Amazon AppFlow via the AWS Glue Data Catalog. Amazon AppFlow is a managed service that enables you to securely transfer data from third-party SaaS applications to Amazon Simple Storage Service (Amazon S3) and catalog the data with the Data Catalog with just a few clicks. After your data is transferred, you can simply access the data source within Canvas, where you can view table schemas, join tables within or across data sources, write Athena queries, and preview and import your data. After your data is imported, you can use existing Canvas functionalities such as building an ML model, viewing column impact data, or generating predictions. You can automate the data transfer process in Amazon AppFlow to activate on a schedule to ensure that you always have access to the latest data in Canvas.

Solution overview

The steps outlined in this post provide two examples of how to import data into Canvas for no-code ML. In the first example, we demonstrate how to import data through Athena. In the second example, we show how to import data from a third-party SaaS application via Amazon AppFlow.

Import data from Athena

In this section, we show an example of importing data in Canvas from Athena to conduct a customer segmentation analysis. We create an ML classification model to categorize our customer base into four different classes, with the end goal to use the model to predict which class a new customer will fall into. We follow three major steps: import the data, train a model, and generate predictions. Let’s get started.

Import the data

To import data from Athena, complete the following steps:

  1. On the Canvas console, choose Datasets in the navigation pane, then choose Import.
  2. Expand the Data Source menu and choose Athena.
  3. Choose the correct database and table that you want to import from. You can optionally preview the table by choosing the preview icon.

The following screenshot shows an example of the preview table.

In our example, we segment customers based on the marketing channel through which they have engaged our services. This is specified by the column segmentation, where A is print media, B is mobile, C is in-store promotions, and D is television.

  4. When you’re satisfied that you have the right table, drag the desired table into the Drag and drop datasets to join section.
  5. You can now optionally select or deselect columns, join tables by dragging another table into the Drag and drop datasets to join section, or write SQL queries to specify your data slice. For this post, we use all the data in the table.
  6. To import the data, choose Import data.

Your data is imported into Canvas as a dataset from the specific table in Athena.

Train a model

After your data is imported, it shows up on the Datasets page. At this stage, you can build a model. To do so, complete the following steps:

  1. Select your dataset and choose Create a model.
  2. For Model name, enter your model name (for this post, my_first_model).
  3. Canvas enables you to create models for predictive analysis, image analysis, and text analysis. Because we want to categorize customers, select Predictive analysis for Problem type.
  4. To proceed, choose Create.

On the Build page, you can see statistics about your dataset, such as the percentage of missing values and mean of the data.

  5. For Target column, choose a column (for this post, segmentation).

Canvas offers two types of models that can generate predictions. Quick build prioritizes speed over accuracy, providing a model in 2–15 minutes. Standard build prioritizes accuracy over speed, providing a model in 2–4 hours.

  6. For this post, choose Quick build.
  7. After the model is trained, you can analyze the model accuracy.

The following model categorizes customers correctly 94.67% of the time.

  8. You can optionally view how each column impacts the categorization. In this example, as a customer’s age increases, the age column has less influence on the categorization. To generate predictions with your new model, choose Predict.

Generate predictions

On the Predict tab, you can generate both batch predictions and single predictions. Complete the following steps:

  1. For this post, choose Single prediction to understand what customer segmentation will result for a new customer.

For our prediction, we want to understand what segmentation a customer will be if they are 32 years old and a lawyer by profession.

  2. Replace the corresponding values with these inputs.
  3. Choose Update.

The updated prediction is displayed in the prediction window. In this example, a 32-year-old lawyer is classified in segment D.

Import data from a third-party SaaS application to AWS

To import data from third-party SaaS applications into Canvas for no-code ML, you must first transfer data from the application to Amazon S3 via Amazon AppFlow. In this example, we transfer manufacturing data from SAP OData.

To transfer your data, complete the following steps:

  1. On the Amazon AppFlow console, choose Create flow.
  2. For Flow name, enter a name.
  3. Choose Next.
  4. For Source name, choose your desired third-party SaaS application (for this post, SAP OData).
  5. Choose Create new connection.
  6. In the Connect to SAP OData pop-up window, fill out the authentication details and choose Connect.
  7. For SAP OData object, choose the object containing your data within SAP OData.
  8. For Destination name, choose Amazon S3.
  9. For Bucket details, specify your S3 bucket details.
  10. Select Catalog your data in the AWS Glue Data Catalog.
  11. For User role, choose the AWS Identity and Access Management (IAM) role that the Canvas user will use to access the data.
  12. For Flow trigger, select Run on demand.

Alternatively, you can automate the flow transfer by selecting Run flow on schedule.

  13. Choose Next.
  14. Choose how to map the fields and complete the field mapping. For this post, because there is no corresponding destination database to map to, there is no need to specify the mapping.
  15. Choose Next.

  16. Optionally, add filters if necessary to restrict the data transferred.
  17. Choose Next.
  18. Review your details and choose Create flow.

When the flow is created, a green ribbon appears at the top of the page, indicating that the flow was created successfully.

  19. Choose Run flow.

At this stage, you have successfully transferred your data from SAP OData to Amazon S3.

Now you can import the data from within the Canvas app. To import your data into Canvas, follow the same set of steps as described in the Import the data section earlier in this post. For this example, on the Data source drop-down menu on the Data import page, you can see SAP OData listed.

You are now able to use all existing Canvas functionalities, such as cleaning your data, building an ML model, viewing column impact data, and generating predictions.

Clean up

To clean up the resources provisioned, log out of the Canvas application by choosing Log out in the navigation pane.

Conclusion

With Canvas, you can now import data for no-code ML from 47 data sources through native connectors with Athena and Amazon AppFlow via the AWS Glue Data Catalog. This process enables you to directly access and aggregate data across data sources within Canvas after data is transferred via Amazon AppFlow. You can automate the data transfer to activate on a schedule, which means that you don’t have to go through the process again to refresh your data. With this process, you can create new datasets with your latest data without having to leave the Canvas app. This feature is now available in all AWS Regions where Canvas is available. To get started with importing your data, navigate to the Canvas console and follow the steps outlined in this post. To learn more, refer to Connect to data sources.


About the authors

Brandon Nair is a Senior Product Manager for Amazon SageMaker Canvas. His professional interest lies in creating scalable machine learning services and applications. Outside of work he can be found exploring national parks, perfecting his golf swing or planning an adventure trip.

Sanjana Kambalapally is a Software Development Manager for Amazon SageMaker Canvas, which aims at democratizing machine learning by building no-code ML applications.

Xin Xu is a software development engineer in the Canvas team, where he works on data preparation, among other aspects in no-code machine learning products. In his spare time, he enjoys jogging, reading and watching movies.

Volkan Unsal is a Sr. Frontend Engineer in the Canvas team, where he builds no-code products to make artificial intelligence accessible to humans. In his spare time, he enjoys running, reading, watching e-sports, and martial arts.

Read More

Predicting new and existing product sales in semiconductors using Amazon Forecast

This is a joint post by NXP SEMICONDUCTORS N.V. & AWS Machine Learning Solutions Lab (MLSL)

Machine learning (ML) is being used across a wide range of industries to extract actionable insights from data to streamline processes and improve revenue generation. In this post, we demonstrate how NXP, an industry leader in the semiconductor sector, collaborated with the AWS Machine Learning Solutions Lab (MLSL) to use ML techniques to optimize the allocation of the NXP research and development (R&D) budget to maximize their long-term return on investment (ROI).

NXP directs its R&D efforts largely to the development of new semiconductor solutions where they see significant opportunities for growth. To outpace market growth, NXP invests in research and development to extend or create leading market positions, with an emphasis on fast-growing, sizable market segments. For this engagement, they sought to generate monthly sales forecasts for new and existing products across different material groups and business lines. In this post, we demonstrate how the MLSL and NXP employed Amazon Forecast and other custom models for long-term sales predictions for various NXP products.

“We engaged with the team of scientists and experts at [the] Amazon Machine Learning Solutions Lab to build a solution for predicting new product sales and understand if and which additional features could help inform [the] decision-making process for optimizing R&D spending. Within just a few weeks, the team delivered multiple solutions and analyses across some of our business lines, material groups, and on [an] individual product level. MLSL delivered a sales forecast model, which complements our current way of manual forecasting, and helped us model the product lifecycle with novel machine learning approaches using Amazon Forecast and Amazon SageMaker. While keeping a constant collaborative workstream with our team, MLSL helped us with upskilling our professionals when it comes to scientific excellence and best practices on ML development using AWS infrastructure.”

– Bart Zeeman, Strategist and Analyst at CTO office in NXP Semiconductors.

Goals and use case

The goal of the engagement between NXP and the MLSL team is to predict the overall sales of NXP in various end markets. In general, the NXP team is interested in macro-level sales that include the sales of various business lines (BLs), which contain multiple material groups (MAGs). Furthermore, the NXP team is also interested in predicting the product lifecycle of newly introduced products. The lifecycle of a product is divided into four different phases (Introduction, Growth, Maturity, and Decline). The product lifecycle prediction enables the NXP team to identify the revenue generated by each product to further allocate R&D funding to the products generating the highest amounts of sales or products with the highest potential to maximize the ROI for R&D activity. Additionally, they can predict the long-term sales on a micro level, which gives them a bottom-up look on how their revenue changes over time.

In the following sections, we present the key challenges associated with developing robust and efficient models for long-term sales forecasts. We further describe the intuition behind the various modeling techniques employed to achieve the desired accuracy. We then present the evaluation of our final models, where we compare the sales predictions of the proposed models against those of the market experts at NXP. We also demonstrate the performance of our state-of-the-art point cloud-based product lifecycle prediction algorithm.

Challenges

One of the challenges we faced while using fine-grained or micro-level modeling, such as product-level models for sales prediction, was missing sales data. The data is missing because many products do not record sales in every month. Similarly, for macro-level sales prediction, the length of the historical sales data was limited. Both the missing sales data and the limited length of historical sales data pose significant challenges to model accuracy for long-term sales prediction into 2026. We observed during exploratory data analysis (EDA) that as we move from micro-level sales (product level) to macro-level sales (BL level), missing values become less significant. However, the maximum length of historical sales data (140 months) still posed significant challenges in terms of model accuracy.
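As an illustration of this effect, the short pandas sketch below compares the fraction of missing months at the product level and at the business-line level; it assumes a hypothetical long-format sales table with product_id, business_line, month, and sales columns, which is not the actual NXP data layout.

import pandas as pd

# Hypothetical long-format sales table: one row per (product, month) with recorded sales.
sales = pd.read_csv("sales.csv", parse_dates=["month"])   # columns: product_id, business_line, month, sales
months = pd.date_range(sales["month"].min(), sales["month"].max(), freq="MS")

def missing_fraction(df, key):
    # Fraction of calendar months with no recorded sales, averaged over all series at this level.
    counts = df.groupby(key)["month"].nunique()
    return (1 - counts / len(months)).mean()

print("product level:", missing_fraction(sales, "product_id"))
bl_sales = sales.groupby(["business_line", "month"], as_index=False)["sales"].sum()
print("business-line level:", missing_fraction(bl_sales, "business_line"))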

Modeling techniques

After EDA, we focused on forecasting at the BL and MAG levels and at the product level for one of the largest end markets (the automobile end market) for NXP. However, the solutions we developed can be extended to other end markets. Modeling at the BL, MAG, or product level has its own pros and cons in terms of model performance and data availability. The following table summarizes such pros and cons for each level. For macro-level sales prediction, we employed the Amazon Forecast AutoPredictor for our final solution. Similarly, for micro-level sales prediction, we developed a novel point cloud-based approach.

Macro sales prediction (top-down)

To predict long-term sales values (through 2026) at the macro level, we tested various methods, including Amazon Forecast, GluonTS, and N-BEATS (implemented in GluonTS and PyTorch). Overall, Forecast outperformed all other methods for macro-level sales prediction based on a backtesting approach (described in the Evaluation metrics section later in this post). We also compared the accuracy of AutoPredictor against human predictions.

We also proposed using N-BEATS due to its interpretability. N-BEATS uses a simple but powerful architecture: an ensemble of feedforward networks organized into stacked residual blocks, where each block produces a backcast that is subtracted from its input and a forecast that contributes to the final prediction. The architecture encodes an inductive bias that makes the time series model capable of extracting trend and seasonality (see the following figure). These interpretations were generated using PyTorch Forecasting.
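For readers unfamiliar with N-BEATS, the following minimal PyTorch sketch shows the doubly residual structure described above. It is our simplified illustration (block sizes, horizons, and the toy data are arbitrary assumptions), not the GluonTS or PyTorch Forecasting implementation we actually used.

import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    """One block: an MLP that emits a backcast (what it explains) and a forecast."""
    def __init__(self, backcast_len, forecast_len, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(backcast_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.backcast_head = nn.Linear(hidden, backcast_len)
        self.forecast_head = nn.Linear(hidden, forecast_len)

    def forward(self, x):
        h = self.mlp(x)
        return self.backcast_head(h), self.forecast_head(h)

class NBeats(nn.Module):
    """Stack of blocks with residual connections on the input and a summed forecast."""
    def __init__(self, backcast_len=24, forecast_len=12, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [NBeatsBlock(backcast_len, forecast_len) for _ in range(n_blocks)])

    def forward(self, x):
        residual, forecast = x, 0.0
        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast      # each block removes what it has explained
            forecast = forecast + block_forecast
        return forecast

# Example: forecast the next 12 months of (normalized) sales from the previous 24 months.
model = NBeats()
history = torch.randn(8, 24)   # batch of 8 sales histories
prediction = model(history)    # shape (8, 12)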

Micro sales prediction (bottom-up)

In this section, we discuss a novel method developed to predict the product lifecycle, shown in the following figure, while taking cold start products into consideration. We implemented this method using PyTorch on Amazon SageMaker Studio. The method first converts sales data into a point cloud, where each point represents sales at a certain age of the product. A point cloud-based neural network model is then trained on this data to learn the parameters of the product lifecycle curve (see the following figure). In this approach, we also incorporated additional features, including the product description as a bag of words, to tackle the cold start problem when predicting the product lifecycle curve.

Time series as point cloud-based product lifecycle prediction

We developed a novel point cloud-based approach for product lifecycle and micro-level sales prediction. We also incorporated additional features to further improve the model accuracy for cold start product lifecycle predictions. These features include product fabrication techniques and other categorical information related to the products. Such additional data can help the model predict the sales of a new product even before the product is released on the market (cold start). The following figure demonstrates the point cloud-based approach. The model takes the normalized sales and the age of the product (the number of months since the product was launched) as input. Based on these inputs, the model learns curve parameters during training using gradient descent. During the forecast phase, the learned parameters, along with the features of a cold start product, are used to predict its lifecycle. The large number of missing values in product-level data negatively impacts nearly all existing time series models; this solution mitigates them by combining lifecycle modeling with the treatment of time series data as point clouds.
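To make the idea more concrete, the following is a minimal sketch of this style of model, under assumptions of our own (the curve family, layer sizes, and feature dimensions are illustrative, not the model delivered in the engagement): a small network maps product features to the parameters of a parametric lifecycle curve, and the curve is fit only to the (age, sales) points that exist, so missing months simply drop out of the loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LifecycleModel(nn.Module):
    """Map product features to the parameters of a lifecycle curve fit to (age, sales) points."""
    def __init__(self, feature_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))  # scale, peak location, spread of the curve

    def forward(self, features, ages):
        scale, loc, spread = self.encoder(features).unbind(dim=-1)
        # Log-normal-shaped lifecycle curve (an illustrative choice): sales rise to a peak, then decline.
        z = (torch.log(ages + 1e-6) - loc.unsqueeze(-1)) / F.softplus(spread).unsqueeze(-1)
        return F.softplus(scale).unsqueeze(-1) * torch.exp(-0.5 * z ** 2)

# Toy data: each product contributes only the (age, sales) points that were observed.
features = torch.randn(32, 16)        # e.g., bag-of-words and fabrication features
ages = torch.rand(32, 40) * 60.0      # months since launch of each observed point
sales = torch.rand(32, 40)            # normalized sales observed at those ages
mask = torch.rand(32, 40) > 0.5       # 1 where a month actually has sales data

model = LifecycleModel(feature_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    optimizer.zero_grad()
    loss = ((model(features, ages) - sales) ** 2 * mask).sum() / mask.sum()
    loss.backward()
    optimizer.step()

Because the lifecycle is represented by a smooth parametric curve, a cold start product can be scored from its features alone, before any sales points exist.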

The following figure demonstrates how our point cloud-based lifecycle method handles missing data values and can predict the product lifecycle from very few training samples. The x-axis represents the age of the product, and the y-axis represents its sales. Orange dots represent the training samples, green dots represent the testing samples, and the blue line shows the lifecycle predicted by the model.

Methodology

To predict macro-level sales, we employed Amazon Forecast among other techniques. Similarly, for micro sales, we developed a state-of-the-art point cloud-based custom model. Forecast outperformed all other methods in terms of model performance. We used Amazon SageMaker notebook instances to create a data processing pipeline that extracted training examples from Amazon Simple Storage Service (Amazon S3). The training data was further used as input for Forecast to train a model and predict long-term sales.

Training a time series model using Amazon Forecast consists of three main steps. In the first step, we imported the historical data into Amazon S3. Second, a predictor was trained using the historical data. Finally, we deployed the trained predictor to generate the forecast. In this section, we provide a detailed explanation along with code snippets of each step.

We started by extracting the latest sales data. This step included uploading the dataset to Amazon S3 in the correct format. Amazon Forecast takes three columns as inputs: timestamp, item_id, and target_value (sales data). The timestamp column contains the time of sales, which can be formatted as hourly, daily, and so on. The item_id column contains the names of the sold items, and the target_value column contains the sales values. Next, we used the path of the training data located in Amazon S3, defined the time series dataset frequency (H, D, W, M, Y), defined a dataset name, and identified the attributes of the dataset (mapping the respective columns in the dataset and their data types). We then called the create_dataset function from the Boto3 API to create a dataset with attributes such as Domain, DatasetType, DatasetName, DataFrequency, and Schema. This function returned a JSON object containing the Amazon Resource Name (ARN) of the dataset, which was used in the subsequent steps. See the following code:

import boto3

# Create an Amazon Forecast client (credentials and Region come from the environment).
forecast = boto3.client("forecast")

dataset_path = "PATH_OF_DATASET_IN_S3"
DATASET_FREQUENCY = "M"  # Frequency of dataset (H, D, W, M, Y)
TS_DATASET_NAME = "NAME_OF_THE_DATASET"
TS_SCHEMA = {
   "Attributes": [
      {
         "AttributeName": "item_id",
         "AttributeType": "string"
      },
      {
         "AttributeName": "timestamp",
         "AttributeType": "timestamp"
      },
      {
         "AttributeName": "target_value",
         "AttributeType": "float"
      }
   ]
}

create_dataset_response = forecast.create_dataset(Domain="CUSTOM",
                                                  DatasetType='TARGET_TIME_SERIES',
                                                  DatasetName=TS_DATASET_NAME,
                                                  DataFrequency=DATASET_FREQUENCY,
                                                  Schema=TS_SCHEMA)

ts_dataset_arn = create_dataset_response['DatasetArn']

After the dataset was created, it was imported into Amazon Forecast using the Boto3 create_dataset_import_job function. The create_dataset_import_job function takes the job name (a string value), the ARN of the dataset from the previous step, the location of the training data in Amazon S3 from the previous step, and the time stamp format as arguments. It returns a JSON object containing the import job ARN. See the following code:

TIMESTAMP_FORMAT = "yyyy-MM-dd"
TS_IMPORT_JOB_NAME = "SALES_DATA_IMPORT_JOB_NAME"
TIMEZONE = "UTC"  # Example value; set to the time zone of your timestamps

# ts_s3_path is the S3 location of the training data (dataset_path above) and
# role_arn is the ARN of an IAM role that Amazon Forecast can assume to read it.
ts_s3_path = dataset_path
role_arn = "ARN_OF_IAM_ROLE_FOR_FORECAST"

ts_dataset_import_job_response = forecast.create_dataset_import_job(
    DatasetImportJobName=TS_IMPORT_JOB_NAME,
    DatasetArn=ts_dataset_arn,
    DataSource={
        "S3Config": {
            "Path": ts_s3_path,
            "RoleArn": role_arn
        }
    },
    TimestampFormat=TIMESTAMP_FORMAT,
    TimeZone=TIMEZONE)

ts_dataset_import_job_arn = ts_dataset_import_job_response['DatasetImportJobArn']

The imported dataset was then used to create a dataset group using the create_dataset_group function. This function takes the domain (a string defining the domain of the forecast), the dataset group name, and the list of dataset ARNs as inputs:

DATASET_GROUP_NAME = "SALES_DATA_GROUP_NAME"
DATASET_ARNS = [ts_dataset_arn]

create_dataset_group_response = forecast.create_dataset_group(
    Domain="CUSTOM",
    DatasetGroupName=DATASET_GROUP_NAME,
    DatasetArns=DATASET_ARNS)

dataset_group_arn = create_dataset_group_response['DatasetGroupArn']

Next, we used the dataset group to train forecasting models. Amazon Forecast offers various state-of-the-art models, any of which can be used for training. We used AutoPredictor as our default model. The main advantage of AutoPredictor is that it automatically generates item-level forecasts using the optimal model from an ensemble of six state-of-the-art models based on the input dataset. The Boto3 API provides the create_auto_predictor function for training an auto prediction model; its main input parameters are PredictorName, ForecastHorizon, and ForecastFrequency, along with a DataConfig that references the dataset group ARN. The forecast horizon represents the window size of the future prediction, which can be expressed in hours, days, weeks, months, and so on. Similarly, the forecast frequency represents the granularity of the forecast values, such as hourly, daily, weekly, monthly, or yearly. We mainly focused on predicting the monthly sales of NXP across various BLs. See the following code:

PREDICTOR_NAME = "SALES_PREDICTOR"
FORECAST_HORIZON = 24       # Predict 24 periods ahead
FORECAST_FREQUENCY = "M"    # Monthly granularity

create_auto_predictor_response = forecast.create_auto_predictor(
    PredictorName=PREDICTOR_NAME,
    ForecastHorizon=FORECAST_HORIZON,
    ForecastFrequency=FORECAST_FREQUENCY,
    DataConfig={
        'DatasetGroupArn': dataset_group_arn
    })

predictor_arn = create_auto_predictor_response['PredictorArn']

The trained predictor was then used to generate forecast values. Forecasts were generated using the create_forecast function from the previously trained predictor. This function takes the name of the forecast and the ARN of the predictor as inputs and generates the forecast values for the horizon and frequency defined in the predictor:

FORECAST_NAME = "SALES_FORECAST"

create_forecast_response = forecast.create_forecast(
    ForecastName=FORECAST_NAME,
    PredictorArn=predictor_arn)

Amazon Forecast is a fully managed service that automatically generates training and test datasets and provides various accuracy metrics to evaluate the reliability of the model-generated forecast. However, to build consensus on the predicted data and compare the predicted values with human predictions, we divided our historic data into training data and validation data manually. We trained the model using the training data without exposing the model to validation data and generated the prediction for the length of validation data. The validation data was compared with the predicted values to evaluate the model performance. Validation metrics may include mean absolute percent error (MAPE) and weighted absolute percent error (WAPE), among others. We used WAPE as our accuracy metric, as discussed in the next section.

Evaluation metrics

We first verified the model performance using backtesting to validate the prediction of our forecast model for the long-term sales forecast (2026 sales). We evaluated the model performance using WAPE; the lower the WAPE value, the better the model. The key advantage of WAPE over other error metrics like MAPE is that WAPE weighs the individual impact of each item's sales: it accounts for each product's contribution to total sales when calculating the overall error. For example, if you make an error of 2% on a product that generates $30 million and an error of 10% on a product that generates $50,000, MAPE does not tell the entire story. The 2% error is actually costlier than the 10% error, something you can't see by using MAPE; WAPE, by contrast, accounts for these differences. We also predicted various percentile values for the sales to demonstrate the upper and lower bounds of the model forecast.
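To make the comparison concrete, here is a small snippet that reproduces the arithmetic of the example above (the dollar figures are the illustrative ones from the text, not real sales data):

# Toy example: a 2% error on a $30M product and a 10% error on a $50K product.
actuals   = [30_000_000, 50_000]
forecasts = [30_000_000 * 1.02, 50_000 * 1.10]

abs_errors = [abs(a - f) for a, f in zip(actuals, forecasts)]
mape = sum(e / a for e, a in zip(abs_errors, actuals)) / len(actuals)  # 0.06 -> 6%
wape = sum(abs_errors) / sum(actuals)                                  # ~0.02 -> ~2%
# MAPE averages the two percentage errors equally, whereas WAPE weighs each error by the
# product's sales volume, so the $600K miss dominates the $5K miss in the numerator.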

Macro-level sales prediction model validation

Next, we validated the model performance in terms of WAPE values. We calculated the WAPE value of a model by splitting the data into training and validation sets. For example, to compute the 2019 WAPE value, we trained our model using sales data from 2011–2018 and predicted sales values for the next 12 months (2019 sales). We then calculated the WAPE value using the following formula:
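\mathrm{WAPE} = \frac{\sum_{t} \lvert y_t - \hat{y}_t \rvert}{\sum_{t} \lvert y_t \rvert}

where y_t is the actual sales value in month t and \hat{y}_t is the corresponding forecast (this is the standard WAPE definition).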

We repeated the same procedure to calculate the WAPE values for 2020 and 2021. We evaluated the WAPE for all BLs in the auto end market for 2019, 2020, and 2021. Overall, we observed that Amazon Forecast achieved a 0.33 WAPE value even for 2020 (during the COVID-19 pandemic). In 2019 and 2021, our model achieved WAPE values below 0.1, demonstrating high accuracy.

Macro-level sales prediction baseline comparison

We compared the performance of the macro sales prediction models developed using Amazon Forecast to three baseline models in terms of WAPE value for 2019, 2020, and 2021 (see the following figure). Amazon Forecast either significantly outperformed the other baseline models or performed on par with them for all 3 years. These results further validate the effectiveness of our final model predictions.

Macro-level sales prediction model vs. human predictions

To further validate confidence in our macro-level model, we compared its performance with human-predicted sales values. At the beginning of the fourth quarter every year, market experts at NXP predict the sales value of each BL, taking into consideration global market trends as well as other global indicators that could potentially impact the sales of NXP products. We compared the percent error of the model predictions and the human predictions against the actual sales values in 2019, 2020, and 2021. We trained three models using data from 2011–2018 and predicted the sales values through 2021, then calculated the MAPE against the actual sales values. We compared these with the human predictions made at the end of 2018 (testing the model's 1-year-ahead through 3-year-ahead forecasts). We repeated this process with the predictions made at the end of 2019 (1-year-ahead and 2-year-ahead forecasts) and at the end of 2020 (1-year-ahead forecast). Overall, the model performed on par with the human predictors, or better in some cases. These results demonstrate the effectiveness and reliability of our model.

Micro-level sales prediction and product lifecycle

The following figure depicts how the model behaves using product data while having access to very few observations for each product (namely one or two observations at the input for product lifecycle prediction). The orange dots represent the training data, the green dots represent the testing data, and the blue line represents the model predicted product lifecycle.

The model can be fed more observations for context without the need for re-training as new sales data become available. The following figure demonstrates how the model behaves if it is given more context. Ultimately, more context leads to lower WAPE values.

In addition, we incorporated additional features for each product, including fabrication techniques and other categorical information. These external features helped reduce the WAPE value in the low-context regime, but not in the high-context regime (see the following figure). There are two explanations for this behavior. First, in the high-context regime we need to let the sales data speak for itself, and the additional features can interfere with this process. Second, we need better features: we used 1,000-dimensional one-hot-encoded features (a bag of words), and our conjecture is that better feature engineering techniques could reduce WAPE even further.
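As an illustration, a bag-of-words encoding of this kind can be produced along the following lines (the product descriptions and the use of scikit-learn here are our own assumptions for the sketch; the actual preprocessing in the engagement may differ):

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical product descriptions used only to illustrate the encoding.
descriptions = [
    "32-bit automotive microcontroller with CAN FD support",
    "secure NFC controller for mobile payment applications",
]

# Cap the vocabulary at 1,000 terms and emit binary (one-hot style) indicators.
vectorizer = CountVectorizer(max_features=1000, binary=True)
features = vectorizer.fit_transform(descriptions).toarray()  # shape: (n_products, <=1000)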

Such additional data can help the model predict the sales of new products even before the product is released on the market. For example, the following figure shows how much predictive power the model gets from the external features alone.

Conclusion

In this post, we demonstrated how the MLSL and NXP teams worked together to predict macro- and micro-level long-term sales for NXP. The NXP team can now incorporate these sales predictions in their processes, for example, as input for R&D funding decisions to enhance ROI. We used Amazon Forecast to predict the sales of business lines (macro sales), which we referred to as the top-down approach. We also proposed a novel approach that treats time series as point clouds to tackle the challenges of missing values and cold start at the product level (micro level). We referred to this approach as bottom-up, where we predicted the monthly sales of each product. We further incorporated external features of each product to enhance the performance of the model for cold start products.

Overall, the models developed during this engagement performed on par with human predictions, and in some cases performed better over the long term. These results demonstrate the effectiveness and reliability of our models.

This solution can be adapted to other forecasting problems. For further assistance in designing and developing ML solutions, please feel free to get in touch with the MLSL team.


About the authors

Souad Boutane is a data scientist at NXP-CTO, where she transforms various data into meaningful insights to support business decisions using advanced tools and techniques.

Ben Fridolin is a data scientist at NXP-CTO, where he coordinates on accelerating AI and cloud adoption. He focuses on machine learning, deep learning and end-to-end ML solutions.

Cornee Geenen is a project lead in the Data Portfolio of NXP, supporting the organization in its digital transformation toward becoming data centric.

Bart Zeeman is a strategist with a passion for data and analytics at NXP-CTO, where he drives better data-driven decisions for more growth and innovation.

Ahsan Ali is an Applied Scientist at the Amazon Machine Learning Solutions Lab, where he works with customers from different domains to solve their urgent and expensive problems using state-of-the-art AI/ML techniques.

Yifu Hu is an Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps design creative ML solutions to address customers’ business problems in various industries.

Mehdi Noori is an Applied Science Manager at Amazon ML Solutions Lab, where he helps develop ML solutions for large organizations across various industries and leads the Energy vertical. He is passionate about using AI/ML to help customers achieve their Sustainability goals.

Huzefa Rangwala is a Senior Applied Science Manager at AIRE, AWS. He leads a team of scientists and engineers to enable machine learning based discovery of data assets. His research interests are in responsible AI, federated learning and applications of ML in health care and life sciences.

Read More

Pre-trained Gaussian processes for Bayesian optimization

Pre-trained Gaussian processes for Bayesian optimization

Bayesian optimization (BayesOpt) is a powerful tool widely used for global optimization tasks, such as hyperparameter tuning, protein engineering, synthetic chemistry, robot learning, and even baking cookies. BayesOpt is a great strategy for these problems because they all involve optimizing black-box functions that are expensive to evaluate. A black-box function’s underlying mapping from inputs (configurations of the thing we want to optimize) to outputs (a measure of performance) is unknown. However, we can attempt to understand its internal workings by evaluating the function for different combinations of inputs. Because each evaluation can be computationally expensive, we need to find the best inputs in as few evaluations as possible. BayesOpt works by repeatedly constructing a surrogate model of the black-box function and strategically evaluating the function at the most promising or informative input location, given the information observed so far.

Gaussian processes are popular surrogate models for BayesOpt because they are easy to use, can be updated with new data, and provide a confidence level about each of their predictions. The Gaussian process model constructs a probability distribution over possible functions. This distribution is specified by a mean function (what these possible functions look like on average) and a kernel function (how much these functions can vary across inputs). The performance of BayesOpt depends on whether the confidence intervals predicted by the surrogate model contain the black-box function. Traditionally, experts use domain knowledge to quantitatively define the mean and kernel parameters (e.g., the range or smoothness of the black-box function) to express their expectations about what the black-box function should look like. However, for many real-world applications like hyperparameter tuning, it is very difficult to understand the landscapes of the tuning objectives. Even for experts with relevant experience, it can be challenging to narrow down appropriate model parameters.

In “Pre-trained Gaussian processes for Bayesian optimization”, we consider the challenge of hyperparameter optimization for deep neural networks using BayesOpt. We propose Hyper BayesOpt (HyperBO), a highly customizable interface with an algorithm that removes the need for quantifying model parameters for Gaussian processes in BayesOpt. For new optimization problems, experts can simply select previous tasks that are relevant to the current task they are trying to solve. HyperBO pre-trains a Gaussian process model on data from those selected tasks, and automatically defines the model parameters before running BayesOpt. HyperBO enjoys theoretical guarantees on the alignment between the pre-trained model and the ground truth, as well as the quality of its solutions for black-box optimization. We share strong results of HyperBO both on our new tuning benchmarks for near–state-of-the-art deep learning models and classic multi-task black-box optimization benchmarks (HPO-B). We also demonstrate that HyperBO is robust to the selection of relevant tasks and has low requirements on the amount of data and tasks for pre-training.

In the traditional BayesOpt interface, experts need to carefully select the mean and kernel parameters for a Gaussian process model. HyperBO replaces this manual specification with a selection of related tasks, making Bayesian optimization easier to use. The selected tasks are used for pre-training, where we optimize a Gaussian process such that it can gradually generate functions that are similar to the functions corresponding to those selected tasks. The similarity manifests in individual function values and variations of function values across the inputs.

Loss functions for pre-training

We pre-train a Gaussian process model by minimizing the Kullback–Leibler divergence (a commonly used divergence) between the ground truth model and the pre-trained model. Since the ground truth model is unknown, we cannot directly compute this loss function. To solve for this, we introduce two data-driven approximations: (1) Empirical Kullback–Leibler divergence (EKL), which is the divergence between an empirical estimate of the ground truth model and the pre-trained model; (2) Negative log likelihood (NLL), which is the sum of negative log likelihoods of the pre-trained model for all training functions. The computational cost of EKL or NLL scales linearly with the number of training functions. Moreover, stochastic gradient–based methods like Adam can be employed to optimize the loss functions, which further lowers the cost of computation. In well-controlled environments, optimizing EKL and NLL leads to the same result, but their optimization landscapes can be very different. For example, in the simplest case where the function only has one possible input, its Gaussian process model becomes a Gaussian distribution, described by the mean (m) and variance (s). Hence the loss function only has those two parameters, m and s, and we can visualize EKL and NLL as follows:

We simulate the loss landscapes of EKL (left) and NLL (right) for a simple model with parameters m and s. The colors represent a heatmap of the EKL or NLL values, where red corresponds to higher values and blue denotes lower values. These two loss landscapes are very different, but they both aim to match the pre-trained model with the ground truth model.
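As a rough illustration of the NLL variant above, here is a toy sketch of our own (not the HyperBO implementation), assuming the simplified case where every training function is observed on the same inputs and the model is just a constant mean plus an RBF kernel: the Gaussian process log likelihood is summed over all training functions and the hyperparameters are optimized with Adam.

import math
import torch

def gp_nll(Y, X, log_lengthscale, log_outputscale, log_noise, mean_const):
    """Summed negative log likelihood of t training functions Y (t, n) observed at inputs X (n, d)."""
    n = X.shape[0]
    sq_dists = torch.cdist(X, X).pow(2)
    K = torch.exp(log_outputscale) * torch.exp(-0.5 * sq_dists / torch.exp(log_lengthscale) ** 2)
    K = K + torch.exp(log_noise) * torch.eye(n)
    L = torch.linalg.cholesky(K)
    resid = Y - mean_const                       # constant mean over all inputs
    alpha = torch.cholesky_solve(resid.T, L)     # K^{-1} (Y - m), shape (n, t)
    quad = (resid.T * alpha).sum()               # sum_i (y_i - m)^T K^{-1} (y_i - m)
    logdet = 2.0 * torch.log(torch.diag(L)).sum() * Y.shape[0]
    return 0.5 * (quad + logdet + Y.numel() * math.log(2.0 * math.pi))

# Toy "training functions": noisy observations of a shared function at 50 shared inputs.
X = torch.rand(50, 2)
Y = torch.sin(4.0 * X.sum(dim=1)).repeat(10, 1) + 0.1 * torch.randn(10, 50)

# Pre-train the GP hyperparameters (in log space) plus a constant mean with Adam.
params = [torch.zeros((), requires_grad=True) for _ in range(4)]
optimizer = torch.optim.Adam(params, lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    loss = gp_nll(Y, X, *params)
    loss.backward()
    optimizer.step()

In HyperBO itself, the mean and kernel are parameterized by neural networks rather than a constant mean and a single lengthscale, but the pre-training loop follows the same pattern.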

Pre-training improves Bayesian optimization

In the BayesOpt algorithm, decisions on where to evaluate the black-box function are made iteratively. The decision criteria are based on the confidence levels provided by the Gaussian process, which are updated in each iteration by conditioning on previous data points acquired by BayesOpt. Intuitively, the updated confidence levels should be just right: not overly confident or too unsure, since in either of these two cases, BayesOpt cannot make the decisions that can match what an expert would do.

In HyperBO, we replace the hand-specified model in traditional BayesOpt with the pre-trained Gaussian process. Under mild conditions and with enough training functions, we can mathematically verify good theoretical properties of HyperBO: (1) Alignment: the pre-trained Gaussian process guarantees to be close to the ground truth model when both are conditioned on observed data points; (2) Optimality: HyperBO guarantees to find a near-optimal solution to the black-box optimization problem for any functions distributed according to the unknown ground truth Gaussian process.

We visualize the Gaussian process (areas shaded in purple are 95% and 99% confidence intervals) conditioned on observations (black dots) from an unknown test function (orange line). Compared to traditional BayesOpt without pre-training, the confidence levels predicted by HyperBO capture the unknown test function much better, which is a critical prerequisite for Bayesian optimization.

Empirically, to define the structure of pre-trained Gaussian processes, we choose to use very expressive mean functions modeled by neural networks, and apply well-defined kernel functions on inputs encoded to a higher dimensional space with neural networks.

To evaluate HyperBO on challenging and realistic black-box optimization problems, we created the PD1 benchmark, which contains a dataset for multi-task hyperparameter optimization for deep neural networks. PD1 was developed by training tens of thousands of configurations of near–state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. PD1 contains approximately 50,000 hyperparameter evaluations from 24 different tasks (e.g., tuning Wide ResNet on CIFAR100) with roughly 12,000 machine days of computation.

We demonstrate that when pre-training for only a few hours on a single CPU, HyperBO can significantly outperform BayesOpt with carefully hand-tuned models on unseen challenging tasks, including tuning ResNet50 on ImageNet. Even with only ~100 data points per training function, HyperBO can perform competitively against baselines.

Tuning validation error rates of ResNet50 on ImageNet and Wide ResNet (WRN) on the Street View House Numbers (SVHN) dataset and CIFAR100. By pre-training on only ~20 tasks and ~100 data points per task, HyperBO can significantly outperform traditional BayesOpt (with a carefully hand-tuned Gaussian process) on previously unseen tasks.

Conclusion and future work

HyperBO is a framework that pre-trains a Gaussian process and subsequently performs Bayesian optimization with a pre-trained model. With HyperBO, we no longer have to hand-specify the exact quantitative parameters in a Gaussian process. Instead, we only need to identify related tasks and their corresponding data for pre-training. This makes BayesOpt both more accessible and more effective. An important future direction is to enable HyperBO to generalize over heterogeneous search spaces, for which we are developing new algorithms by pre-training a hierarchical probabilistic model.

Acknowledgements

The following members of the Google Research Brain Team conducted this research: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. We’d like to thank Zelda Mariet and Matthias Feurer for help and consultation on transfer learning baselines. We’d also like to thank Rif A. Saurous for constructive feedback, and Rodolphe Jenatton and David Belanger for feedback on previous versions of the manuscript. In addition, we thank Sharat Chikkerur, Ben Adlam, Balaji Lakshminarayanan, Fei Sha and Eytan Bakshy for comments, and Setareh Ariafar and Alexander Terenin for conversations on animation. Finally, we thank Tom Small for designing the animation for this post.

Read More

Counterfactual Logit Pairing

Counterfactual Logit Pairing

Posted by Bhaktipriya Radharapu, Software Engineer

TensorFlow Model Remediation is an open source toolkit that showcases solutions to help mitigate unfair bias in Machine Learning models. The toolkit offers resources to build fairer models for everyone – in line with Google’s AI Principles. Today, we’re excited to announce a new technique within the TensorFlow Model Remediation Library called Counterfactual Logit Pairing (CLP) to address unintended bias in ML models.

ML models are prone to making incorrect predictions when a sensitive attribute in an input is removed or replaced, leading to unintended bias. For instance, the Perspective API, used to identify offensive or toxic text in comments, revealed a positive correlation between identity terms referencing race or sexual orientation and the predicted toxicity score. For example, the phrase “I am a lesbian” received a toxicity score of 0.51, while “I am a man” received a lower toxicity score of 0.2. This correlation resulted in higher toxicity scores for some identity terms, even when used non-pejoratively. For more information on the Perspective API, see the blog post on unintended bias and identity terms.

Counterfactual Logit Pairing (CLP) is a technique that addresses such issues to ensure that a model’s prediction doesn’t change when a sensitive attribute referenced in an example is either removed or replaced. It improves a model’s robustness to such perturbations, and can positively influence a model’s stability, fairness, and safety.

CLP mitigates such counterfactual fairness issues at training time. It does so by adding an additional loss to the model’s training loss, which penalizes the difference in the model’s outputs between training examples and their counterfactuals.
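Conceptually, the combined objective looks like the following sketch (our simplified illustration of the idea, not the TensorFlow Model Remediation implementation, which is applied through the CounterfactualModel wrapper shown later in this post):

import tensorflow as tf

def clp_total_loss(model, x, y, x_counterfactual, loss_weight=1.0):
    """Task loss plus a penalty on the logit gap between originals and counterfactuals."""
    # `model` is assumed to output one logit per example; y has the matching shape.
    logits = model(x, training=True)
    cf_logits = model(x_counterfactual, training=True)
    task_loss = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(y, logits, from_logits=True))
    clp_loss = tf.reduce_mean(tf.square(logits - cf_logits))  # pairwise MSE on logits
    return task_loss + loss_weight * clp_loss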

Another advantage of CLP is that it can be used even on unlabeled data. As long as the model treats the counterfactual examples similarly, you can validate that your model is adhering to counterfactual fairness.

For an in-depth discussion on this topic, see research on counterfactual fairness, adversarial logit pairing, and counterfactual logit pairing.

Counterfactual Logit Pairing Walkthrough

The CLP with Keras codelab provides an end-to-end example. In this overview, we’ll emphasize key points from the notebook, while providing additional context.

The notebook trains a text classifier to identify toxic content. This type of model attempts to identify content that is rude, disrespectful or otherwise likely to make someone leave a discussion, and assigns the content a toxicity score. For this task, our baseline model will be a simple Keras sequential model pre-trained on the Civil Comments dataset.

We will use CLP to avoid having identity terms unfairly skew what is classified as offensive. We consider a narrow class of counterfactuals that involves removing gender and sexual orientation related identity tokens in the input, such as removing “gay” in the input “I’m a gay person” to create the counterfactual example “I’m a person.”

The high-level steps will be to:

  1. Calculate flip rate and flip count of the classifier on original and counterfactual examples.
  2. Build a counterfactual dataset using CounterfactualPackedInputs by performing a naive ablation based on term matching.
  3. Improve performance on flip rate and flip count by training with CLP.
  4. Evaluate the new model’s performance on flip rate and flip count.

Be aware that this is a minimal workflow to demonstrate usage of the CLP technique, and not a complete approach to fairness in machine learning. CLP addresses one specific challenge that may impact fairness in machine learning. See the Responsible AI toolkit for additional information on responsible AI and tools that can be used to complement CLP.

In a production setting, you would want to approach each of these steps with more rigor. For example:

  • Consider the fairness goals of your model. What qualifies as “fair” for your model? Which definitions of fairness are you trying to achieve?
  • Consider when counterfactual pairs should have the same prediction. Many syntactic counterfactuals generated by token substitution may not require identical output. Consider the application space and the potential societal impact of your model and understand when the outputs should be the same and when they shouldn’t be.
  • Consider using semantically and grammatically grounded counterfactuals instead of heuristic based ablations.
  • Experiment with the configuration of CLP by tuning hyperparameters to get optimal performance.

Let’s begin by examining the flip count and flip rate of the original model on the counterfactual examples. The flip count measures the number of times the classifier changes its decision when the identity term in a given example is changed. The flip rate is the flip count divided by the total number of examples, that is, the fraction of examples for which the classifier’s decision flips.
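In code, these two quantities reduce to counting disagreements between thresholded decisions; a minimal sketch (not the Fairness Indicators implementation) looks like this:

import numpy as np

def flip_count_and_rate(original_scores, counterfactual_scores, threshold=0.5):
    """Count how many thresholded decisions change between originals and counterfactuals."""
    original_decisions = np.asarray(original_scores) >= threshold
    counterfactual_decisions = np.asarray(counterfactual_scores) >= threshold
    flips = original_decisions != counterfactual_decisions
    return int(flips.sum()), float(flips.mean())

# Example: only the first decision flips, so the flip count is 1 and the flip rate is 1/3.
count, rate = flip_count_and_rate([0.7, 0.2, 0.55], [0.4, 0.25, 0.6])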

Let’s use the Fairness Indicators widget in the notebook to measure the flip rate and flip count. Select flip_rate/overall in the widget. Notice that the overall flip rate for female-related terms is about 13% and for male-related terms about 14%, both of which are higher than the 8% rate for the overall dataset. This means that the model is likely to change its classification based on the presence of gender-related terms.

We’ll now use CLP to try to reduce the model’s flip rate and flip count for gender-related terms in our dataset. We start by creating an instance of CounterfactualPackedInputs, which packs the original_input and counterfactual_data.

CounterfactualPackedInputs(
  original_input=(x, y, sample_weight),
  counterfactual_data=(original_x, counterfactual_x,
                       counterfactual_sample_weight)
)

We next remove instances of gender specific terms using the helper function, build_counterfactual_data. Note that we only include non-pejorative terms, as pejorative terms should have a different toxicity score. Requiring equal predictions across examples with pejorative terms would both weaken the model’s ability to perform its task and potentially increase harm to vulnerable groups.

 

import tensorflow as tf
from tensorflow_model_remediation import counterfactual

sensitive_terms_to_remove = [
    'aunt', 'boy', 'brother', 'dad', 'daughter', 'father', 'female', 'gay',
    'girl', 'grandma', 'grandpa', 'grandson', 'grannie', 'granny', 'he',
    'heir', 'her', 'him', 'his', 'hubbies', 'hubby', 'husband', 'king',
    'knight', 'lad', 'ladies', 'lady', 'lesbian', 'lord', 'man', 'male',
    'mom', 'mother', 'mum', 'nephew', 'niece', 'prince', 'princess',
    'queen', 'queens', 'she', 'sister', 'son', 'uncle', 'waiter',
    'waitress', 'wife', 'wives', 'woman', 'women'
]

# Convert the Pandas DataFrame to a TF Dataset
dataset_train_main = tf.data.Dataset.from_tensor_slices(
    (data_train[TEXT_FEATURE].values, labels_train)).batch(BATCH_SIZE)

# Build counterfactual examples by removing the sensitive terms, then pack them
# together with the original dataset for training.
counterfactual_data = counterfactual.keras.utils.build_counterfactual_dataset(
    original_dataset=dataset_train_main,
    sensitive_terms_to_remove=sensitive_terms_to_remove)

counterfactual_packed_input = counterfactual.keras.utils.pack_counterfactual_data(
    dataset_train_main,
    counterfactual_data)

To train with a Counterfactual model, simply take the original model and wrap it in a CounterfactualModel with a corresponding loss and loss_weight. This will co-train the model on the main classification task and on the debiasing task using the CLP loss.

We are using 1.0 as the default loss_weight, but this is a parameter that can be tuned for your use case, since it depends on your model and product requirements. You should experiment with changing the value to see how it impacts the model, noting that increasing it causes the model to penalize differences on the counterfactual examples more heavily. You can test a range of values to explore the trade-off between task performance and flip rate.

Here, we use the Pairwise Mean Squared Error loss. You can experiment with the other counterfactual losses in the library to see which option offers the best results.

counterfactual_weight = 1.0

counterfactual_model = counterfactual.keras.CounterfactualModel(
    baseline_model,
    loss=counterfactual.losses.PairwiseMSELoss(),
    loss_weight=counterfactual_weight)

# Compile the model normally after wrapping the original model.
# Note that this means we use the baseline model's loss here.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss = tf.keras.losses.BinaryCrossentropy()
counterfactual_model.compile(optimizer=optimizer, loss=loss,
                             metrics=['accuracy'])

counterfactual_model.fit(counterfactual_packed_input,
                         epochs=1)

Once again, we evaluate the results by looking at the flip count and flip rate. Select flip_rate/overall within Fairness Indicators and compare the results for female and male between the two models. You should notice that the flip rates for overall, female, and male have all decreased by about 90%, which leaves the final flip rate for female at approximately 1.3% and for male at approximately 1.4%.

You can get started with Counterfactual by visiting TensorFlow Responsible AI, and learn more about evaluating fairness with Fairness Indicators.

Acknowledgements

The Counterfactual framework was developed in collaboration with
  • Amy Wang, Ben Packer, Bhaktipriya Radharapu, Christina Greer, Nick Blumm, Parker Barnes, Piyush Kumar, Sean O’Keefe, Shivam Jindal, Shivani Poddar, Summer Misherghi, Thomas Greenspan.
This research effort was jointly led by
  • Alex Beutel, Jilin Chen, Tulsee Doshi in collaboration with Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi.
Further, this work was pursued in collaboration with
  • Andrew Smart, Francois Chollet, Molly FitzMorris, Tomer Kaftan, Mark Daoust, Daniel ‘Wolff’ Dobson, Soo Sung.

Read More

Gaming on the Go: GeForce NOW Gives Members More Ways to Play

Gaming on the Go: GeForce NOW Gives Members More Ways to Play

This GFN Thursday explores the many ways GeForce NOW members can play their favorite PC games across the devices they know and love.

Plus, seven new games join the GeForce NOW library this week.

More Ways to Play

Touch games on GeForce NOW
GeForce NOW makes gaming on the go good to go.

GeForce NOW is the ultimate platform for gamers who want to play across more devices than their PC. Thanks to the power of the cloud, game progress can be paused and picked up across any device, whether crashing on the couch with a cell phone or traveling with a tablet.

Stream GeForce NOW on mobile without a controller using enhanced touch controls, enabled for games like Genshin Impact, the popular free-to-play, open-world, action role-playing game from HoYoverse. Members get access to updates as they release, including the upcoming version 3.6, “A Parade of Providence.” It’s available to stream next week and brings a new event and characters sure to delight Genshin Impact fans.

Or stream on the go with GeForce NOW-recommended gamepads — including the Backbone One and the Razer Kishi — which work with Android and iOS devices to further enhance the cloud gaming mobile experience with added comfort. These devices are perfect for extended gaming sessions of up to six hours for Priority members and up to eight hours for Ultimate members.

And with the ability to stream from a high-powered RTX gaming rig in the cloud, GeForce NOW is the only way to play graphics-intensive games like Cyberpunk 2077 and Marvel’s Guardians of the Galaxy on mobile at up to 120 frames per second with ultra-low latency for Ultimate members.

So whether on a tablet, TV, Mac, Chromebook or phone, GeForce NOW members are covered with high-performance cloud streaming. Level up to an Ultimate or Priority membership today to experience all the benefits of PC gaming on the go.

So Fresh

Ravenswatch on GeForce NOW
Band together with fallen heroes of old folk tales and legends to take on the Nightmare.

As always, members can experience new games immediately from the cloud this week, without worrying about download times or system specs. Titles including Ravenswatch, Meet Your Maker, Road 96: Mile 0, TerraScape and Curse of the Sea Rats are all gamepad compatible for gaming on the go.

Plus, popular sci-fi MMORPG Tower of Fantasy brings a boatload of new content, including an all-new map and underwater request missions where players can explore everything from the upper levels of the Grand Sea Island to the deep waters of Dragon Breath Volcano.

It comes on top of the seven games available this week:

  • Road 96: Mile 0 (New release on Steam)
  • Meet Your Maker (New release on Steam)
  • TerraScape (New release on Steam)
  • Curse of the Sea Rats (New release on Steam, April 6)
  • Ravenswatch (New release on Steam, April 6)
  • Supplice (New release on Steam, April 6)
  • Teardown (Steam)

Free members can now claim their Marvel’s Midnight Suns reward. Check the rewards portal to claim Captain Marvel’s Medieval Marvel suit by Saturday, May 6.

Finally, we’ve got our question of the week to wrap up this GFN Thursday. Let us know what device keeps you connected to the cloud in the comments below, on Twitter or Facebook.

Read More

Interactive Fleet Learning

Interactive Fleet Learning


Figure 1: “Interactive Fleet Learning” (IFL) refers to robot fleets in industry and academia that fall back on human teleoperators when necessary and continually learn from them over time.

In the last few years we have seen an exciting development in robotics and artificial intelligence: large fleets of robots have left the lab and entered the real world. Waymo, for example, has over 700 self-driving cars operating in Phoenix and San Francisco and is currently expanding to Los Angeles. Other industrial deployments of robot fleets include applications like e-commerce order fulfillment at Amazon and Ambi Robotics as well as food delivery at Nuro and Kiwibot.