Rethinking the Role of PPO in RLHF

TL;DR: In RLHF, there’s tension between the reward learning phase, which uses human preference in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way?


Figure 1:
This diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and the RL stage, enabling direct updates based on pairwise responses.

Large Language Models (LLMs) have powered increasingly capable virtual assistants, such as GPT-4, Claude-2, Bard and Bing Chat. These systems can respond to complex user queries, write code, and even produce poetry. The technique underlying these amazing virtual assistants is Reinforcement Learning with Human Feedback (RLHF). RLHF aims to align the model with human values and eliminate unintended behaviors, which can often arise due to the model being exposed to a large quantity of low-quality data during its pretraining phase.

Proximal Policy Optimization (PPO), the dominant RL optimizer in this process, has been reported to exhibit instability and implementation complications. More importantly, there’s a persistent discrepancy in the RLHF process: despite the reward model being trained using comparisons between various responses, the RL fine-tuning stage works on individual responses without making any comparisons. This inconsistency can exacerbate issues, especially in the challenging language generation domain.

Given this backdrop, an intriguing question arises: Is it possible to design an RL algorithm that learns in a comparative manner? To explore this, we introduce Pairwise Proximal Policy Optimization (P3O), a method that harmonizes the training processes in both the reward learning stage and RL fine-tuning stage of RLHF, providing a satisfactory solution to this issue.
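
To make "learning in a comparative manner" concrete, below is a minimal sketch of a vanilla pairwise policy-gradient loss in PyTorch. This is our simplified illustration, not the clipped P3O objective from the paper: for two responses sampled for the same prompt, only the difference of their reward-model scores enters the update, mirroring how the reward model itself was trained on comparisons.

```python
# Minimal sketch of a pairwise (relative-feedback) policy-gradient loss.
# Illustrative only: P3O additionally uses clipping and other stabilizers not shown here.
import torch

def pairwise_pg_loss(logp_a, logp_b, reward_a, reward_b):
    # logp_a, logp_b: summed token log-probabilities of two responses to the same
    #                 prompt under the current policy (tensors that require grad).
    # reward_a, reward_b: reward-model scores for the two responses (no grad needed).
    advantage = (reward_a - reward_b).detach()        # only the comparison matters
    return -(advantage * (logp_a - logp_b)).mean()

# usage with per-prompt batches of shape (batch,):
logp_a = torch.randn(4, requires_grad=True)
logp_b = torch.randn(4, requires_grad=True)
r_a, r_b = torch.randn(4), torch.randn(4)
pairwise_pg_loss(logp_a, logp_b, r_a, r_b).backward()
```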

Batch calibration: Rethinking calibration for in-context learning and prompt engineering

Prompting large language models (LLMs) has become an efficient learning paradigm for adapting LLMs to a new task by conditioning on human-designed instructions. The remarkable in-context learning (ICL) ability of LLMs also leads to efficient few-shot learners that can generalize from few-shot input-label pairs. However, the predictions of LLMs are highly sensitive and even biased to the choice of templates, label spaces (such as yes/no, true/false, correct/incorrect), and demonstration examples, resulting in unexpected performance degradation and barriers for pursuing robust LLM applications. To address this problem, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. Though multiple calibration solutions have been provided (e.g., contextual calibration and domain-context calibration), the field currently lacks a unified analysis that systematically distinguishes and explains the unique characteristics, merits, and downsides of each approach.

With this in mind, in “Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering”, we conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that mitigates the bias from a batch of inputs, unifies various prior approaches, and effectively addresses the limitations in previous methods. BC is zero-shot, self-adaptive (i.e., inference-only), and incurs negligible additional costs. We validate the effectiveness of BC with PaLM 2 and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.

Motivation

In pursuit of practical guidelines for ICL calibration, we started with understanding the limitations of current methods. We find that the calibration problem can be framed as an unsupervised decision boundary learning problem. We observe that uncalibrated ICL can be biased towards predicting a class, which we explicitly refer to as contextual bias, the a priori propensity of LLMs to predict certain classes over others unfairly given the context. For example, the prediction of LLMs can be biased towards predicting the most frequent label, or the label towards the end of the demonstration. We find that, while theoretically more flexible, non-linear boundaries (prototypical calibration) tend to be susceptible to overfitting and may suffer from instability for challenging multi-class tasks. Conversely, we find that linear decision boundaries can be more robust and generalizable across tasks. In addition, we find that relying on additional content-free inputs (e.g., “N/A” or random in-domain tokens) as the grounds for estimating the contextual bias is not always optimal and may even introduce additional bias, depending on the task type.

Batch calibration

Inspired by the previous discussions, we designed BC to be a zero-shot, inference-only and generalizable calibration technique with negligible computation cost. We argue that the most critical component for calibration is to accurately estimate the contextual bias. We, therefore, opt for a linear decision boundary for its robustness, and instead of relying on content-free inputs, we propose to estimate the contextual bias for each class from a batch in a content-based manner by marginalizing the output score over all samples within the batch, which is equivalent to measuring the mean score for each class (visualized below).

We then obtain the calibrated probability by dividing the output probability by the contextual prior, which is equivalent to aligning the log-probability (LLM score) distribution to the estimated mean of each class. Notably, because it requires no additional inputs to estimate the bias, this BC procedure is zero-shot, involves only unlabeled test samples, and incurs negligible computation costs. We may either compute the contextual bias once all test samples are seen, or estimate it on the fly as outputs arrive. For the latter, we use a running estimate of the contextual bias, allowing BC’s calibration term to be estimated from a small number of mini-batches and then stabilized as more mini-batches arrive.
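
As a concrete illustration of the two paragraphs above, here is a small NumPy sketch (the function and class names are ours, and details are simplified from the paper): the contextual bias is estimated as the mean LLM score per class over a batch, calibration subtracts it from each sample's log-probabilities (equivalent to dividing probabilities by the contextual prior), and a running mean gives the on-the-fly variant.

```python
# Minimal sketch of Batch Calibration (BC); names and details are illustrative.
import numpy as np

def batch_calibrate(log_probs):
    # log_probs: (batch_size, num_classes) LLM scores for each class option.
    contextual_bias = log_probs.mean(axis=0, keepdims=True)   # mean score per class
    return log_probs - contextual_bias                        # divide probs <=> subtract log-probs

class RunningBC:
    # On-the-fly variant: maintain a running estimate of the bias over mini-batches.
    def __init__(self, num_classes):
        self.bias, self.count = np.zeros(num_classes), 0
    def update_and_calibrate(self, log_probs):
        self.count += 1
        self.bias += (log_probs.mean(axis=0) - self.bias) / self.count  # running mean
        return log_probs - self.bias

preds = np.argmax(batch_calibrate(np.random.randn(16, 3)), axis=1)  # calibrated predictions
```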

Illustration of Batch Calibration (BC). Batches of demonstrations with in-context examples and test samples are passed into the LLM. Due to sources of implicit bias in the context, the score distribution from the LLM becomes biased. BC is a modular and adaptable layer option appended to the output of the LLM that generates calibrated scores (visualized for illustration only).

Experiment design

For natural language tasks, we conduct experiments on 13 more diverse and challenging classification tasks, including the standard GLUE and SuperGLUE datasets. This is in contrast to previous works that only report on relatively simple single-sentence classification tasks. For image classification tasks, we include SVHN, EuroSAT, and CLEVR. We conduct experiments mainly on the state-of-the-art PaLM 2 with size variants PaLM 2-S, PaLM 2-M, and PaLM 2-L. For VLMs, we report the results on CLIP ViT-B/16.

Results

Notably, BC consistently outperforms ICL, yielding a significant performance enhancement of 8% and 6% on small and large variants of PaLM 2, respectively. This shows that BC successfully mitigates the contextual bias from the in-context examples and unleashes the full potential of LLMs in efficient learning and quick adaptation to new tasks. In addition, BC improves over the state-of-the-art prototypical calibration (PC) baseline by 6% on PaLM 2-S, and surpasses the competitive contextual calibration (CC) baseline by another 3% on average on PaLM 2-L. Overall, BC is a generalizable and cheaper technique across all evaluated tasks, delivering stable performance improvements, whereas previous baselines exhibit varying degrees of performance across tasks.

Batch Calibration (BC) achieves the best performance on 1-shot ICL over calibration baselines: contextual calibration (CC), domain-context calibration (DC), and prototypical calibration (PC) on an average of 13 NLP tasks on PaLM 2 and outperforms the zero-shot CLIP on image tasks.

We analyze the performance of BC by varying the number of ICL shots from 0 to 4, and BC again outperforms all baseline methods. We also observe an overall trend for improved performance when more shots are available, where BC demonstrates the best stability.

The ICL performance on various calibration techniques over the number of ICL shots on PaLM 2-S. We compare BC with the uncalibrated ICL, contextual calibration (CC), domain-context calibration (DC), and prototypical calibration (PC) baselines.

We further visualize the decision boundaries of uncalibrated ICL after applying existing calibration methods and the proposed BC. We show success and failure cases for each baseline method, whereas BC is consistently effective.

Visualization of the decision boundaries of uncalibrated ICL, and after applying existing calibration methods and the proposed BC in representative binary classification tasks of SST-2 (top row) and QNLI (bottom row) on 1-shot PaLM 2-S. Each axis indicates the LLM score on the defined label.

Robustness and ablation studies

We analyze the robustness of BC with respect to common prompt engineering design choices that were previously shown to significantly affect LLM performance: the choice and order of in-context examples, the prompt template for ICL, and the label space. First, we find that BC is more robust to the choice of ICL examples and can mostly achieve the same performance with different ICL examples. Additionally, given a single set of ICL shots, altering the order of the ICL examples has minimal impact on BC performance. Furthermore, we analyze the robustness of BC under 10 prompt template designs, where BC shows consistent improvement over the ICL baseline; note that though BC improves performance on its own, a well-designed template can further enhance it. Lastly, we examine the robustness of BC to variations in label space designs (see appendix in our paper). Remarkably, even when employing unconventional choices such as emoji pairs as labels, leading to dramatic oscillations of ICL performance, BC largely recovers performance. This observation demonstrates that BC increases the robustness of LLM predictions under common prompt design choices and makes prompt engineering easier.

Batch Calibration makes prompt engineering easier while being data-efficient. Data are visualized as a standard box plot, which illustrates values for the median, first and third quartiles, and minimum and maximum.

Moreover, we study the impact of batch size on the performance of BC. In contrast to PC, which also leverages an unlabeled estimate set, BC is remarkably more sample efficient, achieving a strong performance with only around 10 unlabeled samples, whereas PC requires more than 500 unlabeled samples before its performance stabilizes.

Batch Calibration makes prompt engineering easier while being insensitive to the batch size.

Conclusion

We first revisit previous calibration methods while addressing two critical research questions from an interpretation of decision boundaries, revealing their failure cases and deficiencies. We then propose Batch Calibration, a zero-shot and inference-only calibration technique. While methodologically simple and easy to implement with negligible computation cost, we show that BC scales from a language-only setup to the vision-language context, achieving state-of-the-art performance in both modalities. BC significantly improves the robustness of LLMs with respect to prompt designs, and we expect easy prompt engineering with BC.

Acknowledgements

This work was conducted by Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy. We would like to thank Mohammad Havaei and other colleagues at Google Research for their discussion and feedback.

From Skylines to Streetscapes: How SHoP Architects Brings Innovative Designs to Life

At SHoP Architects, a New York City-based architectural firm, Mengyi Fan and her team aim to inspire industry professionals to create visual masterpieces by incorporating emerging technologies.

Fan, the director of visualization at SHoP, has expertise that spans the fields of architectural visualization and design. She takes a definitive, novel and enduring approach to designing and planning architecture for city skylines and streetscapes.

Fan and her team work on various architecture visualization projects, from still renderings to real-time walkthroughs. They use multiple creative applications throughout the course of their projects, including Adobe Photoshop, Autodesk 3ds Max, Autodesk Revit and Epic Games’ Unreal Engine. SHoP also collaborates directly with architects at project kickoff, providing images and animations that facilitate quicker decision-making during the design process.

The team consistently integrates new technologies that allow them to explore untapped innovation opportunities, as well as boost research and development. Fan often incorporates real-time and traditional rendering, extended reality and AI into her creative workflows.

To capture all the details that bring the designs together, SHoP uses NVIDIA RTX A5500. Fan is also part of the NVIDIA RTX Ambassador Program, which is designed to amplify the work of professionals from diverse industries who are using RTX technology. Equipped with the latest capabilities of RTX, Fan hopes to continue pushing boundaries in real-time visualization, AI and digital twin applications.

All images courtesy of SHoP Architects.

Redefining Creative Experiences 

3D models play a critical role as the single source of truth, which is why SHoP designers need advanced technology to help them create detailed models and visualizations without creativity or productivity slowdowns.

Previously, the team used CPU-based offerings, which limited the scope of work and research and development they could take on. But with RTX, ‌designers can create and communicate complex designs while continuously collaborating with others.

By tapping into RTX A5500, Fan can prioritize efficiency and high rendering quality without worrying about compute power limitations.

“NVIDIA’s professional RTX GPUs are currently known as the industry standard for graphics cards solutions,” said Fan. “RTX provides us with the performance and power needed to do all the above without worrying about hardware constraints.”

The advanced features of the RTX GPUs allow SHoP designers to explore new ways of representation.

SHoP Architects’ projects have grown in scale, location and diversity, and Fan and her team are constantly learning and adapting from each project, drawing inspiration from diverse areas such as automotive, aviation, film and gaming.

Fan views RTX-powered tools as a means of opening up diverse approaches and solutions to be more widely adopted within the industry. And as an NVIDIA RTX Ambassador, she aims to push past technological boundaries by connecting with like-minded designers and creatives.

See more of Fan’s work below. Discover how NVIDIA RTX can help enhance architectural workflows and learn more about the NVIDIA RTX Ambassador Program.

Flash-Decoding for long-context inference

Motivation

Large language models (LLM) such as ChatGPT or Llama have received unprecedented attention lately. However, they remain massively expensive to run. Even though generating a single response can cost about $0.01 (a few seconds of an 8xA100 instance on AWS), the costs quickly add up when scaling to billions of users, who could have multiple daily interactions with such LLMs. Some use cases are more expensive, like code auto-completion, because it runs whenever a new character is typed. As LLM applications multiply, even small efficiency gains to the generation time can have a massive impact.

LLM inference (or “decoding”) is an iterative process: tokens are generated one at a time. Generating full sentences of N tokens requires N forward passes through the model. Fortunately, it is possible to cache previously calculated tokens: this means that a single generation step does not depend on the context length, except for a single operation, the attention. This operation does not scale well with context length.
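
To make the role of the cache concrete, here is a toy, single-head sketch of one decoding step (our illustration, not any library's API): the new token's key and value are appended to the cache, and the attention over that cache is the only operation whose cost grows with the context length.

```python
# Toy single-head attention decoding step with an explicit KV-cache (illustrative only).
import torch

def attention_step(q_new, k_cache, v_cache, k_new, v_new):
    # Append the new token's key/value to the cache, then attend with a length-1 query.
    k_cache = torch.cat([k_cache, k_new], dim=0)          # (ctx+1, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    scores = k_cache @ q_new / k_cache.shape[-1] ** 0.5   # cost grows with context length
    out = torch.softmax(scores, dim=0) @ v_cache          # everything else is per-token work
    return out, k_cache, v_cache

d = 64
k_cache, v_cache = torch.randn(128, d), torch.randn(128, d)   # 128 cached tokens
out, k_cache, v_cache = attention_step(torch.randn(d), k_cache, v_cache,
                                       torch.randn(1, d), torch.randn(1, d))
```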

There are a number of important emerging use cases of LLMs that utilize a long context. With a longer context, LLMs can reason about longer documents, either to summarize or answer questions about them, they can keep track of longer conversations, or even process entire codebases before writing code. As an example, most LLMs had a context length of up to 2k in 2022 (GPT-3), but we now have open-source LLMs scaling up to 32k (Llama-2-32k), or even 100k more recently (CodeLlama). In this setting, attention takes a significant fraction of time during inference.

When scaling on the batch size dimension, the attention can also become a bottleneck even with relatively small contexts. This is because the amount of memory to read scales with the batch dimension, whereas it only depends on the model size for the rest of the model.

We present a technique, Flash-Decoding, that significantly speeds up attention during inference, bringing up to 8x faster generation for very long sequences. The main idea is to load the keys and values in parallel as fast as possible, then separately rescale and combine the results to maintain the right attention outputs.

Multi-head attention for decoding

During decoding, every new token that is generated needs to attend to all previous tokens, to compute:

softmax(queries @ keys.transpose) @ values

This operation has been optimized with FlashAttention (v1 and v2 recently) in the training case, where the bottleneck is the memory bandwidth to read and write the intermediate results (e.g. Q @ K^T). However, these optimizations don’t apply directly to the inference case, because the bottlenecks are different. For training, FlashAttention parallelizes across the batch size and query length dimensions. During inference, the query length is typically 1: this means that if the batch size is smaller than the number of streaming multiprocessors (SMs) on the GPU (108 for an A100), the operation will only use a small part of the GPU! This is especially the case when using long contexts, because it requires smaller batch sizes to fit in GPU memory. With a batch size of 1, FlashAttention will use less than 1% of the GPU!

FlashAttention

FlashAttention parallelizes across blocks of queries and batch size only, and does not manage to occupy the entire GPU during decoding

The attention can also be done using matrix multiplication primitives – without using FlashAttention. In this case, the operation occupies the GPU entirely, but launches many kernels that write and read intermediate results, which is not optimal.

A faster attention for decoding: Flash-Decoding

Our new approach Flash-Decoding is based on FlashAttention, and adds a new parallelization dimension: the keys/values sequence length. It combines the benefits of the 2 approaches from above. Like FlashAttention, it stores very little extra data to global memory, however it fully utilizes the GPU even when the batch size is small, as long as the context length is large enough.

Flash-Decoding

Flash-Decoding also parallelizes across keys and values, at the cost of a small final reduction step

Flash-Decoding works in 3 steps:

  1. First, we split the keys/values in smaller chunks.
  2. We compute the attention of the query with each of these splits in parallel using FlashAttention. We also write 1 extra scalar per row and per split: the log-sum-exp of the attention values.
  3. Finally, we compute the actual output by reducing over all the splits, using the log-sum-exp to scale the contribution of each split.

All of this is possible because the attention/softmax can be calculated iteratively. In Flash-Decoding, it is used at 2 levels: within splits (like FlashAttention), and across splits to perform the final reduction.
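
Concretely, in our notation (and omitting the max-subtraction used for numerical stability), with scores s_j = q · k_j, each split i over key positions S_i produces a partial output o_i and a log-sum-exp ℓ_i, and the exact attention output is recovered by a log-sum-exp-weighted combination:

\[
o_i = \frac{\sum_{j \in S_i} e^{s_j} v_j}{\sum_{j \in S_i} e^{s_j}},
\qquad
\ell_i = \log \sum_{j \in S_i} e^{s_j},
\qquad
o = \frac{\sum_i e^{\ell_i}\, o_i}{\sum_i e^{\ell_i}}.
\]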

In practice, step (1) does not involve any GPU operation, as the key/value chunks are views of the full key/value tensors. We then have 2 separate kernels to perform respectively (2) and (3).
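
The following NumPy sketch mirrors the three steps for a single query vector. It is a reference illustration of the algorithm rather than the fused CUDA kernels, and the split size is an arbitrary choice here.

```python
# Illustrative NumPy sketch of the Flash-Decoding reduction (not the CUDA kernels).
import numpy as np

def attention_reference(q, K, V):
    s = K @ q                           # (seq_len,) attention scores
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V            # standard softmax(q K^T) V

def flash_decoding_sketch(q, K, V, split_size=1024):
    partial_out, partial_lse = [], []
    for start in range(0, K.shape[0], split_size):        # step 1: split keys/values
        Ks, Vs = K[start:start + split_size], V[start:start + split_size]
        s = Ks @ q                                         # step 2: attention per split
        m = s.max()
        w = np.exp(s - m)
        partial_out.append((w / w.sum()) @ Vs)             # softmax output within the split
        partial_lse.append(m + np.log(w.sum()))            # log-sum-exp of the split's scores
    lse = np.array(partial_lse)
    weights = np.exp(lse - lse.max())
    weights /= weights.sum()                               # step 3: rescale and combine splits
    return weights @ np.stack(partial_out)

q, K, V = np.random.randn(128), np.random.randn(4096, 128), np.random.randn(4096, 128)
assert np.allclose(flash_decoding_sketch(q, K, V), attention_reference(q, K, V))
```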

Benchmarks on CodeLlama 34B

To validate this approach, we benchmark the decoding throughput of the CodeLLaMa-34b. This model has the same architecture as Llama 2, and more generally results should generalize across many LLMs. We measure the decoding speed in tok/s at various sequence lengths, from 512 to 64k, and compare multiple ways of calculating the attention:

  • Pytorch: Running the attention using pure PyTorch primitives (without using FlashAttention)
  • FlashAttention v2
  • FasterTransformer: Uses the FasterTransformer attention kernel
  • Flash-Decoding
  • And an upper bound calculated as the time it takes to read from memory the entire model along with the KV-cache

Flash-Decoding unlocks up to 8x speedups in decoding speed for very large sequences, and scales much better than alternative approaches.

CodeLlama

All approaches perform similarly for small prompts, but scale poorly as the sequence length increases from 512 to 64k, except Flash-Decoding. In this regime (batch size 1) with Flash-Decoding, scaling the sequence length has little impact on generation speed

Component-level micro-benchmarks

We also micro-benchmark the scaled multi-head attention for various sequence lengths and batch sizes on A100 with inputs in f16. We scale the batch size down as the sequence length grows, and use 16 query heads of dimension 128, for 2 key/value heads (grouped-query attention), which matches the dimensions used in CodeLLaMa-34b when running on 4 GPUs.

       
Setting                 PyTorch Eager    Flash-Attention v2.0.9    Flash-Decoding
B=256, seqlen=256       3058.6           390.5                     63.4
B=128, seqlen=512       3151.4           366.3                     67.7
B=64, seqlen=1024       3160.4           364.8                     77.7
B=32, seqlen=2048       3158.3           352                       58.5
B=16, seqlen=4096       3157             401.7                     57
B=8, seqlen=8192        3173.1           529.2                     56.4
B=4, seqlen=16384       3223             582.7                     58.2
B=2, seqlen=32768       3224.1           1156.1                    60.3
B=1, seqlen=65536       1335.6           2300.6                    64.4
B=1, seqlen=131072      2664             4592.2                    106.6

Micro-benchmark of the multi-head attention, run-time in µs. Flash-Decoding achieves almost constant run-time as the sequence length scales up to 64k.

The up to 8x speedup end-to-end measured earlier is made possible because the attention itself is up to 50x faster than FlashAttention. Up until sequence length 32k, the attention time is roughly constant, because Flash-Decoding manages to fully utilize the GPU.

Using Flash-Decoding

Flash-decoding is available:

  • In the FlashAttention package, starting at version 2.2
  • Through xFormers starting at version 0.0.22, via `xformers.ops.memory_efficient_attention`. The dispatcher will automatically use either the Flash-Decoding or FlashAttention approach depending on the problem size. When these approaches are not supported, it can dispatch to an efficient Triton kernel that implements the Flash-Decoding algorithm.

A full example of decoding with LLaMa v2 / CodeLLaMa is available in the FlashAttention repo here and in the xFormers repo here. We also provide a minimal example of an efficient decoding code for LLaMa v1/v2 models, meant to be fast, easy to read, educational and hackable.
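
For orientation, here is a minimal usage sketch of the xFormers entry point mentioned above. The shapes, dtype, and default arguments are assumptions on our part; see the full decoding examples in the repos for production use.

```python
# Minimal usage sketch (assumed shapes/dtype) for xFormers >= 0.0.22 on a CUDA GPU.
import torch
import xformers.ops as xops

B, n_heads, head_dim = 2, 16, 128
cache_len = 8192                      # tokens already present in the KV-cache
q = torch.randn(B, 1, n_heads, head_dim, device="cuda", dtype=torch.float16)   # 1 new token
k = torch.randn(B, cache_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(B, cache_len, n_heads, head_dim, device="cuda", dtype=torch.float16)

# The dispatcher picks Flash-Decoding / FlashAttention / Triton based on the problem size.
out = xops.memory_efficient_attention(q, k, v)   # shape: (B, 1, n_heads, head_dim)
```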

Acknowledgements

Thanks to Erich Elsen, Ashish Vaswani, and Michaël Benesty for suggesting this idea of splitting the KVcache loading. We want to thank Jeremy Reizenstein, Patrick Labatut and Andrew Tulloch for the valuable discussions. We also want to thank Geeta Chauhan and Gregory Chanan for helping with the writing and more broadly contributing to getting this published on the PyTorch blog.

Developing industrial use cases for physical simulation on future error-corrected quantum computers

If you’ve paid attention to the quantum computing space, you’ve heard the claim that in the future, quantum computers will solve certain problems exponentially more efficiently than classical computers can. They have the potential to transform many industries, from pharmaceuticals to energy.

For the most part, these claims have rested on arguments about the asymptotic scaling of algorithms as the problem size approaches infinity, but this tells us very little about the practical performance of quantum computers for finite-sized problems. We want to be more concrete: Exactly which problems are quantum computers more suited to tackle than their classical counterparts, and exactly what quantum algorithms could we run to solve these problems? Once we’ve designed an algorithm, we can go beyond analysis based on asymptotic scaling — we can determine the actual resources required to compile and run the algorithm on a quantum computer, and how that compares to a classical computation.

Over the last few years, Google Quantum AI has collaborated with industry and academic partners to assess the prospects for quantum simulation to revolutionize specific technologies and perform concrete analyses of the resource requirements. In 2022, we developed quantum algorithms to analyze the chemistry of an important enzyme family called cytochrome P450. Then, in our paper released this fall, we demonstrated how to use a quantum computer to study sustainable alternatives to cobalt for use in lithium ion batteries. And most recently, as we report in a preprint titled “Quantum computation of stopping power for inertial fusion target design,” we’ve found a new application in modeling the properties of materials in inertial confinement fusion experiments, such as those at the National Ignition Facility (NIF) at Lawrence Livermore National Laboratory, which recently made headlines for a breakthrough in nuclear fusion.

Below, we describe these three industrially relevant applications for simulations with quantum computers. While running the algorithms will require an error-corrected quantum computer, which is still years away, working on this now will ensure that we are ready with efficient quantum algorithms when such a quantum computer is built. Already, our work has reduced the cost of compiling and running the algorithms significantly, as we have reported in the past. Our work is essential for demonstrating the potential of quantum computing, but it also provides our hardware team with target specifications for the number of qubits and time needed to run useful quantum algorithms in the future.

Application 1: The CYP450 mechanism

The pharmaceutical industry is often touted as a field ripe for discovery using quantum computers. But concrete examples of such potential applications are few and far between. Working with collaborators at the pharmaceutical company Boehringer Ingelheim, our partners at the startup QSimulate, and academic colleagues at Columbia University, we explored one example in the 2022 PNAS article, “Reliably assessing the electronic structure of cytochrome P450 on today’s classical computers and tomorrow’s quantum computers”.

Cytochrome P450 is an enzyme family naturally found in humans that helps us metabolize drugs. It excels at its job: more than 70% of all drug metabolism is performed by enzymes of the P450 family. The enzymes work by oxidizing the drug — a process that depends on complex correlations between electrons. The details of the interactions are too complicated for scientists to know a priori how effective the enzyme will be on a particular drug.

In the paper, we showed how a quantum computer could approach this problem. The CYP450 metabolic process is a complex chain of reactions with many intermediate changes in the electronic structure of the enzymes throughout. We first use state-of-the-art classical methods to determine the resources required to simulate this problem on a classical computer. Then we imagine implementing a phase-estimation algorithm — which is needed to compute the ground-state energies of the relevant electronic configurations throughout the reaction chain — on a surface-code error-corrected quantum computer.

With a quantum computer, we could follow the chain of changing electronic structure with greater accuracy and fewer resources. In fact, we find that the higher accuracy offered by a quantum computer is needed to correctly resolve the chemistry in this system, so not only will a quantum computer be better, it will be necessary. And as the system size gets bigger, i.e., the more quantum energy levels we include in the simulation, the more the quantum computer wins over the classical computer. Ultimately, we show that a few million physical qubits would be required to reach quantum advantage for this problem.

Left: Example of an electron orbital (red and blue) of a CYP enzyme. More than 60 such orbitals are required to model the CYP system. Right: Comparison of actual runtime (CPU) of various classical techniques (blue) to hypothetical runtime (QPU) of a quantum algorithm (green). The lower slope of the quantum algorithm demonstrates the favorable asymptotic scaling over classical methods. Already at about 20-30 orbitals, we see a crossover to the regime where a quantum algorithm would be more efficient than classical methods.

Application 2: Lithium-ion batteries

Lithium-ion batteries rely on the electrochemical potential difference between two lithium containing materials. One material used today for the cathodes of Li-ion batteries is LiCoO2. Unfortunately, it has drawbacks from a manufacturing perspective. Cobalt mining is expensive, destructive to the environment, and often utilizes unsafe or abusive labor practices. Consequently, many in the field are interested in alternatives to cobalt for lithium-ion cathodes.

In the 1990s, researchers discovered that nickel could replace cobalt to form LiNiO2 (called “lithium nickel oxide” or “LNO”) for cathodes. While pure LNO was found to be unstable in production, many cathode materials used in the automotive industry today use a high fraction of nickel and hence resemble LNO. Despite its industrial applications, however, not all of the chemical properties of LNO are understood; even the properties of its ground state remain a subject of debate.

In our recent paper, “Fault tolerant quantum simulation of materials using Bloch orbitals,” we worked with the chemical company BASF, the molecular modeling startup QSimulate, and collaborators at Macquarie University in Australia to develop techniques to perform quantum simulations on systems with periodic, regularly spaced atomic structure, such as LNO. We then applied these techniques to design algorithms to study the relative energies of a few different candidate structures of LNO. With classical computers, high-accuracy simulations of the quantum wavefunction are considered too expensive to perform. In our work, we found that a quantum computer would need tens of millions of physical qubits to calculate the energies of each of the four candidate ground-state LNO structures. This is out of reach of the first error-corrected quantum computers, but we expect this number to come down with future algorithmic improvements.

Four candidate structures of LNO. In the paper, we consider the resources required to compare the energies of these structures in order to find the ground state of LNO.

Application 3: Fusion reactor dynamics

In our third and most recent example, we collaborated with theorists at Sandia National Laboratories and our Macquarie University collaborators to put our hypothetical quantum computer to the task of simulating dynamics of charged particles in the extreme conditions typical of inertial confinement fusion (ICF) experiments, like those at the National Ignition Facility. In those experiments, high-intensity lasers are focused into a metallic cavity (hohlraum) that holds a target capsule consisting of an ablator surrounding deuterium–tritium fuel. When the lasers heat the inside of the hohlraum, its walls radiate x-rays that compress the capsule, heating the deuterium and tritium inside to 10s of millions of Kelvin. This allows the deuterium and tritium nuclei in the fuel to overcome their mutual electrostatic repulsion and start fusing into helium nuclei, also called alpha particles.

Simulations of these experiments are computationally demanding and rely on models of material properties that are themselves uncertain. Even testing these models, using methods similar to those in quantum chemistry, is extremely computationally expensive. In some cases, such test calculations have consumed >100 million CPU hours. One of the most expensive and least accurate aspects of the simulation is the dynamics of the plasma prior to the sustained fusion stage (>10s of millions of Kelvin), when parts of the capsule and fuel are a more balmy 100k Kelvin. In this “warm dense matter” regime, quantum correlations play a larger role in the behavior of the system than in the “hot dense matter” regime when sustained fusion takes place.

In our new preprint, “Quantum computation of stopping power for inertial fusion target design”, we present a quantum algorithm to compute the so-called “stopping power” of the warm dense matter in a nuclear fusion experiment. The stopping power is the rate at which a high energy alpha particle slows down due to Coulomb interactions with the surrounding plasma. Understanding the stopping power of the system is vital for optimizing the efficiency of the reactor. As the alpha particle is slowed by the plasma around it, it transfers its energy to the plasma, heating it up. This self-heating process is the mechanism by which fusion reactions sustain the burning plasma. Detailed modeling of this process will help inform future reactor designs.
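
For reference, using the standard textbook definition (our notation, not taken from the preprint), the stopping power is the energy the projectile loses per unit path length,

\[
S(E) = -\frac{dE}{dx},
\]

which in a simulation can be estimated from the average slope of the projectile's energy transfer to the plasma along its trajectory, as depicted in the figure below.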

We estimate that the quantum algorithm needed to calculate the stopping power would require resources somewhere between the P450 application and the battery application. But since this is the first case study on first-principles dynamics (or any application at finite temperature), such estimates are just a starting point and we again expect to find algorithmic improvements to bring this cost down in the future. Despite this uncertainty, it is still certainly better than the classical alternative, for which the only tractable approaches for these simulations are mean-field methods. While these methods incur unknown systematic errors when describing the physics of these systems, they are currently the only meaningful means of performing such simulations.

Left: A projectile (red) passing through a medium (blue) with initial velocity v_proj. Right: To calculate the stopping power, we monitor the energy transfer between the projectile and the medium (blue solid line) and determine its average slope (red dashed line).

Discussion and conclusion

The examples described above are just three of a large and growing body of concrete applications for a future error-corrected quantum computer in simulating physical systems. This line of research helps us understand the classes of problems that will most benefit from the power of quantum computing. In particular, the last example is distinct from the other two in that it simulates a dynamical system. In contrast to the other problems, which focus on finding the lowest-energy, static ground state of a quantum system, quantum dynamics is concerned with how a quantum system changes over time. Since quantum computers are inherently dynamic — the qubit states evolve and change as each operation is performed — they are particularly well suited to solving these kinds of problems. Together with collaborators at Columbia, Harvard, Sandia National Laboratories, and Macquarie University in Australia, we recently published a paper in Nature Communications demonstrating that quantum algorithms for simulating electron dynamics can be more efficient even than approximate, “mean-field” classical calculations, while simultaneously offering much higher accuracy.

Developing and improving algorithms today prepares us to take full advantage of them when an error-corrected quantum computer is eventually realized. Just as in the classical computing case, we expect improvements at every level of the quantum computing stack to further lower the resource requirements. But this first step helps separate hyperbole from genuine applications amenable to quantum computational speedups.

Acknowledgements


We would like to thank Katie McCormick, our Quantum Science Communicator, for helping to write this blog post.

UK Tech Festival Showcases Startups Using AI for Creative Industries

At one of the U.K.’s largest technology festivals, top enterprises and startups are this week highlighting their latest innovations, hosting workshops and celebrating the growing tech ecosystem based in the country’s southwest.

The Bristol Technology Festival today showcased the work of nine startups that recently participated in a challenge hosted by Digital Catapult — the U.K. authority on advanced digital technology — in collaboration with NVIDIA.

The challenge, which ran for four months, supported companies in developing a prototype or extending an innovation that could transform experiences using reality capture, real-time collaboration and creation, or cross-platform content delivery.

It’s part of MyWorld, an initiative for pioneering creative technology focused on the western U.K.

Each selected startup was given £50,000 to help develop projects that foster the advancement of generative AI, digital twins and other groundbreaking technologies for use in creative industries.

Lux Aeterna Explores Generative AI for Visual Effects

Emmy Award-winning independent visual effects studio Lux Aeterna — which is using gen AI and neural networks for VFX production — deployed its funds to develop a generative AI-powered text-to-image toolkit for creating maps, or 2D images used to represent aspects of a scene, object or effect.

At the Bristol Technology Festival, Lux Aeterna demonstrated this technology, powered by NVIDIA RTX 40 Series GPUs, with a focus on its ability to generate parallax occlusion maps, a method of creating the effect of depth for 3D textured surfaces.

“Our goal is to tackle the unique VFX challenges with bespoke AI-assisted solutions, and to put these tools of the future into the hands of our talented artists,” said James Pollock, creative technologist at Lux Aeterna. “NVIDIA’s insightful feedback on our work as a part of the MyWorld challenge has been invaluable in informing our strategy toward innovation in this rapidly changing space.”

Meaning Machine Brings AI to Game Characters, Dialogue

Meaning Machine, a studio pioneering gameplay that uses natural language AI, used its funds from the challenge to develop a generative AI system for in-game characters and dialogue. Its Game Consciousness technology enables in-game characters to accurately talk about their world, in real time, so that every line of dialogue reflects the game developer’s creative vision.

Meaning Machine’s demo at today’s showcase invited attendees to experience its interrogation game, “Dead Meat,” in which players must chat with an in-game character — a murder suspect — with the aim of manipulating them into giving a confession.

A member of the NVIDIA Inception program for cutting-edge startups, Meaning Machine powers its generative AI technology for game development using the NVIDIA NeMo framework for building, customizing and deploying large language models.

“NVIDIA NeMo enables us to deliver scalable model tuning and inference,” said Ben Ackland, cofounder and chief technology officer at Meaning Machine. “We see potential for Game Consciousness to transform blockbuster games — delivering next-gen characters that feel at home in bigger, deeper, more complex virtual worlds — and our collaboration with NVIDIA will help us make this a reality sooner.”

More Startups Showcase AI for Creative Industries

Additional challenge participants that hosted demos today at the Bristol Technology Festival include:

  • Black Laboratory, an NVIDIA Inception member demonstrating a live puppet-performance capture system, puppix, that can seamlessly transfer the physicality of puppets to digital characters.
  • IMPRESS, which is developing an AI-powered launchpad for self-publishing indie video games. It offers data-driven market research for game development, marketing campaign support, press engagement tools and more.
  • Larkhall, which is expanding Otto, its AI system that generates live, reactive visuals based on musical performances, as well as automatic, expressive captioning for speech-based performances.
  • Motion Impossible, which is building a software platform for centralized control of its AGITO systems — free-roaming, modular, camera dolly systems for filmmaking.
  • Zubr and Uninvited Guests, two companies collaborating on the development of augmented- and virtual-reality tools for designing futuristic urban environments.

“NVIDIA’s involvement in the MyWorld challenge, led by Digital Catapult, has created extraordinary value for the participating teams,” said Sarah Addezio, senior innovation partner and MyWorld program lead at Digital Catapult. “We’ve seen the benefit of our cohort having access to industry-leading technical and business-development expertise, elevating their projects in ways that would not have been possible otherwise.”

Learn more about NVIDIA Inception and NVIDIA generative AI technologies.

Learning to learn learning-rate schedules

In a series of papers, Amazon researchers performed a theoretical analysis of a simplified problem that led to a learnable learning-rate scheduler, applied that scheduler to a more complex neural model, and distilled the results into a practical algorithm.

Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets

These research papers were presented at the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC 2023), a premier forum for design, theory, and application of computing technologies for programming, modelling, and communication.

Large language models (LLMs) have revolutionized the way novice programmers and everyday computer users tap into the capabilities of natural language for programming. Among the tools used in this context, spreadsheets stand out as the preferred choice. The integration of LLMs into spreadsheets promises to substantially enhance their functionality and the user experience. At the same time, it’s well known that spreadsheet users commonly, though inadvertently, introduce errors, and this can carry significant risks. For example, in 2010, a spreadsheet used in a Harvard economic analysis to inform austerity measures imposed on Greece was discovered to contain multiple errors.

Microsoft is actively pursuing research focused on developing co-auditing tools and techniques, with an initial emphasis on spreadsheets. These tools are designed to help users verify the results generated by LLMs. At VL/HCC 2023, we introduce two new spreadsheet tools, ColDeco and FxD, specifically built to help users thoroughly examine and debug their programs within spreadsheets. The paper on FxD received an Honorable Mention award.

ColDeco: An end-user inspection tool

Working with tables in spreadsheets is a common task, and the ability to add a calculated column can be incredibly useful. A calculated column not only adds information but also facilitates tasks like filtering and sorting. Generative AI can enable users to create sophisticated calculated columns in tables. However, verification of AI-generated code in this scenario is crucial because AI can misinterpret the user’s intent or overlook important data. 

In our paper, “ColDeco: An End User Spreadsheet Inspection Tool for AI-Generated Code,” we introduce ColDeco, a no-code inspection tool for calculated columns. ColDeco uses helper columns and row grouping to help users understand how an AI-generated column works and locate any errors. 

To describe how ColDeco works, we’ll use an example table containing people’s first, middle, and last names in separate columns. Our user asks the system to “create a column called ‘Abbreviation’ that takes the first letter of each part of the name.” In this example, there’s an error in the generated code that fails to handle rows with no middle names, causing some Abbreviation cells to be empty.  

First, the model generates a program that computes an abbreviation for each row and adds it to the new Abbreviation column. ColDeco’s interface automatically opens as a side panel, as shown in Figure 1. 

The Inspect Columns view displays any generated columns, accompanied by a natural language description of the generated code. The Inspect Rows view displays a subset of the table, organized by behavior: it uses dataflow analysis to group rows, highlighting key distinct execution behaviors. In our example, this view quickly draws the user’s attention to the two rows that fail to calculate an abbreviation.

[Figure: the example table (First Name, Middle Name, Last Name, DoB, Abbreviation) shown alongside the ColDeco side panel, with the Inspect Columns view (1a) listing the generated Abbreviation column and the Inspect Rows view (1b) grouping rows by behavior; rows without a middle name have an empty Abbreviation cell.]
Figure 1. The initial view of the ColDeco side panel. An Abbreviation program is generated by the AI and added to the table as a new column. The Inspect Columns view (1a) shows the column generated by the AI, including a description of how the code works. The Inspect Rows view (1b) groups rows into different behaviors, indicating that there are errors in two rows.

If our user wants to investigate an error, they can expand a generated column into multiple helper columns, illustrated in Figure 2. These helper columns are visible in both the table (2a) and the side panel (2b), and they show the intermediate values. The user can now see that the missing abbreviations are caused by an error that occurred when the system tried to take the first and middle initials.

[Figure: the table with two helper columns, “text concatenation” and “1st letter of Last Name”, shown in the grid (2a) and as a tree under Abbreviation in the side panel (2b); the “text concatenation” helper is empty for the row with no middle name.]
Figure 2. The ColDeco side panel after a user expands the Abbreviation column into two additional helper columns. Each additional column has a description.

FxD: A functional debugger 

Not every spreadsheet task involves generating a new table column. Moreover, many users are already well acquainted with spreadsheet formulas. This brings us to our second tool, a spreadsheet formula debugger, introduced in the paper, “FxD: a functional debugger for dysfunctional spreadsheets.” 

We employed a user-centered approach when designing FxD, extensively reviewing existing literature on functional programming debuggers. This informed the four key features we implemented in FxD:

Live debugging. FxD dynamically updates as a user edits a formula, allowing for quick formula modification and exploration (Figure 3, image 1).

Hybrid formula tracing. The debugger combines step-based evaluation (Figure 3, image 1) with tree-based derivations (Figure 3, image 3) to provide a step-by-step breakdown of the formula. Substeps are hidden behind expandable cards to prevent user overload.  

Subformula coloring. Color coding highlights changes in a formula as FxD evaluates it. This facilitates the tracking of these updates when a user hovers over a step (Figure 3, images 2 and 4). 

Information inspector. Context-aware tooltips improve the user experience. One example is table previews when a user hovers over ranges in functions like VLOOKUP. These tooltips offer insights into the range, surrounding context, and the lookup column used by the containing function (Figure 3, image 3).

[Figure: four panels of FxD debugging the formula =IF(G3 < (B1 + B2) * (1 + B3), “low”, “high”): (1) the step-by-step evaluation trace ending in “low”; (2) subformula coloring linking evaluated parts of a step’s before and after formulas; (3) an information inspector previewing the grid for a hovered cell range; (4) hovering over the value 15 underlines the subformula B1 + B2 that produced it.]
Figure 3. The FxD debugger. Image 1 shows the edited formula and evaluation steps. The steps update as a user edits the formula. Image 2 shows subformula coloring, which highlights a subformula and its value upon hovering. Image 3 shows an information inspector that previews the range referenced in a formula. Image 4 shows the concurrent evaluation of multiple subformulas. When the user hovers over a value, the corresponding subformula is underlined.

Growing importance of AI code verification 

As the complexity of AI-generated code rises, the need for tools to verify accuracy becomes increasingly critical. In response, we developed these two co-audit tools tailored to spreadsheets. Moving forward, a key consideration lies in managing the complexity of these tools. Our vision is that debugging tools will become infused with generative AI to assist users in both generating and verifying workflows. 

Review our paper on co-auditing in general to learn more.
