A new quantum algorithm for classical mechanics with an exponential speedup

Quantum computers promise to solve some problems exponentially faster than classical computers, but there are only a handful of examples with such a dramatic speedup, such as Shor’s factoring algorithm and quantum simulation. Of those few examples, most involve simulating physical systems that are inherently quantum mechanical — a natural application for quantum computers. But what about simulating systems that are not inherently quantum? Can quantum computers offer an exponential advantage here?

In “Exponential quantum speedup in simulating coupled classical oscillators”, published in Physical Review X (PRX) and presented at the Symposium on Foundations of Computer Science (FOCS 2023), we report on the discovery of a new quantum algorithm that offers an exponential advantage for simulating coupled classical harmonic oscillators. These are some of the most fundamental, ubiquitous systems in nature and can describe the physics of countless natural systems, from electrical circuits to molecular vibrations to the mechanics of bridges. In collaboration with Dominic Berry of Macquarie University and Nathan Wiebe of the University of Toronto, we found a mapping that can transform any system involving coupled oscillators into a problem describing the time evolution of a quantum system. Given certain constraints, this problem can be solved with a quantum computer exponentially faster than it can with a classical computer. Further, we use this mapping to prove that any problem efficiently solvable by a quantum algorithm can be recast as a problem involving a network of coupled oscillators, albeit exponentially many of them. In addition to unlocking previously unknown applications of quantum computers, this result provides a new way of designing quantum algorithms by reasoning purely about classical systems.

Simulating coupled oscillators

The systems we consider consist of classical harmonic oscillators. An example of a single harmonic oscillator is a mass (such as a ball) attached to a spring. If you displace the mass from its rest position, then the spring will induce a restoring force, pushing or pulling the mass in the opposite direction. This restoring force causes the mass to oscillate back and forth.

A simple example of a harmonic oscillator is a mass connected to a wall by a spring. [Image Source: Wikimedia]
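For reference, the textbook equation of motion for a single mass m on a spring of stiffness k, and its solution, are:

$$m\,\ddot{x}(t) = -k\,x(t) \;\Longrightarrow\; x(t) = A\cos(\omega t + \phi), \qquad \omega = \sqrt{k/m},$$

where the amplitude A and phase φ are set by the initial displacement and velocity, and ω is the oscillation frequency.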

Now consider coupled harmonic oscillators, where multiple masses are attached to one another through springs. Displace one mass, and it will induce a wave of oscillations to pulse through the system. As one might expect, simulating the oscillations of a large number of masses on a classical computer gets increasingly difficult.

An example system of masses connected by springs that can be simulated with the quantum algorithm.

To enable the simulation of a large number of coupled harmonic oscillators, we came up with a mapping that encodes the positions and velocities of all masses and springs into the quantum wavefunction of a system of qubits. Since the number of parameters describing the wavefunction of a system of qubits grows exponentially with the number of qubits, we can encode the information of N balls into a quantum mechanical system of only about log(N) qubits. As long as there is a compact description of the system (i.e., the properties of the masses and the springs), we can evolve the wavefunction to learn the coordinates of the balls and springs at a later time with far fewer resources than if we had used a naïve classical approach to simulate the balls and springs.
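A schematic version of the mapping (our simplified sketch with unit masses and normalization omitted; the full construction in the paper handles arbitrary masses and spring constants): writing the matrix F of spring couplings as F = BB†, the velocities and positions are stacked into a single quantum state whose evolution obeys a Schrödinger equation,

$$|\psi(t)\rangle \,\propto\, \begin{pmatrix}\dot{\mathbf{x}}(t)\\ i\,B^{\dagger}\mathbf{x}(t)\end{pmatrix},\qquad i\,\frac{d}{dt}|\psi(t)\rangle = H\,|\psi(t)\rangle,\qquad H = -\begin{pmatrix}0 & B\\ B^{\dagger} & 0\end{pmatrix}.$$

Because H is Hermitian, this evolution is unitary, and the squared norm of the state is proportional to the total energy of the oscillators, which is conserved.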

We showed that a certain class of coupled classical-oscillator systems can be efficiently simulated on a quantum computer. But this alone does not rule out the possibility that some as-yet-unknown clever classical algorithm is similarly efficient in its use of resources. To show that our quantum algorithm achieves an exponential speedup over any possible classical algorithm, we provide two additional pieces of evidence.

The glued-trees problem and the quantum oracle

For the first piece of evidence, we use our mapping to show that the quantum algorithm can efficiently solve a famous graph problem known to be difficult to solve classically, called the glued-trees problem. The problem takes two binary trees — graphs in which each node branches into two more nodes, resembling the branching paths of a tree — and glues their outermost branches together through a random set of edges, as shown in the figure below.

A visual representation of the glued trees problem. Here we start at the node labeled ENTRANCE and are allowed to locally explore the graph, which is obtained by randomly gluing together two binary trees. The goal is to find the node labeled EXIT.

The goal of the glued-trees problem is to find the exit node — the “root” of the second tree — as efficiently as possible. But the exact configuration of the nodes and edges of the glued trees is initially hidden from us. To learn about the system, we must query an oracle, which can answer specific questions about the setup. This oracle allows us to explore the trees, but only locally. Decades ago, it was shown that the number of queries required to find the exit node on a classical computer grows polynomially with N, the total number of nodes.

By recasting this as a problem with balls and springs, we can imagine each node as a ball and each connection between two nodes as a spring. Pluck the entrance node (the root of the first tree), and the oscillations will pulse through the trees. It only takes a time that scales with the depth of the tree — which is exponentially smaller than N — for the wave to reach the exit node. So, by mapping the glued-trees ball-and-spring system to a quantum system and evolving it for that time, we can detect the vibrations of the exit node and find it exponentially faster than we could with a classical computer.
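To build intuition, here is a small classical simulation of the ball-and-spring dynamics (an illustrative sketch of ours, not the quantum algorithm: it stores all N coordinates explicitly, which is exactly the exponential cost the quantum encoding avoids). The gluing below is a simplified random two-matching of the leaves rather than the random cycle used in the formal problem:

```python
import numpy as np

def glued_trees_adjacency(depth, seed=0):
    """Two complete binary trees of the given depth with their leaves
    glued by a simplified random two-matching (illustrative only)."""
    per_tree = 2 ** (depth + 1) - 1
    A = np.zeros((2 * per_tree, 2 * per_tree))
    def add(i, j):
        A[i, j] = A[j, i] = 1.0
    for root in (0, per_tree):                  # heap-indexed trees
        for i in range(2 ** depth - 1):
            add(root + i, root + 2 * i + 1)
            add(root + i, root + 2 * i + 2)
    leaves1 = list(range(2 ** depth - 1, per_tree))
    leaves2 = [per_tree + l for l in leaves1]
    perm = np.random.default_rng(seed).permutation(len(leaves2))
    for k, l1 in enumerate(leaves1):            # glue the two trees
        add(l1, leaves2[perm[k]])
        add(l1, leaves2[perm[(k + 1) % len(perm)]])
    return A

A = glued_trees_adjacency(depth=4)
K = np.diag(A.sum(axis=1)) - A                  # unit masses, unit springs
n = K.shape[0]
entrance, exit_node = 0, n // 2                 # roots of the two trees
x, v = np.zeros(n), np.zeros(n)
x[entrance] = 1.0                               # pluck the entrance

dt = 0.01
for step in range(20000):                       # leapfrog integration of x'' = -Kx
    v -= 0.5 * dt * (K @ x)
    x += dt * v
    v -= 0.5 * dt * (K @ x)
    if abs(x[exit_node]) > 0.02:                # exit node starts vibrating
        print(f"exit node vibrates at t ≈ {step * dt:.2f}")
        break
```

The wall-clock cost of this classical simulation still grows with the number of nodes; the point of the quantum algorithm is that the same wave dynamics can be evolved in a state space of only ~log(N) qubits.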

BQP-completeness

The second and strongest piece of evidence that our algorithm is exponentially more efficient than any possible classical algorithm is revealed by examination of the set of problems a quantum computer can solve efficiently (i.e., solvable in polynomial time), referred to as bounded-error quantum polynomial time or BQP. The hardest problems in BQP are called “BQP-complete”.

While it is generally accepted that there exist some problems that a quantum algorithm can solve efficiently and a classical algorithm cannot, this has not yet been proven. So, the best evidence we can provide is that our problem is BQP-complete, that is, it is among the hardest problems in BQP. If someone were to find an efficient classical algorithm for solving our problem, then every problem solved by a quantum computer efficiently would be classically solvable! Not even the factoring problem (finding the prime factors of a given large number), which forms the basis of modern encryption and was famously solved by Shor’s algorithm, is expected to be BQP-complete.

A diagram showing the believed relationships of the classes BPP and BQP, which are the set of problems that can be efficiently solved on a classical computer and quantum computer, respectively. BQP-complete problems are the hardest problems in BQP.

To show that our problem of simulating balls and springs is indeed BQP-complete, we start with a standard BQP-complete problem of simulating universal quantum circuits, and show that every quantum circuit can be expressed as a system of many balls coupled with springs. Therefore, our problem is also BQP-complete.

Implications and future work

This effort also sheds light on work from 2002, when theoretical computer scientist Lov K. Grover and his colleague, Anirvan M. Sengupta, used an analogy to coupled pendulums to illustrate how Grover’s famous quantum search algorithm could find the correct element in an unsorted database quadratically faster than could be done classically. With the proper setup and initial conditions, it would be possible to tell whether one of N pendulums was different from the others — the analogue of finding the correct element in a database — after the system had evolved for a time of only ~√(N). While this hints at a connection between certain classical oscillating systems and quantum algorithms, it falls short of explaining why Grover’s quantum algorithm achieves a quantum advantage.

Our results make that connection precise. We showed that the dynamics of any classical system of harmonic oscillators can indeed be equivalently understood as the dynamics of a corresponding quantum system of exponentially smaller size. In this way we can simulate Grover and Sengupta’s system of pendulums on a quantum computer of log(N) qubits and obtain a different quantum algorithm that finds the correct element in time ~√(N). The analogy we discovered between classical and quantum systems can be used to construct other quantum algorithms offering exponential speedups, where the reason for the speedups is now more evident from the way that classical waves propagate.

Our work also reveals that every quantum algorithm can be equivalently understood as the propagation of a classical wave in a system of coupled oscillators. This would imply that, for example, we can in principle build a classical system that solves the factoring problem after it has evolved for time that is exponentially smaller than the runtime of any known classical algorithm that solves factoring. This may look like an efficient classical algorithm for factoring, but the catch is that the number of oscillators is exponentially large, making it an impractical way to solve factoring.

Coupled harmonic oscillators are ubiquitous in nature, describing a broad range of systems from electrical circuits to chains of molecules to structures such as bridges. While our work here focuses on the fundamental complexity of this broad class of problems, we expect that it will guide us in searching for real-world examples of harmonic oscillator problems in which a quantum computer could offer an exponential advantage.

Acknowledgements

We would like to thank our Quantum Computing Science Communicator, Katie McCormick, for helping to write this blog post.

Bringing Personality to Pixels, Inworld Levels Up Game Characters Using Generative AI

To enhance the gaming experience, studios and developers spend tremendous effort creating photorealistic, immersive in-game environments.

But non-playable characters (NPCs) often get left behind. Many behave in ways that lack depth and realism, making their interactions repetitive and forgettable.

Inworld AI is changing the game by using generative AI to drive NPC behaviors that are dynamic and responsive to player actions. The Mountain View, Calif.-based startup’s Character Engine, which can be used with any character design, is helping studios and developers enhance gameplay and improve player engagement.

Elevate Gaming Experiences: Achievement Unlocked

The Inworld team aims to develop AI-powered NPCs that can learn, adapt and build relationships with players while delivering high-quality performance and maintaining in-game immersion.

To make it easier for developers to integrate AI-based NPCs into their games, Inworld built Character Engine, which uses generative AI running on NVIDIA technology to create immersive, interactive characters. It’s built to be production-ready, scalable and optimized for real-time experiences.

The Character Engine comprises three layers: Character Brain, Contextual Mesh and Real-Time AI.

Character Brain orchestrates a character’s performance by synchronizing the multiple machine learning models that make up its personality, such as models for text-to-speech, automatic speech recognition, emotions, gestures and animations.

The layer also enables AI-based NPCs to learn and adapt, navigate relationships and perform motivated actions. For example, users can create triggers using the “Goals and Action” feature to program NPCs to behave in a certain way in response to a given player input.

Contextual Mesh allows developers to set parameters for content and safety mechanisms, custom knowledge and narrative controls. Game developers can use the “Relationships” feature to create emergent narratives, such that an ally can turn into an enemy or vice versa based on how players treat an NPC.

One big challenge developers face when using generative AI is keeping NPCs in-world and on-message. Inworld’s Contextual Mesh layer helps overcome this hurdle by rendering characters within the logic and fantasy of their worlds, effectively avoiding the hallucinations that commonly appear when using large language models (LLMs).

The Real-Time AI layer ensures optimal performance and scalability for real-time experiences.

Powering Up AI Workflows With NVIDIA 

Inworld, a member of the NVIDIA Inception program, which supports startups through every stage of their development, uses NVIDIA A100 Tensor Core GPUs and NVIDIA Triton Inference Server as integral parts of its generative AI training and deployment infrastructure.

Inworld used the open-source NVIDIA Triton Inference Server software to standardize other non-generative machine learning model deployments required to power Character Brain features, such as emotions. The startup also plans to use the open-source NVIDIA TensorRT-LLM library to optimize inference performance. Both NVIDIA Triton Inference Server and TensorRT-LLM are available with the NVIDIA AI Enterprise software platform, which provides security, stability and support for production AI.

Inworld also used NVIDIA A100 GPUs within Slurm-managed bare-metal machines for its production training pipelines. Similar machines wrapped in Kubernetes help manage character interactions during gameplay. This setup delivers real-time generative AI at the lowest possible cost.

“We chose to use NVIDIA A100 GPUs because they provided the best, most cost-efficient option for our machine learning workloads compared to other solutions,” said Igor Poletaev, vice president of AI at Inworld.

“Our customers and partners are looking to find novel and innovative ways to drive player engagement metrics by integrating AI NPC functionalities into their gameplay,” said Poletaev. “There’s no way to achieve real-time performance without hardware accelerators, which is why we required GPUs to be integrated into our backend architecture from the very beginning.”

Inworld’s generative AI-powered NPCs have enabled dynamic, evergreen gaming experiences that keep players coming back. Developers and gamers alike have reported enhanced player engagement, satisfaction and retention.

Inworld has powered AI-based NPC experiences from Niantic, LG UPlus, Alpine Electronics and more. One open-world virtual reality game using the Inworld Character Engine saw a 5% increase in playtime, while a detective-themed indie game garnered over $300,000 in free publicity after some of the most popular Twitch streamers discovered it.

Learn more about Inworld AI and NVIDIA technologies for game developers.

Summary report optimization in the Privacy Sandbox Attribution Reporting API

In recent years, the Privacy Sandbox initiative was launched to explore responsible ways for advertisers to measure the effectiveness of their campaigns while aiming to deprecate third-party cookies (subject to resolving any competition concerns with the UK’s Competition and Markets Authority). Cookies are small pieces of data containing user preferences that websites store on a user’s device; they can be used to provide a better browsing experience (e.g., allowing users to automatically sign in) and to serve relevant content or ads. The Privacy Sandbox attempts to address concerns around the use of cookies for tracking browsing data across the web by providing a privacy-preserving alternative.

Many browsers use differential privacy (DP) to provide privacy-preserving APIs, such as the Attribution Reporting API (ARA), that don’t rely on cookies for ad conversion measurement. ARA encrypts individual user actions and collects them in an aggregated summary report, which estimates measurement goals like the number and value of conversions (useful actions on a website, such as making a purchase or signing up for a mailing list) attributed to ad campaigns.

The task of configuring API parameters, e.g., allocating a contribution budget across different conversions, is important for maximizing the utility of the summary reports. In “Summary Report Optimization in the Privacy Sandbox Attribution Reporting API”, we introduce a formal mathematical framework for modeling summary reports. Then, we formulate the problem of maximizing the utility of summary reports as an optimization problem to obtain the optimal ARA parameters. Finally, we evaluate the method using real and synthetic datasets, and demonstrate significantly improved utility compared to baseline non-optimized summary reports.

ARA summary reports

We use the following example to illustrate our notation. Imagine a fictional gift shop called Du & Penc that uses digital advertising to reach its customers. The table below captures their holiday sales, where each record contains impression features with (i) an impression ID, (ii) the campaign, and (iii) the city in which the ad was shown, as well as conversion features with (i) the number of items purchased and (ii) the total dollar value of those items.

Impression and conversion feature logs for Du & Penc.

Mathematical model

ARA summary reports can be modeled by four algorithms: (1) Contribution Vector, (2) Contribution Bounding, (3) Summary Reports, and (4) Reconstruct Values. Contribution Bounding and Summary Reports are performed by the ARA, while Contribution Vector and Reconstruct Values are performed by an AdTech provider — tools and systems that enable businesses to buy and sell digital advertising. The objective of this work is to assist AdTechs in optimizing summary report algorithms.

The Contribution Vector algorithm converts measurements into an ARA format that is discretized and scaled. Scaling needs to account for the overall contribution limit per impression. Here we propose a method that clips and performs randomized rounding. The outcome of the algorithm is a histogram of aggregatable keys and values.
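As a minimal sketch of such a clip-and-round step (our illustration; the function name, grid size, and cap below are hypothetical, not part of the API):

```python
import numpy as np

def contribution_entry(value, cap, num_units, rng):
    """Clip a raw conversion value at `cap`, rescale it onto a grid of
    `num_units` discrete contribution units, and round randomly so that
    the rounding is unbiased in expectation (for values below the cap)."""
    scaled = min(value, cap) / cap * num_units
    floor = int(np.floor(scaled))
    round_up = rng.random() < (scaled - floor)  # probability = fractional part
    return floor + int(round_up)

rng = np.random.default_rng(0)
print(contribution_entry(value=3.2, cap=10.0, num_units=16, rng=rng))
```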

Next, the Contribution Bounding algorithm runs on client devices and enforces the contribution bound on attributed reports where any further contributions exceeding the limit are dropped. The output is a histogram of attributed conversions.

The Summary Reports algorithm runs on the server side inside a trusted execution environment and returns noisy aggregate results that satisfy DP. Noise is sampled from the discrete Laplace distribution, and to enforce privacy budgeting, a report may be queried only once.
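For illustration, discrete Laplace noise with Pr[k] ∝ exp(−|k|/s) can be sampled as the difference of two i.i.d. geometric random variables, a standard identity; the sketch below is ours, not the API’s internal implementation:

```python
import numpy as np

def discrete_laplace(scale, size, rng):
    """Pr[k] ∝ exp(-|k|/scale): the difference of two i.i.d. geometric
    variables with success probability 1 - exp(-1/scale)."""
    p = 1.0 - np.exp(-1.0 / scale)
    return rng.geometric(p, size) - rng.geometric(p, size)

rng = np.random.default_rng(0)
histogram = np.array([12, 0, 7, 3])             # attributed contributions
noisy_report = histogram + discrete_laplace(scale=2.0, size=4, rng=rng)
```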

Finally, the Reconstruct Values algorithm converts measurements back to the original scale. Reconstruct Values and Contribution Vector Algorithms are designed by the AdTech, and both impact the utility received from the summary report.

Illustrative usage of ARA summary reports, which include Contribution Vector (Algorithm A), Contribution Bounding (Algorithm C), Summary Reports (Algorithm S), and Reconstruct Values (Algorithm R). Algorithms C and S are fixed in the API. The AdTech designs A and R.

Error metrics

There are several factors to consider when selecting an error metric for evaluating the quality of an approximation. We sought a metric whose properties also make it suitable as an optimization objective, and chose the 𝜏-truncated root mean square relative error (RMSRE𝜏). See the paper for a detailed discussion and comparison to other possible metrics.
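Schematically, for true query answers qᵢ and their noisy estimates q̂ᵢ over n queries, the metric takes the form (our paraphrase of the paper’s definition):

$$\mathrm{RMSRE}_{\tau} \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{q}_{i}-q_{i}}{\max(q_{i},\,\tau)}\right)^{2}},$$

where the truncation parameter 𝜏 keeps queries with very small true answers from dominating the relative error.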

Optimization

To optimize utility as measured by RMSRE𝜏, we choose a capping parameter, C, and a privacy budget, 𝛼, for each slice. Together they determine how an actual measurement (such as two conversions with a total value of $3) is encoded on the AdTech side and then passed to the ARA for Contribution Bounding algorithm processing. RMSRE𝜏 can be computed exactly, since it can be expressed in terms of the bias from clipping and the variance of the noise distribution. We find that RMSRE𝜏 is convex in either C or 𝛼 when the other is held fixed (so the error-minimizing value of one parameter can be obtained efficiently), but non-convex in the joint variables (C, 𝛼) (so we may not always be able to select the best possible parameters). In either case, any off-the-shelf optimizer can be used to select privacy budgets and capping parameters. In our experiments, we use the SLSQP minimizer from the scipy.optimize library.
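The sketch below shows the shape of such an optimization; the objective is a simplified stand-in we built from the clipping-bias and noise-variance terms described above, with illustrative data, not the paper’s exact formula:

```python
import numpy as np
from scipy.optimize import minimize

def rmsre_surrogate(params, values, tau):
    """Simplified objective: squared clipping bias plus a Laplace-noise
    variance term that grows with the cap C and shrinks with budget alpha."""
    C, alpha = params
    bias_sq = (np.minimum(values, C) - values) ** 2
    noise_var = 2.0 * (C / alpha) ** 2
    return np.sqrt(np.mean((bias_sq + noise_var) / np.maximum(values, tau) ** 2))

values = np.random.default_rng(0).lognormal(mean=2.0, sigma=1.0, size=1000)
tau = 5 * np.median(values)                      # mirrors the experimental choice
result = minimize(rmsre_surrogate, x0=[np.median(values), 1.0],
                  args=(values, tau), method="SLSQP",
                  bounds=[(1e-3, values.max()), (1e-3, 64.0)])
C_opt, alpha_opt = result.x                      # chosen cap and privacy budget
```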

Synthetic data

Different ARA configurations can be evaluated empirically by testing them on a conversion dataset. However, access to such data can be restricted or slow due to privacy concerns, or simply unavailable. One way to address these limitations is to use synthetic data that replicates the characteristics of real data.

We present a method for generating synthetic data responsibly through statistical modeling of real-world conversion datasets. We first perform an empirical analysis of real conversion datasets to uncover relevant characteristics for ARA. We then design a pipeline that uses this distribution knowledge to create a realistic synthetic dataset that can be customized via input parameters.

The pipeline first generates impressions drawn from a power-law distribution (step 1), then for each impression it generates conversions drawn from a Poisson distribution (step 2) and finally, for each conversion, it generates conversion values drawn from a log-normal distribution (step 3). With dataset-dependent parameters, we find that these distributions closely match ad-dataset characteristics. Thus, one can learn parameters from historical or public datasets and generate synthetic datasets for experimentation.

Overall dataset generation steps with features for illustration.
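A compact sketch of the three-step generator (the distribution families follow the pipeline above; every numeric parameter here is an illustrative placeholder that would be fit to historical data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: impressions with power-law "weight" (a Pareto draw as one choice).
num_impressions = 10_000
impression_weight = rng.pareto(a=1.5, size=num_impressions) + 1.0

# Step 2: per-impression conversion counts from a Poisson distribution
# whose rate scales with the impression weight.
conversions_per_impression = rng.poisson(lam=0.1 * impression_weight)

# Step 3: a log-normal dollar value for every individual conversion.
conversion_values = rng.lognormal(mean=2.0, sigma=1.0,
                                  size=conversions_per_impression.sum())
```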

Experimental evaluation

We evaluate our algorithms on three real-world datasets (Criteo, AdTech Real Estate, and AdTech Travel) and three synthetic datasets. Criteo consists of 15M clicks, Real Estate consists of 100K conversions, and Travel consists of 30K conversions. Each dataset is partitioned into a training set and a test set. The training set is used to choose contribution budgets, clipping threshold parameters, and the conversion count limit (the real-world datasets have only one conversion per click), and the error is evaluated on the test set. Each dataset is partitioned into slices using impression features. For real-world datasets, we consider three queries for each slice; for synthetic datasets, we consider two queries for each slice.

For each query, we set the RMSRE𝜏 truncation parameter 𝜏 to five times the median value of the query on the training dataset. This makes the error metric invariant to data rescaling and lets us combine the errors from features of different scales by using a separate 𝜏 for each feature.

Scatter plots of real-world datasets illustrating the probability of observing a conversion value. The fitted curves represent best log-normal distribution models that effectively capture the underlying patterns in the data.

Results

We compare our optimization-based algorithm to a simple baseline approach. For each query, the baseline uses an equal contribution budget and a fixed quantile of the training data to choose the clipping threshold. Our algorithms produce substantially lower error than baselines on both real-world and synthetic datasets. Our optimization-based approach adapts to the privacy budget and data.

RMSREτ for privacy budgets {1, 2, 4, 8, 16, 32, 64} for our algorithms and baselines on three real-world and three synthetic datasets. Our optimization-based approach consistently achieves lower error than baselines that use a fixed quantile for the clipping threshold and split the contribution budget equally among the queries.

Conclusion

We study the optimization of summary reports in the ARA, which is currently deployed on hundreds of millions of Chrome browsers. We present a rigorous formulation of the contribution budgeting optimization problem for ARA with the goal of equipping researchers with a robust abstraction that facilitates practical improvements.

Our recipe, which leverages historical data to bound and scale the contributions of future data under differential privacy, is quite general and applicable to settings beyond advertising. One approach based on this work is to use past data to learn the parameters of the data distribution, and then to apply synthetic data derived from this distribution for privacy budgeting for queries on future data. Please see the paper and accompanying code for detailed algorithms and proofs.

Acknowledgements

This work was done in collaboration with Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, and Avinash Varadarajan. We thank Akash Nadan for his help.

Why GPUs Are Great for AI

GPUs have been called the rare earth metals — even the gold — of artificial intelligence, because they’re foundational for today’s generative AI era.

Three technical reasons, and many stories, explain why that’s so. Each reason has multiple facets well worth exploring, but at a high level:

  • GPUs employ parallel processing.
  • GPU systems scale up to supercomputing heights.
  • The GPU software stack for AI is broad and deep.

The net result is GPUs perform technical calculations faster and with greater energy efficiency than CPUs. That means they deliver leading performance for AI training and inference as well as gains across a wide array of applications that use accelerated computing.

In its recent report on AI, Stanford’s Human-Centered AI group provided some context. GPU performance “has increased roughly 7,000 times” since 2003 and price per performance is “5,600 times greater,” it reported.

A 2023 report captured the steep rise in GPU performance and price/performance.

The report also cited analysis from Epoch, an independent research group that measures and forecasts AI advances.

“GPUs are the dominant computing platform for accelerating machine learning workloads, and most (if not all) of the biggest models over the last five years have been trained on GPUs … [they have] thereby centrally contributed to the recent progress in AI,” Epoch said on its site.

A 2020 study assessing AI technology for the U.S. government drew similar conclusions.

“We expect [leading-edge] AI chips are one to three orders of magnitude more cost-effective than leading-node CPUs when counting production and operating costs,” it said.

NVIDIA GPUs have increased performance on AI inference 1,000x in the last ten years, said Bill Dally, the company’s chief scientist, in a keynote at Hot Chips, an annual gathering of semiconductor and systems engineers.

ChatGPT Spread the News

ChatGPT provided a powerful example of how GPUs are great for AI. The large language model (LLM), trained and run on thousands of NVIDIA GPUs, runs generative AI services used by more than 100 million people.

Since its 2018 launch, MLPerf, the industry-standard benchmark for AI, has provided numbers that detail the leading performance of NVIDIA GPUs on both AI training and inference.

For example, NVIDIA Grace Hopper Superchips swept the latest round of inference tests. NVIDIA TensorRT-LLM, inference software released since that test, delivers up to an 8x boost in performance and more than a 5x reduction in energy use and total cost of ownership. Indeed, NVIDIA GPUs have won every round of MLPerf training and inference tests since the benchmark’s release.

In February, NVIDIA GPUs delivered leading results for inference, serving up thousands of inferences per second on the most demanding models in the STAC-ML Markets benchmark, a key technology performance gauge for the financial services industry.

A RedHat software engineering team put it succinctly in a blog: “GPUs have become the foundation of artificial intelligence.”

AI Under the Hood

A brief look under the hood shows why GPUs and AI make a powerful pairing.

An AI model, also called a neural network, is essentially a mathematical lasagna, made from layer upon layer of linear algebra equations. Each equation represents the likelihood that one piece of data is related to another.

For their part, GPUs pack thousands of cores, tiny calculators working in parallel to slice through the math that makes up an AI model. This, at a high level, is how AI computing works.
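A single layer of that “lasagna” is little more than a matrix multiply followed by a nonlinearity, exactly the kind of arithmetic that maps onto thousands of parallel cores; here is a toy NumPy illustration with made-up shapes:

```python
import numpy as np

x = np.random.rand(1024)            # activations from the previous layer
W = np.random.rand(4096, 1024)      # learned weights of this layer
y = np.maximum(W @ x, 0.0)          # matrix product + ReLU nonlinearity
```

Each of the 4,096 output values is an independent dot product, so a GPU can compute them all at once rather than one after another.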

Highly Tuned Tensor Cores

Over time, NVIDIA’s engineers have tuned GPU cores to the evolving needs of AI models. The latest GPUs include Tensor Cores that are 60x more powerful than the first-generation designs for processing the matrix math neural networks use.

In addition, NVIDIA Hopper Tensor Core GPUs include a Transformer Engine that can automatically adjust to the optimal precision needed to process transformer models, the class of neural networks that spawned generative AI.

Along the way, each GPU generation has packed more memory and optimized techniques to store an entire AI model in a single GPU or set of GPUs.

Models Grow, Systems Expand

The complexity of AI models is expanding a whopping 10x a year.

The current state-of-the-art LLM, GPT-4, packs more than a trillion parameters, a metric of its mathematical density. That’s up from less than 100 million parameters for a popular LLM in 2018.

In a recent talk at Hot Chips, NVIDIA Chief Scientist Bill Dally described how single-GPU performance on AI inference expanded 1,000x in the last decade.

GPU systems have kept pace by ganging up on the challenge. They scale up to supercomputers, thanks to their fast NVLink interconnects and NVIDIA Quantum InfiniBand networks.

For example, the DGX GH200, a large-memory AI supercomputer, combines up to 256 NVIDIA GH200 Grace Hopper Superchips into a single data-center-sized GPU with 144 terabytes of shared memory.

Each GH200 superchip is a single server with 72 Arm Neoverse CPU cores and four petaflops of AI performance. A new four-way Grace Hopper configuration packs into a single compute node a whopping 288 Arm cores and 16 petaflops of AI performance, with up to 2.3 terabytes of high-speed memory.

And NVIDIA H200 Tensor Core GPUs announced in November pack up to 288 gigabytes of the latest HBM3e memory technology.

Software Covers the Waterfront

An expanding ocean of GPU software has evolved since 2007 to enable every facet of AI, from deep-tech features to high-level applications.

The NVIDIA AI platform includes hundreds of software libraries and apps. The CUDA programming language and the cuDNN-X library for deep learning provide a base on top of which developers have created software like NVIDIA NeMo, a framework to let users build, customize and run inference on their own generative AI models.

Many of these elements are available as open-source software, the grab-and-go staple of software developers. More than a hundred of them are packaged into the NVIDIA AI Enterprise platform for companies that require full security and support. Increasingly, they’re also available from major cloud service providers as APIs and services on NVIDIA DGX Cloud.

SteerLM, one of the latest AI software updates for NVIDIA GPUs, lets users fine-tune models during inference.

A 70x Speedup in 2008

Success stories date back to a 2008 paper from AI pioneer Andrew Ng, then a Stanford researcher. Using two NVIDIA GeForce GTX 280 GPUs, his three-person team achieved a 70x speedup over CPUs processing an AI model with 100 million parameters, finishing work that used to require several weeks in a single day.

“Modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods,” they reported.

Andrew Ng described his experiences using GPUs for AI in a GTC 2015 talk.

In a 2015 talk at NVIDIA GTC, Ng described how he continued using more GPUs to scale up his work, running larger models at Google Brain and Baidu. Later, he helped found Coursera, an online education platform where he taught hundreds of thousands of AI students.

Ng counts Geoff Hinton, one of the godfathers of modern AI, among the people he influenced. “I remember going to Geoff Hinton saying check out CUDA, I think it can help build bigger neural networks,” he said in the GTC talk.

The University of Toronto professor spread the word. “In 2009, I remember giving a talk at NIPS [now NeurIPS], where I told about 1,000 researchers they should all buy GPUs because GPUs are going to be the future of machine learning,” Hinton said in a press report.

Fast Forward With GPUs

AI’s gains are expected to ripple across the global economy.

A McKinsey report in June estimated that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion annually across the 63 use cases it analyzed in industries like banking, healthcare and retail. So, it’s no surprise Stanford’s 2023 AI report said that a majority of business leaders expect to increase their investments in AI.

Today, more than 40,000 companies use NVIDIA GPUs for AI and accelerated computing, attracting a global community of 4 million developers. Together they’re advancing science, healthcare, finance and virtually every industry.

Among the latest achievements, NVIDIA described a whopping 700,000x speedup using AI to ease climate change by keeping carbon dioxide out of the atmosphere (see video below). It’s one of many ways NVIDIA is applying the performance of GPUs to AI and beyond.

Learn how GPUs put AI into production.

How Getir reduced model training durations by 90% with Amazon SageMaker and AWS Batch

This is a guest post co-authored by Nafi Ahmet Turgut, Hasan Burak Yel, and Damla Şentürk from Getir.

Established in 2015, Getir has positioned itself as the trailblazer in the sphere of ultrafast grocery delivery. This innovative tech company has revolutionized the last-mile delivery segment with its compelling offering of “groceries in minutes.” With a presence across Turkey, the UK, the Netherlands, Germany, and the United States, Getir has become a multinational force to be reckoned with. Today, the Getir brand represents a diversified conglomerate encompassing nine different verticals, all working synergistically under a singular umbrella.

In this post, we explain how we built an end-to-end product category prediction pipeline to help commercial teams by using Amazon SageMaker and AWS Batch, reducing model training duration by 90%.

Understanding our existing product assortment in a detailed manner is a crucial challenge that we, along with many businesses, face in today’s fast-paced and competitive market. An effective solution to this problem is the prediction of product categories. A model that generates a comprehensive category tree allows our commercial teams to benchmark our existing product portfolio against that of our competitors, offering a strategic advantage. Therefore, our central challenge is the creation and implementation of an accurate product category prediction model.

We capitalized on the powerful tools provided by AWS to tackle this challenge and effectively navigate the complex field of machine learning (ML) and predictive analytics. Our efforts led to the successful creation of an end-to-end product category prediction pipeline, which combines the strengths of SageMaker and AWS Batch.

This capability of predictive analytics, particularly the accurate forecast of product categories, has proven invaluable. It provided our teams with critical data-driven insights that optimized inventory management, enhanced customer interactions, and strengthened our market presence.

The methodology we explain in this post ranges from the initial phase of feature set gathering to the final implementation of the prediction pipeline. An important aspect of our strategy has been the use of SageMaker and AWS Batch to refine pre-trained BERT models for seven different languages. Additionally, our seamless integration with AWS’s object storage service Amazon Simple Storage Service (Amazon S3) has been key to efficiently storing and accessing these refined models.

SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly and effortlessly build and train ML models, and then directly deploy them into a production-ready hosted environment.

As a fully managed service, AWS Batch helps you run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there’s no need to install or manage batch computing software, so you can focus your time on analyzing results and solving problems. We used AWS Batch GPU jobs, which run workloads on an instance’s GPUs.

Overview of solution

Five people from Getir’s data science team and infrastructure team worked together on this project. The project was completed in a month and deployed to production after a week of testing.

The following diagram shows the solution’s architecture.

The model pipeline is run separately for each country. The architecture includes two AWS Batch GPU cron jobs for each country, running on defined schedules.

We overcame some challenges by strategically deploying SageMaker and AWS Batch GPU resources. The process used to address each difficulty is detailed in the following sections.

Fine-tuning multilingual BERT models with AWS Batch GPU jobs

We sought a solution to support multiple languages for our diverse user base. BERT models were an obvious choice due to their established ability to handle complex natural language tasks effectively. In order to tailor these models to our needs, we harnessed the power of AWS by using single-node GPU instance jobs. This allowed us to fine-tune pre-trained BERT models for each of the seven languages we required support for. Through this method, we ensured high precision in predicting product categories, overcoming any potential language barriers.
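As a rough sketch of what fine-tuning for one such language looks like (not Getir’s actual pipeline; the checkpoint is the public multilingual BERT, and the toy data, label count, and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=4)                    # 4 toy category labels

texts = ["organic whole milk 1L", "çikolatalı gofret"]   # toy product names
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                               # a few illustrative steps
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In the production pipeline, a job like this runs once per language as a single-node AWS Batch GPU job, and the resulting model artifacts are written to Amazon S3.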

Efficient model storage using Amazon S3

Our next step was to address model storage and management. For this, we selected Amazon S3, known for its scalability and security. Storing our fine-tuned BERT models on Amazon S3 enabled us to provide easy access to different teams within our organization, thereby significantly streamlining our deployment process. This was a crucial aspect in achieving agility in our operations and a seamless integration of our ML efforts.

Creating an end-to-end prediction pipeline

An efficient pipeline was required to make the best use of our pre-trained models. We first deployed these models on SageMaker, which allowed for real-time predictions with low latency, thereby enhancing our user experience. For larger-scale batch predictions, which were equally vital to our operations, we utilized AWS Batch GPU jobs. This ensured the optimal use of our resources, providing us with a perfect balance of performance and efficiency.

Exploring future possibilities with SageMaker MMEs

As we continue to evolve and seek efficiencies in our ML pipeline, one avenue we are keen to explore is using SageMaker multi-model endpoints (MMEs) for deploying our fine-tuned models. With MMEs, we can potentially streamline the deployment of various fine-tuned models, ensuring efficient model management while also benefiting from the native capabilities of SageMaker like shadow variants, auto scaling, and Amazon CloudWatch integration. This exploration aligns with our continuous pursuit of enhancing our predictive analytics capabilities and providing superior experiences to our customers.

Conclusion

Our successful integration of SageMaker and AWS Batch has not only addressed our specific challenges but also significantly boosted our operational efficiency. Through the implementation of a sophisticated product category prediction pipeline, we are able to empower our commercial teams with data-driven insights, thereby facilitating more effective decision-making.

Our results speak volumes about our approach’s effectiveness. We have achieved an 80% prediction accuracy across all four levels of category granularity, which plays an important role in shaping the product assortments for each country we serve. This level of precision extends our reach beyond language barriers and ensures we cater to our diverse user base with the utmost accuracy.

Moreover, by strategically using scheduled AWS Batch GPU jobs, we’ve been able to reduce our model training durations by 90%. This efficiency has further streamlined our processes and bolstered our operational agility. Efficient model storage using Amazon S3 has played a critical role in this achievement, balancing both real-time and batch predictions.

For more information about how to get started building your own ML pipelines with SageMaker, see Amazon SageMaker resources. AWS Batch is an excellent option if you are looking for a low-cost, scalable solution for running batch jobs with low operational overhead. To get started, see Getting Started with AWS Batch.


About the Authors

Nafi Ahmet Turgut finished his master’s degree in Electrical & Electronics Engineering and worked as a graduate research scientist. His focus was building machine learning algorithms to simulate nervous network anomalies. He joined Getir in 2019 and currently works as a Senior Data Science & Analytics Manager. His team is responsible for designing, implementing, and maintaining end-to-end machine learning algorithms and data-driven solutions for Getir.

Hasan Burak Yel received his bachelor’s degree in Electrical & Electronics Engineering from Boğaziçi University. He worked at Turkcell, mainly focused on time series forecasting, data visualization, and network automation. He joined Getir in 2021 and currently works as a Data Science & Analytics Manager with the responsibility of Search, Recommendation, and Growth domains.

Damla Şentürk received her bachelor’s degree in Computer Engineering from Galatasaray University and is pursuing her master’s degree in Computer Engineering at Boğaziçi University. She joined Getir in 2022, and has been working as a Data Scientist. She has worked on commercial, supply chain, and discovery-related projects.

Esra Kayabalı is a Senior Solutions Architect at AWS, specialized in the analytics domain, including data warehousing, data lakes, big data analytics, batch and real-time data streaming, and data integration. She has 12 years of software development and architecture experience. She is passionate about learning and teaching cloud technologies.

Fast Optimal Locally Private Mean Estimation via Random Projections

We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. Existing algorithms for this problem either incur sub-optimal error or have high communication and/or run-time complexity. We propose a new algorithmic framework, ProjUnit, for private mean estimation that yields algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a 1+o(1)-factor. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace, normalizes the result, and then runs an…

(Apple Machine Learning Research)

Manifold Diffusion Fields

This paper was accepted at the Diffusion Models workshop at NeurIPS 2023.
Score-based models have quickly become the de facto choice for generative modeling of images, text and more recently molecules. However, to adapt score-based generative modeling to these domains, the score network needs to be carefully designed, hampering its applicability to arbitrary data domains. In this paper we tackle this problem by taking a functional view of data. This functional view allows us to cast seemingly different domains to a common shared representation. We then re-formulate the score function to…

(Apple Machine Learning Research)

Generating Molecular Conformers with Manifold Diffusion Fields

This paper was accepted at Generative AI and Biology workshop at NeurIPS 2023.
In this paper we tackle the problem of generating a molecule conformation in 3D space given its 2D structure. We approach this problem through the lens of a diffusion model for functions on Riemannian manifolds. Our approach is simple and scalable, and obtains results that are on par with state-of-the-art while making no assumptions about the explicit structure of molecules.

(Apple Machine Learning Research)

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

This paper was accepted at the workshop I Can’t Believe It’s Not Better! (ICBINB) at NeurIPS 2023.
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap, and find that pre-trained language models offer limited help in auto-regressive text-to-image generation. We provide a two-fold explanation by analyzing tokens from each modality…

(Apple Machine Learning Research)