Intuit uses Amazon Bedrock and Anthropic’s Claude to explain taxes in TurboTax to millions of consumer tax filers

Intuit is committed to providing its customers innovative solutions that simplify complex financial processes. Tax filing can be a challenge, with its ever-changing regulations and intricate nuances. That’s why the company empowers millions of individuals and small businesses to comprehend tax-related information effortlessly and file with full confidence that their taxes are done right.

For the 2024 tax season, Intuit set out to raise the bar with generative AI, using Anthropic’s advanced language model Claude in Amazon Bedrock—underpinned by Intuit’s proprietary tax engine—to provide individual tax filers with simple-to-understand contextual explanations of tax calculations, backed by real-time accuracy checks.

In this blog post, we discuss the journey of developing a solution that benefited millions of TurboTax customers in 2024.

The challenge

Taxes, with their complicated regulations and nuances, can be a labyrinth for even the most seasoned filers. The U.S. tax code spans more than 15,000 federal and state tax forms for individual and business tax filers. It is estimated that Americans spend 8.9 billion hours every year doing their taxes.

To streamline and simplify the tax filing experience, Intuit’s AI/GenAI-powered TurboTax products guide consumers through the process. One challenge is to explain complex calculations in a simple-to-understand manner so taxpayers can confidently file their taxes and seamlessly connect to a human expert whenever needed. According to Nhung Ho, vice president of AI at Intuit, “With Intuit Assist for TurboTax, we wanted to answer every customer’s question about how they arrived at their final tax outcome, and we had to do it in clear, concise language, so they have peace of mind before they file.”

The solution

Applying its years of domain expertise, robust data set, and proprietary tax knowledge engine, Intuit worked closely with Anthropic and Amazon Web Services to further boost filer confidence by integrating Claude via Amazon Bedrock into its AI financial assistant, Intuit Assist for TurboTax. During federal tax reviews, where customers see a summary of their return, the combined work of Intuit, Anthropic, and AWS provides simple explanations of tax calculations. Helping users understand how their tax result is calculated gives them confidence that their taxes were filed correctly. The following video shows examples of tax explanations.

Implementing Claude in Amazon Bedrock: a collaborative effort

In June 2023, Intuit announced its proprietary generative AI operating system (GenOS), which runs on AWS infrastructure and empowers the company’s developers to design, build, and deploy breakthrough generative AI experiences. GenOS serves as the primary paved path for rolling out generative AI applications or capabilities in production across the company.

Last fall, Intuit began experimenting with Anthropic’s Claude via Amazon Bedrock.

“After a successful partnership with Amazon SageMaker for its ML capabilities, Intuit looked forward to working with Amazon Bedrock as a managed service to simplify the deployment and management of LLMs,” explained Nhung.

Each year, tax filing is a seasonal process between January 1 and October 15, so the ability to scale rapidly to help meet the needs of millions of Intuit customers during this period was a critical success factor for Intuit’s tax explanations use case with Anthropic Claude in Amazon Bedrock.

“Amazon Bedrock offered Intuit the latency, scalability, and reliability to introduce AI-powered tax explanations to its customers,” Nhung added. “This allowed Intuit to deliver valuable generative AI experiences to its users.”

The company took advantage of AWS elasticity to acquire resources as they were needed, and to release them when no longer needed. Provisioned throughput for Amazon Bedrock enabled Intuit to achieve the scalability and latency needed to serve millions of customers, beginning in January 2024. Intuit also implemented a multi-region setup to provide the resiliency needed for such a critical application.
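
As an illustrative sketch only (not Intuit’s actual implementation), a client-side failover across Regions for Amazon Bedrock might look like the following; the Region list and model ID below are assumptions:

import json
import boto3
from botocore.exceptions import ClientError

# Hypothetical Regions and model ID, for illustration only
REGIONS = ["us-east-1", "us-west-2"]
MODEL_ID = "anthropic.claude-v2"

def invoke_with_failover(prompt):
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 512,
    })
    for region in REGIONS:
        try:
            client = boto3.client("bedrock-runtime", region_name=region)
            response = client.invoke_model(modelId=MODEL_ID, body=body)
            return json.loads(response["body"].read())["completion"]
        except ClientError:
            continue  # primary Region unavailable; try the next one
    raise RuntimeError("All Regions failed")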

Additionally, a private connection between TurboTax Virtual Private Cloud (VPC) and Amazon Bedrock made sure that user data was appropriately protected.

“Intuit takes great pains to protect user data with our anti-fraud technology. It is important that user data remain secure. Anthropic’s Claude LLM, managed by Amazon Bedrock, provides that capability,” Nhung explained.

Conclusion

By using Amazon Bedrock to integrate Anthropic’s Claude into its tax preparation software, Intuit realized the following benefits:

  • Simplified Tax Explanations: By demystifying tax complexities, Intuit instilled confidence in users, empowering them to navigate the tax filing process with greater ease and assurance.
  • Simplified Management: Amazon Bedrock’s managed experience for Anthropic’s Claude made it easy for Intuit to scale securely.

For the 2024 tax season, Intuit’s innovative use of Anthropic’s Claude in Amazon Bedrock is helping demystify the complexities of tax filing. By harnessing the power of advanced language models, the company is redefining the way people understand and engage with tax-related information. Through personalized explanations, tailored guidance, and a commitment to continuous improvement, Intuit is paving the way for a “done for you” future, where the hard work of tax preparation is done on its customers’ behalf, with a seamless path to human tax and bookkeeping experts whenever needed.

As the company moves forward, it remains dedicated to using cutting-edge generative AI technologies to enhance its solutions and provide its customers with the tools they need to achieve financial success. The successful integration of Amazon Bedrock in the tax domain has opened up new opportunities for Intuit to leverage advanced language models in other areas of financial management, solidifying its position as a trailblazer in fintech.


About the Author

Shivanshu Upadhyay is a Principal Solutions Architect in the AWS Industries group. In this role, he helps the most advanced adopters of AWS transform their industry by effectively using data and AI.

Read More

Quantization-Aware Training for Large Language Models with PyTorch

In this blog, we present an end-to-end Quantization-Aware Training (QAT) flow for large language models in PyTorch. We demonstrate how QAT in PyTorch can recover up to 96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). We present the QAT APIs in torchao and showcase how users can leverage them for fine-tuning in torchtune.

Figure 1: Llama3-8B fine-tuned on the C4 dataset (en subset) with and without QAT using int8 per token dynamic activations + int4 grouped per channel weights, evaluated on hellaswag and wikitext on a A100 GPU. Note the log scale for wikitext (lower is better).

To demonstrate the effectiveness of QAT in an end-to-end flow, we further lowered the quantized model to XNNPACK, a highly optimized neural network library for backends including iOS and Android, through executorch. After lowering to XNNPACK, the QAT model saw 16.8% lower perplexity than the PTQ model, while maintaining the same model size and on-device inference and generation speeds.

Lowered model metric         | PTQ         | QAT
Wikitext word perplexity (↓) | 23.316      | 19.403
Wikitext byte perplexity (↓) | 1.850       | 1.785
Wikitext bits per byte (↓)   | 0.887       | 0.836
Model size                   | 3.881 GB    | 3.881 GB
On-device inference speed    | 5.065 tok/s | 5.265 tok/s
On-device generation speed   | 8.369 tok/s | 8.701 tok/s

Table 1: QAT achieved 16.8% lower perplexity and unchanged model sizes and on-device inference and generation speeds on the Llama3-8B model lowered to XNNPACK. Linear layers are quantized using int8 per token dynamic activations + int4 grouped per channel weights, and embeddings are additionally quantized to int4 using a group size of 32 (QAT is only applied to linear layers). Wikitext evaluation is performed using 5 samples and a max sequence length of 127 on server CPU, since evaluation is not available on device (lower is better for all wikitext results). On-device inference and generation is benchmarked on the Samsung Galaxy S22 smartphone.

QAT APIs

We are excited for users to try our QAT API in torchao, which can be leveraged for both training and fine-tuning. This API involves two steps, prepare and convert: prepare applies a transformation on the linear layers in the model to simulate the numerics of quantization during training, and convert actually quantizes these layers into lower bit-widths after training. The converted model can then be used in the exact same way as the PTQ model:

import torch
from torchtune.models.llama3 import llama3
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Smaller version of llama3 to fit in a single GPU
model = llama3(
    vocab_size=4096,
    num_layers=16,
    num_heads=16,
    num_kv_heads=4,
    embed_dim=2048,
    max_seq_len=2048,
).cuda()

# Quantizer for int8 dynamic per token activations +
# int4 grouped per channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Standard training loop
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()
for i in range(10):
    example = torch.randint(0, 4096, (2, 16)).cuda()
    target = torch.randn((2, 16, 4096)).cuda()
    output = model(example)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Convert fake quantize to actual quantize operations
# The quantized model has the exact same structure as the
# quantized model produced in the corresponding PTQ flow
# through `Int8DynActInt4WeightQuantizer`
model = qat_quantizer.convert(model)

# inference or generate

Fine-tuning with torchtune

We also integrated this QAT flow into torchtune and provided recipes to run this in a distributed setting, similar to the existing full fine-tune distributed recipe. Users can additionally apply QAT during LLM fine-tuning by running the following command. See this README for more details.

tune run --nproc_per_node 8 qat_distributed --config llama3/8B_qat_full

What is Quantization-Aware Training?

Quantization-Aware Training (QAT) is a common quantization technique for mitigating model accuracy/perplexity degradation that arises from quantization. This is achieved by simulating quantization numerics during training while keeping the weights and/or activations in the original data type, typically float, effectively “fake quantizing” the values instead of actually casting them to lower bit-widths:

# PTQ: x_q is quantized and cast to int8
# scale and zero point (zp) refer to parameters used to quantize x_float
# qmin and qmax refer to the range of quantized values
x_q = (x_float / scale + zp).round().clamp(qmin, qmax).to(torch.int8)

# QAT: x_fq is still in float
# Fake quantize simulates the numerics of quantize + dequantize
x_fq = (x_float / scale + zp).round().clamp(qmin, qmax)
x_fq = (x_fq - zp) * scale

Since quantization involves non-differentiable operations like rounding, the QAT backward pass typically uses straight-through estimators (STE), a mechanism to estimate the gradients flowing through non-smooth functions, to ensure the gradients passed to the original weights are still meaningful. In this manner, the gradients are computed with the knowledge that the weights will ultimately be quantized after training, effectively allowing the model to adjust for quantization noise during the training process. Note that an alternative to QAT is quantized training, which actually casts the values to lower bit dtypes during training, but prior efforts have only seen success up to 8-bits, whereas QAT is effective even at lower bit-widths.
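
As a minimal illustration of fake quantization with an STE backward pass (a sketch, not torchao’s actual implementation):

import torch

class FakeQuantizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zp, qmin, qmax):
        # Simulate quantize + dequantize while staying in float
        x_q = (x / scale + zp).round().clamp(qmin, qmax)
        return (x_q - zp) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through unchanged,
        # as if round/clamp were the identity function
        return grad_output, None, None, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantizeSTE.apply(x, 0.1, 0, -128, 127)
y.sum().backward()  # x.grad is all ones despite the non-differentiable round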

QAT in PyTorch

We added an initial QAT flow in torchao under prototype here. Currently we support int8 dynamic per-token activations + int4 grouped per-channel weights (abbreviated 8da4w) for linear layers. These settings are motivated by a combination of kernel availability on edge backends and prior research on LLM quantization, which found that per-token activation and per-group weight quantization achieves the best model quality for LLMs compared to other quantization schemes.

Figure 2: torchao QAT flow. This flow involves two steps: (1) prepare, which inserts the fake quantization ops into the model’s linear layers, and (2) convert, which converts these fake quantization ops with actual quantize and dequantize ops after training.

This flow produces the exact same quantized model as the PTQ flow using the same quantization settings (through Int8DynActInt4WeightQuantizer), but with quantized weights that achieve superior accuracies and perplexities. Thus, we can use the model converted from the QAT flow as a drop-in replacement for the PTQ model and reuse all the backend delegation logic and underlying kernels.
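
For comparison, the corresponding PTQ flow quantizes the model directly without a training loop. A sketch follows; the exact import path for the quantizer may vary across torchao versions:

# Sketch of the corresponding PTQ flow (import path may differ by torchao version)
from torchao.quantization import Int8DynActInt4WeightQuantizer

ptq_quantizer = Int8DynActInt4WeightQuantizer()
ptq_model = ptq_quantizer.quantize(model)  # same structure as the QAT-converted model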

Experimental Results

All experiments in this blog post are performed using the torchtune QAT integration described above. We use 6-8 A100 GPUs with 80 GB each to fine-tune Llama2-7B and Llama3-8B on the C4 dataset (en subset) for 5000 steps. For all experiments, we use batch size = 2, learning rate = 2e-5, max sequence length = 4096 for Llama2 and 8192 for Llama3, Fully Sharded Data Parallel (FSDP) as our distribution strategy, and activation checkpointing to reduce the memory footprint. For 8da4w experiments, we use a group size of 256 for weights.

Since the pre-training dataset is not easily accessible, we perform QAT during the fine-tuning process. Empirically, we found that disabling fake quantization for the first N steps led to better results, presumably because doing so allows the weights to stabilize before we start introducing quantization noise to the fine-tuning process. We disable fake quantization for the first 1000 steps for all our experiments.
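
A sketch of this delayed fake quantization, assuming the enable/disable helpers exposed in torchao’s prototype QAT module (helper names may vary across versions; train_step and total_steps are hypothetical placeholders):

from torchao.quantization.prototype.qat import (
    disable_8da4w_fake_quant,
    enable_8da4w_fake_quant,
)

model = qat_quantizer.prepare(model)
model.apply(disable_8da4w_fake_quant)  # train with float numerics first

for step in range(total_steps):
    if step == 1000:
        model.apply(enable_8da4w_fake_quant)  # start injecting quantization noise
    train_step(model)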

We evaluate our quantized models using the lm-evaluation-harness integration in torchtune. We report evaluation results from a variety of tasks commonly used to evaluate LLMs, including hellaswag, a commonsense sentence completion task, wikitext, a next token/byte prediction task, and a few question-answering tasks such as arc, openbookqa, and piqa. For wikitext, perplexity refers to the inverse of how well the model can predict the next word or byte (lower is better), and bits_per_byte refers to how many bits are needed to predict the next byte (lower is also better here). For all other tasks, acc_norm refers to the accuracy normalized by the byte-length of the target string.
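
For reference, both metrics are standard functions of the model’s total negative log-likelihood on the evaluation text (these are the usual definitions, not formulas specific to this post), where the sum runs over predicted tokens \(x_i\):

\[
\text{word perplexity} = \exp\left(\frac{\sum_i -\log p(x_i)}{N_{\text{words}}}\right),
\qquad
\text{bits per byte} = \frac{\sum_i -\log p(x_i)}{N_{\text{bytes}} \cdot \ln 2}
\]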

Int8 Dynamic Activations + Int4 Weight Quantization (8da4w)

Starting with Llama2 8da4w quantization, we saw that QAT was able to recover 62% of the normalized accuracy degradation on hellaswag compared to PTQ, and 58% and 57% of the word and byte perplexity degradation (respectively) on wikitext. We see similar improvements for most of the other tasks.

Figure 3a: Llama2-7B 8da4w quantization with and without QAT

Figure 3b: Llama2-7B 8da4w quantization with and without QAT, evaluated on wikitext (lower is better)

Llama3 8da4w quantization saw even more pronounced improvements with QAT. On the hellaswag evaluation task, we were able to recover 96% of the normalized accuracy degradation on hellaswag compared to PTQ, with minimal overall degradation (<1%) compared to the non-quantized accuracy. On the wikitext evaluation task, QAT recovered 68% and 65% of the word and byte perplexity degradation (respectively). Even on arc_challenge, which was difficult for Llama2 QAT, we were able to recover 51% of the normalized accuracy degradation.

Figure 4a: Llama3-8B 8da4w quantization with and without QAT

Figure 4b: Llama3-8B 8da4w quantization with and without QAT, evaluated on wikitext (lower is better)

Lower Bit Weight Only Quantization

We further extended the torchao QAT flow to 2-bit and 3-bit weight only quantization and repeated the same experiments for Llama3-8B. Quantization degradation is more severe at lower bit-widths, so we use a group size of 32 for all experiments for finer-grained quantization.

However, this is still not enough for 2-bit PTQ, which saw wikitext perplexity explode. To mitigate this problem, we leverage knowledge from prior sensitivity analysis that the first 3 and last 2 layers of the Llama3 model are the most sensitive, and skip quantizing these layers in exchange for a moderate increase in quantized model size (1.78 GB for 2-bit and 1.65 GB for 3-bit). This brought the wikitext word perplexity down from 603336 to 6766, which is significant but still far from acceptable. To further improve the quantized model, we turn to QAT.
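
The layer skipping described above can be expressed as a simple module filter. The following is an illustrative sketch (torchao’s actual API for selecting layers may differ; model refers to the Llama3 model from the earlier snippets, and the module-name format is an assumption):

import torch

# Llama3-8B has 32 transformer blocks; skip the first 3 and last 2
SENSITIVE_BLOCKS = {0, 1, 2, 30, 31}

def should_quantize(module_name):
    # Assumes module names of the form "layers.<idx>.<...>" (hypothetical)
    parts = module_name.split(".")
    if "layers" in parts:
        block_idx = int(parts[parts.index("layers") + 1])
        return block_idx not in SENSITIVE_BLOCKS
    return True

to_quantize = [name for name, module in model.named_modules()
               if isinstance(module, torch.nn.Linear) and should_quantize(name)]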

Figure 5a: Llama3-8B 2-bit weight only quantization with and without QAT, evaluated on wikitext (lower is better). Bars with “skip” refer to skipping quantization for the first 3 and last 2 layers of the model, which are more sensitive to quantization. Note the log scale.

We observe that applying QAT while skipping quantization for the first 3 and last 2 layers further brought the word perplexity down to a much more reasonable value of 30 (from 6766). More generally, QAT was able to recover 53% of the normalized accuracy degradation on hellaswag compared to PTQ, and 99% and 89% of the word and byte perplexity degradation (respectively) on wikitext. Without skipping the sensitive layers, however, QAT was far less effective at mitigating degradation in quantized model quality.

Figure 5b: Llama3-8B 2-bit weight only quantization with and without QAT. Bars with “skip” refer to skipping quantization for the first 3 and last 2 layers of the model, which are more sensitive to quantization.

For 3-bit weight only quantization, QAT was effective even without skipping the first 3 and last 2 layers, though skipping these layers still led to better results for both PTQ and QAT. In the skip case, QAT was able to recover 63% of the normalized accuracy degradation on hellaswag compared to PTQ, and 72% and 65% of the word and byte perplexity degradation (respectively) on wikitext.

Figure 6a: Llama3-8B 3-bit weight only quantization with and without QAT. Bars with “skip” refer to skipping quantization for the first 3 and last 2 layers of the model, which are more sensitive to quantization.

Figure 6b: Llama3-8B 3-bit weight only quantization with and without QAT, evaluated on wikitext (lower is better). Bars with “skip” refer to skipping quantization for the first 3 and last 2 layers of the model, which are more sensitive to quantization. Note the log scale.

QAT Overhead

QAT inserts many fake quantize operations throughout the model, adding considerable overhead to both the fine-tuning speed and the memory usage. For a model like Llama3-8B, for example, we have (32 * 7) + 1 = 225 linear layers (each of the 32 transformer blocks contains 7 linear layers, namely the four attention projections and the three MLP projections, plus the final output projection), each of which has at least 1 fake quantize for the weights and potentially 1 fake quantize for the input activations. The memory footprint increase is also significant, since we cannot mutate the weights in-place and need to clone them before applying fake quantization, though this overhead can be mostly mitigated by enabling activation checkpointing.

In our microbenchmarks, we found that 8da4w QAT fine-tuning is ~34% slower than regular full fine-tuning. With activation checkpointing, the memory increase per GPU is around 2.35 GB. Most of these overheads are fundamental to how QAT works, though we may be able to speed up computation with torch.compile in the future.

Per GPU statistics       | Full fine-tuning | QAT fine-tuning
Median tokens per second | 546.314 tok/s    | 359.637 tok/s
Median peak memory       | 67.501 GB        | 69.850 GB

Table 2: Llama3 QAT fine-tuning overhead for int8 per token dynamic activations + int4 grouped per channel weights on 6 A100 GPUs (each with 80GB memory).

Looking Ahead

In this blog, we presented a QAT flow for LLMs through torchao, integrated this flow with the fine-tuning APIs in torchtune, and demonstrated its potential to recover most of the quantization degradation compared to PTQ and match non-quantized performance on certain tasks. There are many directions for future explorations:

  • Hyperparameter tuning. It is likely that extensive hyperparameter tuning can further improve the results of finetuning and QAT. In addition to the general hyperparameters like the learning rate, batch size, dataset size, and number of fine-tuning steps, we should also tune QAT-specific ones, such as when to start/stop fake quantization, how many steps to fake quantize, and regularization parameters for fake quantized values.
  • Outlier reduction techniques. In our experiments, we found that both PTQ and QAT were susceptible to outliers. In addition to simple clamping and regularization during fine-tuning, we can explore techniques that allow the network to learn how to control these outliers (e.g. learned quantization ranges, clipped softmax, and gated attention), or possibly even borrow outlier suppression techniques from post-training settings (e.g. SpinQuant, SmoothQuant) and apply them sparingly throughout the fine-tuning process.
  • Mixed-precision and more complex dtypes. Especially in the lower bit regime, we saw that skipping quantization for certain sensitive layers was effective for both PTQ and QAT. Did we need to skip quantizing these layers altogether, or can we still quantize them, just to lower bit-widths? It will be interesting to explore mixed-precision quantization in the context of QAT. Training with newer dtypes such as MX4 is another promising direction, especially given that the upcoming Blackwell GPUs will no longer support int4 tensor cores.
  • Composability with LoRA and QLoRA. Our QAT integration in torchtune currently only supports the full fine-tuning workflow. However, many users wish to fine-tune their models using low-ranked adaptors to substantially reduce their memory footprint. Composing QAT with techniques like LoRA / QLoRA will enable users to reap the memory and performance benefits of these approaches while producing a model that will ultimately be quantized with minimal model quality degradation.
  • Composability with torch.compile. This is another potential way to significantly speed up fake quantization computations in QAT while reducing memory footprint. torch.compile is currently not compatible with the distribution strategy used in full distributed fine-tuning recipes in torchtune (with or without QAT), but support will be added in the near future.
  • Quantizing other layers. In this work, we only explored quantizing the linear layers. However, in the context of long sequence lengths, the KV cache often becomes the throughput bottleneck and can reach tens of GBs, hence LLM-QAT explored quantizing the KV cache alongside activations and weights. Prior work has also had success with quantizing the embedding layer down to 2-bits in other transformer-based models.
  • End-to-end evaluation on performant CUDA kernels. A natural extension of this work is to provide an end-to-end QAT flow evaluated on performant CUDA kernels, similar to the existing 8da4w QAT flow lowered to XNNPACK kernels through executorch. For int4 weight only quantization, we can leverage the efficient int4 weight mm kernel with bitpacking for quantization, and there is ongoing work to add QAT support for this kernel: https://github.com/pytorch/ao/pull/383. For 8da4w quantization, mixed 4-bit/8-bit GEMM is also being added in cutlass. This will be needed to build an efficient 8da4w CUDA kernel.

The QAT code can be found here. Please refer to this torchtune tutorial to get started. If you have any further questions, please feel free to open an issue on the torchao github or reach out to andrewor@meta.com. We welcome your feedback and contributions!

Read More

Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile

Today, we’re releasing torchchat, a library showcasing how to seamlessly and performantly run Llama 3, 3.1, and other large language models across laptop, desktop, and mobile.

In our previous blog posts, we showed how to use native PyTorch 2.0 to run LLMs with great performance using CUDA. Torchchat expands on this with more target environments, models, and execution modes, as well as providing important functions such as export and quantization in a way that’s easy to understand.

You will find the project organized into three areas:

  • Python: Torchchat provides a REST API that is called via a Python CLI or can be accessed via the browser
  • C++: Torchchat produces a desktop-friendly binary using PyTorch’s AOTInductor backend
  • Mobile devices: Torchchat uses ExecuTorch to export a .pte binary file for on-device inference
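
For example, typical torchchat commands look like the following (illustrative; see the repository README for the exact subcommands and flags):

# Download a model, generate text locally, and export it for mobile
python3 torchchat.py download llama3
python3 torchchat.py generate llama3 --prompt "Write me a haiku about PyTorch"
python3 torchchat.py export llama3 --output-pte-path llama3.pte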

torchchat schema

Performance

The following table tracks the performance of torchchat for Llama 3 for a variety of configurations.

Numbers for Llama 3.1 are coming soon.

Llama 3 8B Instruct on Apple MacBook Pro M1 Max 64GB

Mode        | DType   | Llama 3 8B Tokens/Sec
Arm Compile | float16 | 5.84
Arm Compile | int8    | 1.63
Arm Compile | int4    | 3.99
Arm AOTI    | float16 | 4.05
Arm AOTI    | int8    | 1.05
Arm AOTI    | int4    | 3.28
MPS Eager   | float16 | 12.63
MPS Eager   | int8    | 16.9
MPS Eager   | int4    | 17.15

Llama 3 8B Instruct on Linux x86 and CUDA

Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180 GB RAM + A100 (80GB)

Mode         | DType    | Llama 3 8B Tokens/Sec
x86 Compile  | bfloat16 | 2.76
x86 Compile  | int8     | 3.15
x86 Compile  | int4     | 5.33
CUDA Compile | bfloat16 | 83.23
CUDA Compile | int8     | 118.17
CUDA Compile | int4     | 135.16

Torchchat provides exceptional performance for Llama 3 8B on mobile (iPhone and Android). We ran Llama 2 7B on the Samsung Galaxy S22 and S23, and on the iPhone 15 Pro, using 4-bit GPTQ and post-training quantization (PTQ). Early work on Llama 3 8B support is included in collaboration with ExecuTorch. Many improvements were made to export speed, memory overhead, and runtime speed. Ultimately, though, we’ll be seeing even stronger performance through Core ML, MPS, and HTP in the near future. We are excited!

We encourage you to clone the torchchat repo and give it a spin, explore its capabilities, and share your feedback as we continue to empower the PyTorch community to run LLMs locally and on constrained devices. Together, let’s unlock the full potential of generative AI and LLMs on any device. Please submit issues as you see them as well as in PyTorch plus ExecuTorch, since we are still iterating quickly. We’re also inviting community contributions across a broad range of areas, from additional models, target hardware support, new quantization schemes, or performance improvements. Happy experimenting!

Read More

Creators To Have Personalized AI Assistants, Meta CEO Mark Zuckerberg Tells NVIDIA CEO Jensen Huang

In a highly anticipated fireside chat at SIGGRAPH 2024, NVIDIA founder and CEO Jensen Huang and Meta founder and CEO Mark Zuckerberg discussed the transformative potential of open source AI and AI assistants.

Zuckerberg kicked off the discussion by announcing the launch of AI Studio, a new platform that allows users to create, share and discover AI characters, making AI more accessible to millions of creators and small businesses.

“Every single restaurant, every single website will probably, in the future, have these AIs …” Huang said.

“…just like every business has an email address and a website and a social media account, I think, in the future, every business is going to have an AI,” Zuckerberg responded.

Zuckerberg has gotten it right before. Huang credited Zuckerberg and Meta with being leaders in AI, even if only some have noticed until recently.

“You guys have done amazing AI work,” Huang said, citing advancements from Meta in computer vision, language models and real-time translation. “We all use PyTorch, that comes out of Meta.”

The Importance of Open Source in Advancing AI

Zuckerberg highlighted the importance of open source in advancing AI — with the two business leaders emphasizing the importance of open platforms for innovation.

Meta has rapidly emerged as a leader in AI, putting it to work throughout its businesses — most notably with Meta AI, which is used across Facebook, Instagram and WhatsApp — and advancing open-source AI throughout the industry, most recently with the release of Llama 3.1.

The open-source model represents a significant investment of time and training resources. The largest version of Llama boasts 405 billion parameters and was trained on over 16,000 NVIDIA H100 GPUs.

“One of the things that drives quality improvements is it used to be that you have a different model for each type of content,” Zuckerberg explained.

“As the models get bigger and more general, that gets better and better. So, I kind of dream of one day like you can almost imagine all of Facebook or Instagram being like a single AI model that has unified all these different content types and systems together,” he added.

Zuckerberg sees collaboration as key to more advancements. In a blog post released last week, Zuckerberg wrote that the release of Llama 3.1 promises to be an “inflection point” in adopting open source in AI.

These advancements promise more tools to foster engagement, create compelling and personalized content — such as digital avatars — and build virtual worlds.

More broadly, the advancement of AI across a broad ecosystem promises to supercharge human productivity, for example, by giving every human on earth a digital assistant, or assistants, that they can interact with quickly and fluidly, allowing people to live richer lives.

“I feel like I’m collaborating with WhatsApp,” Huang said. “Imagine I’m sitting here typing, and it’s generating the images as I’m going. I go back and change my words, and it’s generating other images.”

Vision for the Future

Looking ahead, both CEOs shared their visions for the future.

Zuckerberg expressed optimism about bringing AI together with the real world through eyeglasses that can be used to help transform education, entertainment and work, noting his company’s collaboration with eyewear maker Luxottica.

Huang emphasized how interacting with AIs is becoming more fluid, moving beyond just text-based interactions.

“Today’s AI is kind of turn-based. You say something, it says something back to you,” Huang said. In the future, AI could contemplate multiple options, or come up with a tree of options and simulate outcomes, making it much more powerful.

Throughout the conversation, the two leaders playfully bantered about everything from fashion to steak sandwiches, ending the discussion by exchanging leather jackets.

Zuckerberg gave Huang a black leather shearling jacket with an enormous hood.

Huang gave Zuckerberg his own leather jacket, which he got from his wife, Lori, just for SIGGRAPH, quipping that it was just “two hours old.”

“Well this one’s yours,” Zuckerberg said with a smile. “This is worth more because it’s used.”

Read More

“Everybody Will Have An AI Assistant,” NVIDIA CEO Tells SIGGRAPH Audience

The generative AI revolution — with deep roots in visual computing — is amplifying human creativity even as accelerated computing promises significant gains in energy efficiency, NVIDIA founder and CEO Jensen Huang said Monday.

That makes this week’s SIGGRAPH professional graphics conference, in Denver, the logical venue to discuss what’s next.

“Everybody will have an AI assistant,” Huang said. “Every single company, every single job within the company, will have AI assistance.”

But even as generative AI promises to amplify human productivity, Huang said the accelerated computing technology that underpins it promises to make computing more energy efficient.

“Accelerated computing helps you save so much energy, 20 times, 50 times, and doing the same processing,” Huang said. “The first thing we have to do, as a society, is accelerate every application we can: this reduces the amount of energy being used all over the world.”

The conversation follows a spate of announcements from NVIDIA today.

NVIDIA introduced a new suite of NIM microservices tailored for diverse workflows, including OpenUSD, 3D modeling, physics, materials, robotics, industrial digital twins and physical AI. These advancements aim to enhance developer capabilities, particularly with the integration of Hugging Face Inference-as-a-Service on DGX Cloud.

In addition, Shutterstock has launched a Generative 3D Service, while Getty Images has upgraded its offerings using NVIDIA Edify technology.

In the realm of AI and graphics, NVIDIA has revealed new OpenUSD NIM microservices and reference workflows designed for generative physical AI applications.

This includes a program for accelerating humanoid robotics development through new NIM microservices for robotics simulation and more.

Finally, WPP, the world’s largest advertising agency, is using Omniverse-driven generative AI for The Coca-Cola Company, helping drive brand authenticity, showcasing the practical applications of NVIDIA’s advancements in AI technology across various industries.

Huang and Goode started their conversation by exploring how visual computing gave rise to everything from computer games to digital animation to GPU-accelerated computing and, most recently, generative AI powered by industrial-scale AI factories.

All these advancements build on one another. Robotics, for example, requires advanced AI and photorealistic virtual worlds where AI can be trained before being deployed into next-generation humanoid robots.

Huang explained that robotics requires three computers: one to train the AI, one to test the AI in a physically accurate simulation, and one within the robot itself.

“Just about every industry is going to be affected by this, whether it’s scientific computing trying to do a better job predicting the weather with a lot less energy, to augmenting and collaborating with creators to generate images, or generating virtual scenes for industrial visualization,” Huang said. “Robotic self-driving cars are all going to be transformed by generative AI.”

Likewise, NVIDIA Omniverse systems — built around the OpenUSD standard — will also be key to harnessing generative AI to create assets that the world’s largest brands can use.

By pulling from brand assets that live in Omniverse, these systems can capture and replicate carefully curated brand magic.

Finally, all these systems — visual computing, simulation and large-language models — will come together to create digital humans who can help people interact with digital systems of all kinds.

“One of the things that we’re announcing here this week is the concept of digital agents, digital AIs that will augment every single job in the company,” Huang said.

“And so one of the most important use cases that people are discovering is customer service,” Huang said. “In the future, my guess is that it’s going to be human still, but AI in the loop.”

All of this, like any new tool, promises to amplify human productivity and creativity. “Imagine the stories that you’re going to be able to tell with these tools,” Huang said.

Read More

Build generative AI–powered Salesforce applications with Amazon Bedrock

This post is co-authored by Daryl Martis and Darvish Shadravan from Salesforce.

This is the fourth post in a series discussing the integration of Salesforce Data Cloud and Amazon SageMaker.

In Part 1 and Part 2, we show how Salesforce Data Cloud and Einstein Studio integration with SageMaker allows businesses to access their Salesforce data securely using SageMaker’s tools to build, train, and deploy models to endpoints hosted on SageMaker. SageMaker endpoints can be registered with Salesforce Data Cloud to activate predictions in Salesforce. In Part 3, we demonstrate how business analysts and citizen data scientists can create machine learning (ML) models, without code, in Amazon SageMaker Canvas and deploy trained models for integration with Salesforce Einstein Studio to create powerful business applications.

In this post, we show how native integrations between Salesforce and Amazon Web Services (AWS) enable you to Bring Your Own Large Language Models (BYO LLMs) from your AWS account to power generative artificial intelligence (AI) applications in Salesforce. Requests and responses between Salesforce and Amazon Bedrock pass through the Einstein Trust Layer, which promotes responsible AI use across Salesforce.

We demonstrate BYO LLM integration by using Anthropic’s Claude model on Amazon Bedrock to summarize a list of open service cases and opportunities on an account record page, as shown in the following figure.

Partner quote

“We continue to expand on our strong collaboration with AWS with our BYO LLM integration with Amazon Bedrock, empowering our customers with more model choices and allowing them to create AI-powered features and Copilots customized for their specific business needs. Our open and flexible AI environment, grounded with customer data, positions us well to be leaders in AI-driven solutions in the CRM space.”

–Kaushal Kurapati, Senior Vice President of Product for AI at Salesforce

Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can quickly experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

Salesforce Data Cloud and Einstein Model Builder

Salesforce Data Cloud is a data platform that unifies your company’s data, giving every team a 360-degree view of the customer to drive automation and analytics, personalize engagement, and power trusted AI. Data Cloud creates a holistic customer view by turning volumes of disconnected data into a single, trusted model that’s simple to access and understand. With data harmonized within Salesforce Data Cloud, customers can put their data to work to build predictions and generative AI–powered business processes across sales, support, and marketing.

With Einstein Model Builder, customers can build their own models using Salesforce’s low-code model builder experience or integrate their own custom-built models into the Salesforce platform. Einstein Model Builder’s BYO LLM experience provides the capability to register custom generative AI models from external environments such as Amazon Bedrock and Salesforce Data Cloud.

Once custom Amazon Bedrock models are registered in Einstein Model Builder, models are connected through the Einstein Trust Layer, a robust set of features and guardrails that protect the privacy and security of data, improve the safety and accuracy of AI results, and promote the responsible use of AI across Salesforce. Registered models can then be used in Prompt Builder, a newly launched, low-code prompt engineering tool that allows Salesforce admins to build, test, and fine-tune trusted AI prompts that can be used across the Salesforce platform. These prompts can be integrated with Salesforce capabilities such as Flows, Invocable Actions, and Apex.

Solution overview

With the Salesforce Einstein Model Builder BYO LLM feature, you can invoke Amazon Bedrock models in your AWS account. At the time of this writing, Salesforce supports Anthropic Claude 3 models on Amazon Bedrock for BYO LLM. For this post, we use the Anthropic Claude 3 Sonnet model. To learn more about inference with Claude 3, refer to Anthropic Claude models in the Amazon Bedrock documentation.
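
For reference, a direct Claude 3 Sonnet invocation through the Amazon Bedrock Messages API looks like the following minimal sketch (this is the raw API path, not the Einstein Trust Layer integration; the Region and prompt are assumptions):

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "Summarize the open cases for this account."}
    ],
})
response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)
print(json.loads(response["body"].read())["content"][0]["text"])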

For your implementation, you may use the model of your choice. Refer to Bring Your Own Large Language Model in Einstein 1 Studio for models supported with Salesforce Einstein Model Builder.

The following image shows a high-level architecture of how you can integrate the LLM from your AWS account into the Salesforce Prompt Builder.

In this post, we show how to build generative AI–powered Salesforce applications with Amazon Bedrock. The following are the high-level steps involved:

  1. Grant Amazon Bedrock invoke model permission to an AWS Identity and Access Management (IAM) user
  2. Register the Amazon Bedrock model in Salesforce Einstein Model Builder
  3. Integrate the prompt template with the field in the Lightning App Builder

Prerequisites

Before deploying this solution, make sure you meet the following prerequisites:

  1. Have access to Salesforce Data Cloud and meet the requirements for using BYO LLM.
  2. Have Amazon Bedrock set up. If this is the first time you are accessing Anthropic Claude models on Amazon Bedrock, you need to request access. You need to have sufficient permissions to request access to models through the console. To request model access, sign in to the Amazon Bedrock console and select Model access at the bottom of the left navigation pane.

Solution walkthrough

To build generative AI–powered Salesforce applications with Amazon Bedrock, implement the following steps.

Grant Amazon Bedrock invoke model permission to an IAM User

Salesforce Einstein Studio requires an access key and a secret to access the Amazon Bedrock API. Follow the instructions to set up an IAM user and access keys. The IAM user must have Amazon Bedrock invoke model permission to access the model. Complete the following steps:

  1. On the IAM console, select Users in the navigation panel. On the right side of the console, choose Add permissions and Create inline policy.
  2. On the Specify permissions screen, in the Service dropdown menu, select Bedrock.
  3. Under Actions allowed, enter “invoke.” Under Read, select InvokeModel. Select All under Resources. Choose Next.
  4. On the Review and create screen, under Policy name, enter BedrockInvokeModelPolicy. Choose Create policy.
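
Equivalently, the same inline policy can be attached programmatically. The following is a sketch; the IAM user name is hypothetical:

import json
import boto3

iam = boto3.client("iam")
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "bedrock:InvokeModel", "Resource": "*"}
    ],
}
iam.put_user_policy(
    UserName="einstein-byo-llm-user",  # hypothetical IAM user
    PolicyName="BedrockInvokeModelPolicy",
    PolicyDocument=json.dumps(policy),
)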

Register Amazon Bedrock model in Einstein Model Builder

  1. On the Salesforce Data Cloud console, under the Einstein Studio tab, choose Add Foundation Model.
  2. Choose Connect to Amazon Bedrock.
  3. For Endpoint information, enter the endpoint name, your AWS account Access Key, and your Secret Key. Enter the Region and Model information. Choose Connect.
  4. Now, create the configuration for the model endpoint you created in the previous steps. Provide Inference parameters such as temperature to set the deterministic factor of the LLM. Enter a sample prompt to verify the response.
  5. Next, you can save this new model configuration. Enter the name for the saved LLM model and choose Create Model.
  6. After the model creation is successful, choose Close and proceed to create the prompt template.
  7. Select the Model name to open the Model configuration.
  8. Select Create Prompt Template to launch the prompt builder.
  9. Select Field Generation as the prompt template type, enter a template name, set Object to Account, and set Object Field to PB Case and Oppty Summary. This associates the template with a custom field on the account record object to summarize the cases.

For this demo, a rich text field named PB Case and Oppty Summary was created and added to the Salesforce Account page layout according to the Add a Field Generation Prompt Template to a Lightning Record Page instructions.

  1. Provide the prompt and input variables or objects for data grounding and select the model. Refer to Prompt Builder to learn more.

Integrate prompt template with the field in the Lightning App builder

  1. On the Salesforce console, use the search bar to find Lightning App Builder. Build or edit an existing page to integrate the prompt template with the field as shown in the following screenshot. Refer to Add a Field Generation Prompt Template to a Lightning Record Page for detailed instructions.
  2. Navigate to the Account page and click the PB Case and Oppty Summary field enabled for chat completion to launch the Einstein generative AI assistant and summarize the account case data.

Cleanup

Complete the following steps to clean up your resources.

  1. Delete the IAM user
  2. Delete the foundation model in Einstein Studio

Amazon Bedrock offers on-demand inference pricing, so there are no additional costs from a continued model subscription. To remove model access, refer to the steps in Remove model access.

Conclusion

In this post, we demonstrated how to use your own LLM in Amazon Bedrock to power Salesforce applications. We used summarization of open service cases on an account object as an example to showcase the implementation steps.

Amazon Bedrock is a fully managed service that makes high-performing FMs from leading AI companies and Amazon available for your use through a unified API. You can choose from a wide range of FMs to find the model that is best suited for your use case.

Salesforce Einstein Model Builder lets you register your Amazon Bedrock model and use it in Prompt Builder to create prompts grounded in your data. These prompts can then be integrated with Salesforce capabilities such as Flows, Invocable Actions, and Apex. You can then build custom generative AI applications with Claude 3 that are grounded in the Salesforce user experience. Amazon Bedrock requests from Salesforce pass through the Einstein Trust Layer, which provides responsible AI use with features such as dynamic grounding, zero data retention, and toxicity detection while maintaining safety and security standards.

AWS and Salesforce are excited for our mutual customers to harness this integration and build generative AI–powered applications. To learn more and start building, refer to the following resources.


About the Authors

Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers, including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City. Follow him on LinkedIn.

Darvish Shadravan is a Director of Product Management in the AI Cloud at Salesforce. He focuses on building AI/ML features for CRM, and is the product owner for the Bring Your Own LLM feature. You can connect with him on LinkedIn.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Ravi Bhattiprolu is a Sr. Partner Solutions Architect at AWS. Ravi works with strategic partners Salesforce and Tableau to deliver innovative and well-architected products and solutions that help joint customers realize their business objectives.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In Mike’s spare time, he enjoys spending time with his family, sports, and outdoor activities.

Dharmendra Kumar Rai (DK Rai) is a Sr. Data Architect, Data Lake & AI/ML, serving strategic customers. He works closely with customers to understand how AWS can help them solve problems, especially in the AI/ML and analytics space. DK has many years of experience in building data-intensive solutions across a range of industry verticals, including high-tech, FinTech, insurance, and consumer-facing applications.

Read More

CMU-MATH Team’s Innovative Approach Secures 2nd Place at the AIMO Prize

Recently, our CMU-MATH team proudly clinched 2nd place in the Artificial Intelligence Mathematical Olympiad (AIMO) out of 1,161 participating teams, earning a prize of $65,536! This prestigious competition aims to revolutionize AI in mathematical problem-solving, with the ultimate goal of building a publicly shared AI model capable of winning a gold medal in the International Mathematical Olympiad (IMO). Dive into our blog to discover the winning formula that set us apart in this significant contest.

Background: The AIMO competition

The Artificial Intelligence Mathematical Olympiad (AIMO) Prize, initiated by XTX Markets, is a pioneering competition designed to revolutionize AI’s role in mathematical problem-solving. It pushes the boundaries of AI by solving complex mathematical problems akin to those in the International Mathematical Olympiad (IMO). The advisory committee of AIMO includes Timothy Gowers and Terence Tao, both winners of the Fields Medal. Attracting attention from world-class mathematicians as well as machine learning researchers, the AIMO sets a new benchmark for excellence in the field.

AIMO has introduced a series of progress prizes. The first of these was a Kaggle competition, with the 50 test problems hidden from competitors. The problems are comparable in difficulty to the AMC12 and AIME exams for the USA IMO team pre-selection. The private leaderboard determined the final rankings, which then determined the distribution of $253,952 in the one-million dollar prize pool among the top five teams. Each submitted solution was allocated either a P100 GPU or 2xT4 GPUs, with up to 9 hours to solve the 50 problems.

To give an idea of what the problems look like, AIMO provided a 10-problem training set open to the public. Here are two example problems from the set:

  • Let $k, l > 0$ be parameters. The parabola $y = kx^2 - 2kx + l$ intersects the line $y = 4$ at two points $A$ and $B$. These points are distance 6 apart. What is the sum of the squares of the distances from $A$ and $B$ to the origin?
  • Each of the three-digit numbers $111$ to $999$ is coloured blue or yellow in such a way that the sum of any two (not necessarily different) yellow numbers is equal to a blue number. What is the maximum possible number of yellow numbers there can be?

The first problem is about analytic geometry. It requires the model to understand geometric objects based on textual descriptions and perform symbolic computations using the distance formula and Vieta’s formulas. The second problem falls under extremal combinatorics, a topic beyond the scope of high school math. It’s notoriously challenging because there’s no general formula to apply; solving it requires creative thinking to exploit the problem’s structure. It’s non-trivial to master all these required capabilities even for humans, let alone language models.

In general, the problems in AIMO were significantly more challenging than those in GSM8K, a standard mathematical reasoning benchmark for LLMs, and about as difficult as the hardest problems in the challenging MATH dataset. The limited computational resources—P100 and T4 GPUs, both over five years old and much slower than more advanced hardware—posed an additional challenge. Thus, it was crucial to employ appropriate models and inference strategies to maximize accuracy within the constraints of limited memory and FLOPs.

Our winning formula

Unlike most teams, which relied on a single model for the competition, we used a dual-model approach. We paired a policy model, designed to generate problem solutions in the form of computer code, with a reward model, which scored the outputs of the policy model. Our final solutions were derived through weighted majority voting: we generated multiple solutions with the policy model, assigned a weight to each solution using the reward model, and chose the answer with the highest total weight. This strategy stemmed from our study on compute-optimal inference, which demonstrated that weighted majority voting with a reward model consistently outperforms naive majority voting given the same inference budget.
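
The voting step itself reduces to a few lines (a minimal sketch, not the team’s actual code):

from collections import defaultdict

def weighted_majority_vote(answers, scores):
    """answers: final answer of each sampled solution;
    scores: reward-model score in [0, 1] for each solution."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# With uniform scores this reduces to naive majority voting
print(weighted_majority_vote([42, 17, 42], [0.9, 0.8, 0.7]))  # -> 42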

Both models in our submission were fine-tuned from the DeepSeek-Math-7B-RL checkpoint. Below, we detail the fine-tuning process and inference strategies for each model.

Policy model: Program-aided problem solver based on self-refinement

The policy model served as the primary problem solver in our approach. We noted that LLMs can perform mathematical reasoning using both text and programs. Natural language excels in abstract reasoning but falls short in precise computation, symbolic manipulation, and algorithmic processing. Programs, on the other hand, are adept at rigorous operations and can leverage specialized tools like equation solvers for complex calculations. To harness the benefits of both methods, we implemented the Program-Aided Language Models (PAL) approach, or more precisely the Tool-Augmented Reasoning (ToRA) approach, originally proposed by CMU & Microsoft. This approach combines natural language reasoning with program-based problem-solving: the model first generates rationales in text form, followed by a computer program that is executed to derive a numerical answer.

Figure 1: The tool-integrated reasoning format (from ToRA paper)

To train the model, we needed a suitable problem set (the given “training set” of this competition is too small for fine-tuning) with “ground truth” solutions in ToRA format for supervised fine-tuning. Given the problem difficulty (comparable to AMC12 and AIME exams) and the special format (integer answers only), we used a combination of AMC, AIME, and Odyssey-Math as our problem set, removing multiple-choice options and filtering out problems with non-integer answers. This resulted in a dataset of 2,600 problems. We prompted GPT-4o (and DeepSeek-Coder-V2) with few-shot examples to generate 64 solutions for each problem, retaining those that led to correct answers. Our final dataset contained 41,160 problem-solution pairs. We performed supervised fine-tuning on the open-sourced DeepSeek-Math-7B-RL model for 3 epochs with a learning rate of 2e-5.

During inference, we employed self-refinement (another widely adopted technique proposed at CMU!), providing the policy model with feedback on the execution results of its generated program (e.g., invalid output, execution failure) and allowing it to revise the solution accordingly.
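A minimal sketch of that refinement loop is shown below. `generate` stands in for sampling from the policy model, and the feedback wording is illustrative; the point is only to convey the control flow.

```python
def solve_with_self_refine(question, generate, max_rounds=3):
    """Generate a solution, execute it, and feed execution errors back for repair.

    generate: callable(prompt) -> ToRA-style completion from the policy model
    """
    prompt = question
    for _ in range(max_rounds):
        completion = generate(prompt)
        program = extract_program(completion)
        ok, output = run_program(program) if program else (False, "no program found")
        if ok and output.lstrip("-").isdigit():
            return int(output)  # a valid integer answer: stop refining
        # Otherwise, show the model its own attempt plus the failure and retry.
        prompt = (f"{prompt}\n\nPrevious attempt:\n{completion}\n"
                  f"Execution feedback: {output}\nPlease fix the solution.")
    return None  # no valid answer within the refinement budget
```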

Below we present an ablation study of the techniques we employed for the policy model, using accuracy on a selected subset of the MATH test set as the evaluation metric. The combination of techniques leads to large performance gains over the naive baselines.

| Model | Output format | Inference strategy | Accuracy |
| --- | --- | --- | --- |
| DeepSeek RL 7B | Text-only | Greedy decoding | 54.02% |
| DeepSeek RL 7B | ToRA | Greedy decoding | 58.05% |
| DeepSeek RL 7B | ToRA | Greedy + Self-refine | 60.73% |
| DeepSeek RL 7B | ToRA | Maj@16 + Self-refine | 70.46% |
| DeepSeek RL 7B | ToRA | Maj@64 + Self-refine | 72.48% |
| Our fine-tuned model | ToRA | Maj@16 + Self-refine | 74.50% |
| Our fine-tuned model | ToRA | Maj@64 + Self-refine | 76.51% |

Table: Ablation study of our techniques on a selected MATH subset (in which the problems are similar to AIMO problems). Maj@n denotes majority voting over n sampled solutions.

Notably, the first-place team also used ToRA with self-refinement. However, they curated a much larger problem set of 60,000 problems and used GPT-4 to generate solutions in the ToRA format. Their dataset was more than 20x larger than ours. The cost to generate solutions was far beyond our budget as an academic team (over $100,000 based on our estimation). Our problem set was based purely on publicly available data, and we spent only ~$1,000 for solution generation.

Reward model: Solution scorer using label-balance training

While the policy model was a creative problem solver, it could sometimes hallucinate and produce incorrect solutions. On the publicly available 10-problem training set, our policy model correctly solved only two problems using standard majority voting over 32 sampled solutions. Interestingly, for two other problems, the model generated correct solutions that failed to be selected because wrong answers dominated the majority vote.

This observation highlighted the potential of the reward model. The reward model was a solution scorer that took the policy model’s output and generated a score between 0 and 1. Ideally, it assigned high scores to correct solutions and low scores to incorrect ones, aiding in the selection of correct answers during weighted majority voting.
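One simple way to realize such a scorer, described here purely as an assumed setup rather than our exact training code, is to attach a single-logit classification head to the base checkpoint and squash its output through a sigmoid:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: a one-logit classification head on top of the base checkpoint.
MODEL_NAME = "deepseek-ai/deepseek-math-7b-rl"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

@torch.no_grad()
def score_solution(problem, solution):
    """Return a score in [0, 1]; higher means the solution looks more likely correct."""
    text = problem + "\n" + solution
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()
```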

The reward model was fine-tuned from a DeepSeek-Math-7B-RL model on a labeled dataset containing both correct and incorrect problem-solution pairs. We utilized the same problem set as for the policy model training and expanded it by incorporating problems from the MATH dataset with integer answers. Simple as it may sound, generating high-quality data and training a strong reward model was non-trivial. We considered the following two essential factors for the reward model training set:

  • Label balance: The dataset should contain both correct (positive examples) and incorrect solutions (negative examples) for each problem, with a balanced number of correct and incorrect solutions.
  • Diversity: The dataset should include diverse solutions for each problem, encompassing different correct approaches and various failure modes.

Sampling solutions from a single model could not satisfy both requirements. For example, while our fine-tuned policy model achieved very high accuracy on the problem set, it generated hardly any incorrect solutions and lacked diversity among its correct solutions. Conversely, sampling from a weaker model, such as DeepSeek-Math-7B-Base, rarely yielded correct solutions. To create a diverse set of models with varying capabilities, we employed two novel strategies:

  • Interpolate between strong and weak models. For MATH problems, we interpolated the model parameters of a strong model (DeepSeek-Math-7B-RL) and a weak model (DeepSeek-Math-7B-Base) to obtain models with different levels of capability. Denote by \( \mathbf{\theta}_{\mathrm{strong}} \) and \( \mathbf{\theta}_{\mathrm{weak}} \) the parameters of the strong and weak models. We considered interpolated models with parameters \( \mathbf{\theta}_{\alpha} = \alpha \mathbf{\theta}_{\mathrm{strong}} + (1-\alpha)\mathbf{\theta}_{\mathrm{weak}} \) and set \( \alpha \in \{0.3, 0.4, \cdots, 1.0\} \), obtaining 8 models. These models exhibited different problem-solving accuracies on MATH. We sampled two solutions from each model for each problem, yielding diverse outputs with balanced correct and incorrect solutions. This technique was motivated by research on model parameter merging (e.g., model soups) and is an interesting application of that idea: generating models with different levels of capability (see the sketch after this list).
  • Leverage intermediate checkpoints. For the AMC, AIME, and Odyssey-Math problems, recall that our policy model had been fine-tuned on them for 3 epochs. The final model and its intermediate checkpoints naturally provided multiple models with different levels of accuracy on these problems. We sampled 12 solutions from each of the checkpoints obtained after 0, 1, 2, and 3 epochs of training.
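Below is a minimal sketch of the interpolation itself. It assumes the two checkpoints share an identical architecture so that their state dicts can be averaged key by key; loading and saving the checkpoints is omitted.

```python
def interpolate_checkpoints(strong_state, weak_state, alpha):
    """Compute theta_alpha = alpha * theta_strong + (1 - alpha) * theta_weak."""
    return {name: alpha * strong_state[name] + (1.0 - alpha) * weak_state[name]
            for name in strong_state}

# Eight interpolated models spanning weak-to-strong capability.
# strong_state / weak_state would be the RL and Base state dicts, loaded elsewhere.
alphas = [round(0.3 + 0.1 * i, 1) for i in range(8)]  # 0.3, 0.4, ..., 1.0
```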

These strategies allowed us to obtain a diverse set of models almost for free and to sample varied correct and incorrect solutions. We further filtered the generated data by removing wrong solutions with non-integer answers, since such answers are trivially identified as incorrect during inference. In addition, for each problem we kept equal numbers of correct and incorrect solutions to ensure label balance and avoid a biased reward model. The final dataset contained 7,000 unique problems and 37,880 labeled problem-solution pairs. We fine-tuned the DeepSeek-Math-7B-RL model on the curated dataset for 2 epochs with a learning rate of 2e-5.
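The balancing step itself is simple bookkeeping, sketched below under assumed data structures: drop incorrect solutions whose answers are not integers, then subsample so each problem contributes equal numbers of correct and incorrect examples.

```python
import random

def balance_labels(solutions_by_problem, seed=0):
    """solutions_by_problem: dict mapping a problem id to a list of
    (solution_text, answer_str, is_correct) tuples."""
    rng = random.Random(seed)
    balanced = []
    for pid, sols in solutions_by_problem.items():
        correct = [s for s in sols if s[2]]
        # Wrong solutions with non-integer answers are trivially filterable at
        # inference time, so they add no useful training signal here.
        incorrect = [s for s in sols
                     if not s[2] and s[1].lstrip("-").isdigit()]
        k = min(len(correct), len(incorrect))
        if k == 0:
            continue  # a problem with only one label contributes nothing useful
        balanced.extend((pid, *s) for s in rng.sample(correct, k))
        balanced.extend((pid, *s) for s in rng.sample(incorrect, k))
    return balanced
```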

Figure 2: Weighted majority voting system based on the policy and reward models.

We validated the effectiveness of our reward model on the public training set. Notably, by pairing the policy model with the reward model and applying weighted majority voting, our method correctly solved 4 out of the 10 problems – while a single policy model could only solve 2 using standard majority voting.

Concluding remarks: Towards machine-based mathematical reasoning

With the models and techniques described above, our CMU-MATH team solved 22 out of 50 problems in the private test set, securing second place and achieving the best performance among academic teams. This outcome marks a significant step toward the goal of machine-based mathematical reasoning.

However, we also note that the accuracy achieved by our models still trails that of proficient human competitors, who can easily solve over 95% of AIMO problems, indicating substantial room for improvement. A wide range of directions remains to be explored:

  • Advanced inference-time algorithms for mathematical reasoning. Our dual-model approach is a robust technique for enhancing model reasoning at inference time. Recent research from our team suggests that more advanced inference-time algorithms, such as tree search methods, could surpass weighted majority voting (see the sketch after this list). Although computational constraints limited our ability to deploy such techniques in the AIMO competition, future work on optimizing these inference-time algorithms can potentially lead to better mathematical reasoning approaches.
  • Integration of Automated Theorem Proving. Integrating automated theorem proving (ATP) tools, such as Lean, represents another promising frontier. ATP tools can provide rigorous logical frameworks and support for deeper mathematical analyses, potentially elevating the precision and reliability of problem-solving strategies employed by LLMs. The synergy between LLMs and ATP could lead to breakthroughs in complex problem-solving scenarios, where deep logical reasoning is essential.
  • Leveraging Larger, More Diverse Datasets. The competition reinforced a crucial lesson about the pivotal role of data in machine learning. Rich, diverse datasets, especially those comprising challenging mathematical problems, are vital for training more capable models. We advocate for the creation and release of larger datasets focused on mathematical reasoning, which would not only benefit our research but also the broader AI and mathematics communities.
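As a rough illustration of what such an inference-time algorithm could look like (this is a hypothetical sketch, not what we ran in the competition), a best-first tree search expands partial solutions step by step and always explores the candidate the scoring model currently rates highest:

```python
import heapq

def best_first_search(question, propose_steps, score, is_complete, budget=50):
    """Toy best-first search over partial solutions guided by a scoring model.

    propose_steps: callable(question, partial) -> list of candidate next steps
    score:         callable(question, partial) -> float, higher is more promising
    is_complete:   callable(partial) -> bool, True once a final answer is reached
    """
    frontier = [(-score(question, ""), "")]  # max-heap via negated scores
    expansions = 0
    while frontier and expansions < budget:
        neg_score, partial = heapq.heappop(frontier)
        if is_complete(partial):
            return partial
        expansions += 1
        for step in propose_steps(question, partial):
            candidate = partial + step
            heapq.heappush(frontier, (-score(question, candidate), candidate))
    return None  # search budget exhausted without completing a solution
```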

Finally, we would like to thank Kaggle and XTX Markets for organizing this wonderful competition. We have open-sourced our code and datasets used in our solution to ensure reproducibility and facilitate future research. We invite the community to explore, utilize, and build upon our work, which is available in our GitHub repository. For further details about our results, please feel free to reach out to us!

Read More

Recipe for Magic: WPP and NVIDIA Omniverse Help The Coca-Cola Company Scale Generative AI Content That Pops With Brand Authenticity

Recipe for Magic: WPP and NVIDIA Omniverse Help The Coca-Cola Company Scale Generative AI Content That Pops With Brand Authenticity

When The Coca-Cola Company produces thirst-quenching marketing, the creative elements of campaigns aren’t just left to chance — there’s a recipe for the magic. Now, the beverage company, through its partnership with WPP Open X, is beginning to scale its global campaigns with generative AI from NVIDIA Omniverse and NVIDIA NIM microservices.

“With NVIDIA, we can personalize and customize Coke and meals imagery across 100-plus markets, delivering on hyperlocal relevance with speed and at global scale,” said Samir Bhutada, global vice president of StudioX Digital Transformation at The Coca-Cola Company.

Coca-Cola has been working with WPP to develop digital twin tools and roll out Prod X — a custom production studio experience created specifically for the beverage maker to use globally.

WPP announced today at SIGGRAPH that The Coca-Cola Company will be an early adopter for integrating the new NVIDIA NIM microservices for Universal Scene Description (aka OpenUSD) into its Prod X roadmap. OpenUSD is a 3D framework that enables interoperability between software tools and data types for building virtual worlds. NIM inference microservices provide models as optimized containers.

The USD Search NIM allows WPP to tap into a large archive of models to create on-brand assets, and the USD Code NIM can be used to assemble them into scenes.

These NIM microservices will enable Prod X users to create 3D advertising assets that contain culturally relevant elements on a global scale, using prompt engineering to quickly make adjustments to AI-generated images so that brands can better target their products at local markets.

Tapping Into NVIDIA NIM Microservices to Deploy Generative AI 

WPP said that the NVIDIA NIM microservices will have a lasting impact on the 3D engineering and art world.

The USD Search NIM can make WPP’s massive visual asset libraries quickly available via written prompts. The USD Code NIM allows developers to enter prompts and get Python code to create novel 3D worlds.

“The beauty of the solution is that it compresses multiple phases of the production process into a single interface and process,” said Perry Nightingale, senior vice president of creative AI at WPP, of the new NIM microservices. “It empowers artists to get more out of the technology and create better work.”

Redefining Content Production With Production Studio

WPP recently announced the release of Production Studio on WPP Open, the company’s intelligent marketing operating system powered by AI. Co-developed with its production company, Hogarth, Production Studio taps into the Omniverse development platform and OpenUSD for its generative AI-enabled product configurator workflows.

Production Studio can streamline and automate multilingual text, image and video creation, simplifying content creation for advertisers and marketers and directly addressing the challenges they continue to face in producing brand-compliant, product-accurate content at scale.

“Our groundbreaking research with NVIDIA Omniverse for the past few years, and the research and development associated with having built our own core USD pipeline and decades of experience in 3D workflows, is what made it possible for us to stand up a tailored experience like this for The Coca-Cola Company,” said Priti Mhatre, managing director for strategic consulting and AI at Hogarth.

SIGGRAPH attendees can hear more about WPP’s efforts by joining the company’s session on “Robotics, Generative AI, and OpenUSD: How WPP Is Building the Future of Creativity.”

NVIDIA founder and CEO Jensen Huang will also be featured at the event in fireside chats with Meta founder and CEO Mark Zuckerberg and WIRED Senior Writer Lauren Goode. Watch the talks and other sessions from NVIDIA at SIGGRAPH 2024 on demand.

Photo credit: WPP, The Coca-Cola Company

See notice regarding software product information.

Read More

Reality Reimagined: NVIDIA Introduces fVDB to Build Bigger Digital Models of the World

Reality Reimagined: NVIDIA Introduces fVDB to Build Bigger Digital Models of the World

NVIDIA announced at SIGGRAPH fVDB, a new deep-learning framework for generating AI-ready virtual representations of the real world.

fVDB is built on top of OpenVDB, the industry-standard library for simulating and rendering sparse volumetric data such as water, fire, smoke and clouds.

Generative physical AI, such as autonomous vehicles and robots that inhabit the real world, need to have “spatial intelligence” — the ability to understand and operate in 3D space.

Capturing the large scale and super-fine details of the world around us is essential. But converting reality into a virtual representation to train AI is hard.

Raw data for real-world environments can be collected through many different techniques, like neural radiance fields (NeRFs) and lidar. fVDB translates this data into massive, AI-ready environments rendered in real time.

Building on a decade of innovation in the OpenVDB standard, the introduction of fVDB at SIGGRAPH represents a significant leap forward in how industries can benefit from digital twins of the real world.

Reality-scale virtual environments are used for training autonomous agents. City-scale 3D models are captured by drones for climate science and disaster planning. Today, 3D generative AI is even used to plan urban spaces and smart cities.

fVDB enables industries to tap into spatial intelligence on a larger scale and with higher resolution than ever before, making physical AI even smarter.

The framework builds NVIDIA-accelerated AI operators on top of NanoVDB, a GPU-accelerated data structure for efficient 3D simulations. These operators include convolution, pooling, attention and meshing, all of which are designed for high-performance 3D deep learning applications.

AI operators allow businesses to build complex neural networks for spatial intelligence, like large-scale point cloud reconstruction and 3D generative modeling.

fVDB is the result of a long-running effort by NVIDIA’s research team and is already used to support NVIDIA Research, NVIDIA DRIVE and NVIDIA Omniverse projects that require high-fidelity models of large, complex real-world spaces.

Key Advantages of fVDB

  • Larger: 4x larger spatial scale than prior frameworks
  • Faster: 3.5x faster than prior frameworks
  • Interoperable: Businesses can fully tap into massive real-world datasets. fVDB reads VDB datasets into full-sized 3D environments that are AI-ready and rendered in real time for building physical AI with spatial intelligence.
  • More powerful: 10x more operators than prior frameworks. fVDB simplifies processes by combining functionalities that previously required multiple deep-learning libraries.

fVDB will soon be available as NVIDIA NIM inference microservices. A trio of the microservices will enable businesses to incorporate fVDB into OpenUSD workflows, generating AI-ready OpenUSD geometry in NVIDIA Omniverse, a development platform for industrial digitalization and generative physical AI applications. They are:

  • fVDB Mesh Generation NIM — Generates digital 3D environments of the real world
  • fVDB NeRF-XL NIM — Generates large-scale NeRFs in OpenUSD using Omniverse Cloud APIs
  • fVDB Physics Super-Res NIM — Performs super-resolution to generate an OpenUSD-based, high-resolution physics simulation

Over the past decade, OpenVDB, housed at the Academy Software Foundation, has earned multiple Academy Awards as a core technology used throughout the visual-effects industry. It has since grown beyond entertainment to industrial and scientific uses, like industrial design and robotics.

NVIDIA continues to enhance the open-source OpenVDB library. Four years ago, the company introduced NanoVDB, which added GPU support to OpenVDB. This delivered an order-of-magnitude speed-up, enabling faster performance and easier development, and opening the door to real-time simulation and rendering.

Two years ago, NVIDIA introduced NeuralVDB, which builds machine learning on top of NanoVDB to compress the memory footprint of VDB volumes up to 100x, allowing creators, developers and researchers to interact with extremely large and complex datasets.

fVDB builds AI operators on top of NanoVDB to unlock spatial intelligence at the scale of reality. Apply to the early-access program for the fVDB PyTorch extension. fVDB will also be available as part of the OpenVDB GitHub repository.

Dive deeper into fVDB in this technical blog and watch how accelerated computing and generative AI are transforming industries and creating new opportunities for innovation and growth in NVIDIA founder and CEO Jensen Huang’s two fireside chats at SIGGRAPH.

See notice regarding software product information.

Read More