Efficiently fine-tune the ESM-2 protein language model with Amazon SageMaker

In this post, we demonstrate how to efficiently fine-tune a state-of-the-art protein language model (pLM) to predict protein subcellular localization using Amazon SageMaker.

Proteins are the molecular machines of the body, responsible for everything from moving your muscles to responding to infections. Despite this variety, all proteins are made of repeating chains of molecules called amino acids. The human genome encodes 20 standard amino acids, each with a slightly different chemical structure. These can be represented by letters of the alphabet, which allows us to analyze and explore proteins as text strings. The enormous number of possible protein sequences and structures is what gives proteins their wide variety of uses.

The structure of an amino acid chain

Proteins also play a key role in drug development, both as potential targets and as therapeutics. As shown in the following table, many of the top-selling drugs in 2022 were either proteins (especially antibodies) or other molecules like mRNA, which is translated into proteins in the body. Because of this, many life science researchers need to answer questions about proteins faster, cheaper, and more accurately.

Name | Manufacturer | 2022 Global Sales ($ billions USD) | Indications
Comirnaty | Pfizer/BioNTech | $40.8 | COVID-19
Spikevax | Moderna | $21.8 | COVID-19
Humira | AbbVie | $21.6 | Arthritis, Crohn’s disease, and others
Keytruda | Merck | $21.0 | Various cancers

Data source: Urquhart, L. Top companies and drugs by sales in 2022. Nature Reviews Drug Discovery 22, 260–260 (2023).

Because we can represent proteins as sequences of characters, we can analyze them using techniques originally developed for written language. This includes large language models (LLMs) pretrained on huge datasets, which can then be adapted for specific tasks, like text summarization or chatbots. Similarly, pLMs are pre-trained on large protein sequence databases using unlabeled, self-supervised learning. We can adapt them to predict things like the 3D structure of a protein or how it may interact with other molecules. Researchers have even used pLMs to design novel proteins from scratch. These tools don’t replace human scientific expertise, but they have the potential to speed up pre-clinical development and trial design.

One challenge with these models is their size. Both LLMs and pLMs have grown by orders of magnitude in the past few years, as illustrated in the following figure. This means that it can take a long time to train them to sufficient accuracy. It also means that you need to use hardware, especially GPUs, with large amounts of memory to store the model parameters.

Protein language models, like other large language models, have steadily increased in size for several years

Long training times, plus large instances, equals high cost, which can put this work out of reach for many researchers. For example, in 2023, a research team described training a 100 billion-parameter pLM on 768 A100 GPUs for 164 days! Fortunately, in many cases we can save time and resources by adapting an existing pLM to our specific task. This technique is called fine-tuning, and also allows us to borrow advanced tools from other types of language modeling.

Solution overview

The specific problem we address in this post is subcellular localization: given a protein sequence, can we build a model that predicts whether it lives on the outside (cell membrane) or inside of a cell? This is an important piece of information that can help us understand a protein’s function and whether it would make a good drug target.

We start by downloading a public dataset using Amazon SageMaker Studio. Then we use SageMaker to fine-tune the ESM-2 protein language model using an efficient training method. Finally, we deploy the model as a real-time inference endpoint and use it to test some known proteins. The following diagram illustrates this workflow.

AWS architecture for fine tuning ESM

In the following sections, we go through the steps to prepare your training data, create a training script, and run a SageMaker training job. All of the code featured in this post is available on GitHub.

Prepare the training data

We use part of the DeepLoc-2 dataset, which contains several thousand SwissProt proteins with experimentally determined locations. We filter for high-quality sequences between 100 and 512 amino acids:

import pandas as pd

df = pd.read_csv(
    "https://services.healthtech.dtu.dk/services/DeepLoc-2.0/data/Swissprot_Train_Validation_dataset.csv"
).drop(["Unnamed: 0", "Partition"], axis=1)
df["Membrane"] = df["Membrane"].astype("int32")

# Filter for sequences between 100 and 512 amino acids
df = df[df["Sequence"].apply(len).between(100, 512)]

# Remove unnecessary features
df = df[["Sequence", "Kingdom", "Membrane"]]

Next, we tokenize the sequences and split them into training and evaluation sets:

import os

from datasets import Dataset
from transformers import AutoTokenizer

dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, shuffle=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

def preprocess_data(examples, max_length=512):
    text = examples["Sequence"]
    encoding = tokenizer(text, truncation=True, max_length=max_length)
    encoding["labels"] = examples["Membrane"]
    return encoding

encoded_dataset = dataset.map(
    preprocess_data,
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=dataset["train"].column_names,
)

encoded_dataset.set_format("torch")

Finally, we upload the processed training and evaluation data to Amazon Simple Storage Service (Amazon S3):

train_s3_uri = S3_PATH + "/data/train"
test_s3_uri = S3_PATH + "/data/test"

encoded_dataset["train"].save_to_disk(train_s3_uri)
encoded_dataset["test"].save_to_disk(test_s3_uri)

Create a training script

SageMaker script mode allows you to run your custom training code in optimized machine learning (ML) framework containers managed by AWS. For this example, we adapt an existing script for text classification from Hugging Face. This allows us to try several methods for improving the efficiency of our training job.

Method 1: Weighted training

Like many biological datasets, the DeepLoc data is unevenly distributed, meaning there isn’t an equal number of membrane and non-membrane proteins. We could resample our data and discard records from the majority class. However, this would reduce the total training data and potentially hurt our accuracy. Instead, we calculate the class weights during the training job and use them to adjust the loss.

In our training script, we subclass the Trainer class from transformers with a WeightedTrainer class that takes class weights into account when calculating cross-entropy loss. This helps prevent bias in our model:

import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        self.class_weights = class_weights
        super().__init__(*args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=torch.tensor(self.class_weights, device=model.device)
        )
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
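
The class weights themselves can be computed from the label distribution of the training split before being passed to WeightedTrainer. The following is a minimal sketch using inverse class frequency; treat the exact formula as an illustrative assumption rather than the one used in the original training script:

import numpy as np

# Illustrative: weight each class by its inverse frequency so the minority
# class contributes proportionally more to the loss
labels = np.array(encoded_dataset["train"]["labels"])
counts = np.bincount(labels)
class_weights = (len(labels) / (len(counts) * counts)).tolist()

# class_weights is then passed as WeightedTrainer(class_weights=class_weights, ...)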

Method 2: Gradient accumulation

Gradient accumulation is a training technique that allows models to simulate training on larger batch sizes. Typically, the batch size (the number of samples used to calculate the gradient in one training step) is limited by the GPU memory capacity. With gradient accumulation, the model calculates gradients on smaller batches first. Then, instead of updating the model weights right away, the gradients get accumulated over multiple small batches. When the accumulated gradients equal the target larger batch size, the optimization step is performed to update the model. This lets models train with effectively bigger batches without exceeding the GPU memory limit.

However, extra computation is needed for the smaller batch forward and backward passes. Increased batch sizes via gradient accumulation can slow down training, especially if too many accumulation steps are used. The aim is to maximize GPU usage but avoid excessive slowdowns from too many extra gradient computation steps.
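
To make the mechanics concrete, the following is a small, self-contained PyTorch sketch of the accumulation loop using a toy model and random data. It is not part of the original training script; the Hugging Face Trainer performs the equivalent logic when gradient_accumulation_steps is set:

import torch

# Toy example of gradient accumulation; the Trainer handles this internally
# when gradient_accumulation_steps is set in TrainingArguments
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(8)]

accumulation_steps = 4  # one optimizer step per 4 micro-batches

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(batches):
    # Scale the loss so the accumulated gradient matches one large batch
    loss = loss_fn(model(inputs), labels) / accumulation_steps
    loss.backward()
    # Update the weights only after gradients from several micro-batches
    # have been accumulated
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()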

Method 3: Gradient checkpointing

Gradient checkpointing is a technique that reduces the memory needed during training while keeping the computational time reasonable. Large neural networks take up a lot of memory because they have to store all the intermediate values from the forward pass in order to calculate the gradients during the backward pass. This can cause memory issues. One solution is to not store these intermediate values, but then they have to be recalculated during the backward pass, which takes a lot of time.

Gradient checkpointing provides a balanced approach. It saves only some of the intermediate values, called checkpoints, and recalculates the others as needed. Therefore, it uses less memory than storing everything, but also less computation than recalculating everything. By strategically selecting which activations to checkpoint, gradient checkpointing enables large neural networks to be trained with manageable memory usage and computation time. This important technique makes it feasible to train very large models that would otherwise run into memory limitations.
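
The following toy PyTorch sketch shows the underlying idea with torch.utils.checkpoint: activations inside the checkpointed stage are discarded after the forward pass and recomputed during the backward pass. This is an illustration of the mechanism only, not code from the training script; the Trainer enables the equivalent behavior for the whole model when gradient_checkpointing=True:

import torch
from torch.utils.checkpoint import checkpoint

# Toy two-stage network; the first stage is checkpointed, so its intermediate
# activations are recomputed during the backward pass instead of being stored
stage1 = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU())
stage2 = torch.nn.Linear(64, 2)

x = torch.randn(4, 8, requires_grad=True)
hidden = checkpoint(stage1, x, use_reentrant=False)  # activations not kept
logits = stage2(hidden)
logits.sum().backward()  # stage1 runs forward again here to rebuild activations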

In our training script, we turn on gradient accumulation and gradient checkpointing by adding the necessary parameters to the TrainingArguments object:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/model",  # required by TrainingArguments; SageMaker's model directory
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
)

Method 4: Low-Rank Adaptation of LLMs

Large language models like ESM-2 can contain billions of parameters that are expensive to train and run. Researchers developed a training method called Low-Rank Adaptation (LoRA) to make fine-tuning these huge models more efficient.

The key idea behind LoRA is that when fine-tuning a model for a specific task, you don’t need to update all the original parameters. Instead, LoRA adds new smaller matrices to the model that transform the inputs and outputs. Only these smaller matrices are updated during fine-tuning, which is much faster and uses less memory. The original model parameters stay frozen.

After fine-tuning with LoRA, you can merge the small adapted matrices back into the original model. Or you can keep them separate if you want to quickly fine-tune the model for other tasks without forgetting previous ones. Overall, LoRA allows LLMs to be efficiently adapted to new tasks at a fraction of the usual cost.

In our training script, we configure LoRA using the PEFT library from Hugging Face:

from peft import get_peft_model, LoraConfig, TaskType
import torch
from transformers import EsmForSequenceClassification

model = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    torch_dtype=torch.bfloat16,
    num_labels=2,
)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    bias="none",
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "query",
        "key",
        "value",
        "EsmSelfOutput.dense",
        "EsmIntermediate.dense",
        "EsmOutput.dense",
        "EsmContactPredictionHead.regression",
        "EsmClassificationHead.dense",
        "EsmClassificationHead.out_proj",
    ]
)

model = get_peft_model(model, peft_config)
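
After fine-tuning, the LoRA adapter weights can optionally be folded back into the base model, as described earlier. A short sketch using the PEFT API (the output path is illustrative):

# Optional: merge the LoRA adapter weights back into the base model so it can
# be saved and served like a standard Hugging Face model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model")  # illustrative output path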

Submit a SageMaker training job

After you have defined your training script, you can configure and submit a SageMaker training job. First, specify the hyperparameters:

hyperparameters = {
    "model_id": "facebook/esm2_t33_650M_UR50D",
    "epochs": 1,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "use_gradient_checkpointing": True,
    "lora": True,
}

Next, define what metrics to capture from the training logs:

metric_definitions = [
    {"Name": "epoch", "Regex": "'epoch': ([0-9.]*)"},
    {
        "Name": "max_gpu_mem",
        "Regex": "Max GPU memory use during training: ([0-9.e-]*) MB",
    },
    {"Name": "train_loss", "Regex": "'loss': ([0-9.e-]*)"},
    {
        "Name": "train_samples_per_second",
        "Regex": "'train_samples_per_second': ([0-9.e-]*)",
    },
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.e-]*)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9.e-]*)"},
]
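
The max_gpu_mem metric assumes the training script prints a matching line at the end of the job. A minimal sketch of how such a log line might be emitted (an assumption about the script, shown here to make the regular expression concrete):

import torch

# Print peak GPU memory in a format that matches the "max_gpu_mem" regex above
if torch.cuda.is_available():
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"Max GPU memory use during training: {peak_mb} MB")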

Finally, define a Hugging Face estimator and submit it for training on an ml.g5.2xlarge instance type. This is a cost-effective instance type that is widely available in many AWS Regions:

from sagemaker.experiments.run import Run
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput

hf_estimator = HuggingFace(
    base_job_name="esm-2-membrane-ft",
    entry_point="lora-train.py",
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    output_path=f"{S3_PATH}/output",
    role=sagemaker_execution_role,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    checkpoint_local_path="/opt/ml/checkpoints",
    sagemaker_session=sagemaker_session,
    keep_alive_period_in_seconds=3600,
    tags=[{"Key": "project", "Value": "esm-fine-tuning"}],
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hf_estimator.fit(
        {
            "train": TrainingInput(s3_data=train_s3_uri),
            "test": TrainingInput(s3_data=test_s3_uri),
        }
    )
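
After the job completes, the metrics captured by these regular expressions can be pulled into a pandas DataFrame for comparison. This step isn't shown in the original excerpt; a short sketch using the SageMaker SDK:

# Retrieve the metrics captured by metric_definitions for the latest training job
metrics_df = hf_estimator.training_job_analytics.dataframe()
print(metrics_df[metrics_df["metric_name"] == "eval_accuracy"])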

The following table compares the different training methods we discussed and their effect on the runtime, accuracy, and GPU memory requirements of our job.

Configuration | Billable Time (min) | Evaluation Accuracy | Max GPU Memory Usage (GB)
Base Model | 28 | 0.91 | 22.6
Base + GA | 21 | 0.90 | 17.8
Base + GC | 29 | 0.91 | 10.2
Base + LoRA | 23 | 0.90 | 18.6

All of the methods produced models with high evaluation accuracy. Using LoRA and gradient accumulation decreased the runtime (and cost) by 18% and 25%, respectively. Using gradient checkpointing decreased the maximum GPU memory usage by 55%. Depending on your constraints (cost, time, hardware), one of these approaches may make more sense than another.

Each of these methods performs well by itself, but what happens when we use them in combination? The following table summarizes the results.

Configuration | Billable Time (min) | Evaluation Accuracy | Max GPU Memory Usage (GB)
All methods | 12 | 0.80 | 3.3

In this case, we see a 12% reduction in accuracy. However, we’ve reduced the runtime by 57% and GPU memory use by 85%! This is a massive decrease that allows us to train on a wide range of cost-effective instance types.
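
The workflow in the solution overview also deploys the fine-tuned model as a real-time endpoint and tests it on some known proteins. A minimal sketch of that step, assuming the default Hugging Face inference handler can serve the saved model (the instance type, serializers, and example sequence are illustrative assumptions):

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy the fine-tuned model behind a real-time SageMaker endpoint
predictor = hf_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # illustrative choice
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Test the endpoint with a protein sequence (illustrative example)
print(predictor.predict({"inputs": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}))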

Clean up

If you’re following along in your own AWS account, delete any real-time inference endpoints and data you created to avoid further charges.

predictor.delete_endpoint()

bucket = boto_session.resource("s3").Bucket(S3_BUCKET)
bucket.objects.filter(Prefix=S3_PREFIX).delete()

Conclusion

In this post, we demonstrated how to efficiently fine-tune protein language models like ESM-2 for a scientifically relevant task. For more information about using the Transformers and PEFT libraries to train pLMs, check out the posts Deep Learning With Proteins and ESMBind (ESMB): Low Rank Adaptation of ESM-2 for Protein Binding Site Prediction on the Hugging Face blog. You can also find more examples of using machine learning to predict protein properties in the Awesome Protein Analysis on AWS GitHub repository.


About the Author

Brian Loyal is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 17 years’ experience in biotechnology and machine learning, and is passionate about helping customers solve genomic and proteomic challenges. In his spare time, he enjoys cooking and eating with his friends and family.

Read More

Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso

As visual generative AI matures from research to the enterprise domain, businesses are seeking responsible ways to integrate the technology into their products.

Bria, a startup based in Tel Aviv, is responding with an open platform for visual generative AI that emphasizes model transparency alongside fair attribution and copyright protections. Currently offering models that convert text prompts to images or transform existing images, the company will this year add text-to-video and image-to-video AI.

“Creating generative AI models requires time and expertise,” said Yair Adato, co-founder and CEO of Bria. “We do the heavy lifting so product teams can adopt our models to achieve a technical edge and go to market quickly, without investing as many resources.”

Advertising agencies and retailers can use Bria’s tools to quickly generate visuals for marketing campaigns. And creative studios can adopt the models to develop stock imagery or edit visuals. Dozens of enterprise clients have integrated the startup’s pretrained models or use its application programming interfaces.

Bria develops its models with the NVIDIA NeMo framework, which is available on NGC, NVIDIA’s hub for accelerated software. The company uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput, low-latency image generation. It’s also adopting NVIDIA Picasso, a foundry for visual generative AI models, to run inference.

“We were looking for a framework to train our models efficiently — one that would minimize compute cost while scaling AI training to more quickly reach model convergence,” said Misha Feinstein, vice president of research and development at Bria. “NeMo features optimization techniques that allow us to maximize the GPUs’ performance during both training and inference.”

Creative Solutions to Creative Challenges

Bria, founded in 2020, offers flexible options for enterprises adopting visual generative AI. By adopting Bria’s platform, its customers can gain a competitive edge by creating visual content at scale while retaining control of their data and technology. Developers can access its pretrained models through APIs or by directly licensing the source code and model weights for further fine-tuning.

“We want to build a company where we respect privacy, content ownership, data ownership and copyright,” said Adato. “To create a healthy, sustainable industry, it’s important to incentivize individuals to keep creating and innovating.”

Adato likens Bria’s attribution program to a music streaming service that pays artists each time one of their songs is played. It’s required for all customers who use Bria’s models — even if they further train and fine-tune the model on their own.

Using licensed datasets provides additional benefits: the Bria team doesn’t need to spend time cleaning the data or sorting out inappropriate content and misinformation.

A Growing Suite of NVIDIA-Accelerated Models

Bria offers two versions of its text-to-image model. One is latency-optimized to rapidly accomplish tasks like image background generation. The other offers higher image resolution. Additional foundation models enable super-resolution, object removal, object generation, inpainting and outpainting.

The company is working to continuously increase the resolution of its generated images, further reduce latency and develop domain-specific models for industries such as ecommerce and stock imagery. Inference is accelerated by the NVIDIA Triton Inference Server software and the NVIDIA TensorRT software development kit.

“We’re running on NVIDIA frameworks, hardware and software,” said Feinstein. “NVIDIA experts have helped us optimize these tools for our needs — we would probably run much slower without their help.”

To keep up with the latest hardware and networking infrastructure, Bria uses cloud computing resources: NVIDIA H100 Tensor Core GPUs for AI training and a variety of NVIDIA Tensor Core GPUs for inference.

Bria is a member of NVIDIA Inception, a program that provides startups with technological support and AI platform guidance. Visit Bria in the Inception Pavilion at NVIDIA GTC, running March 18-21 in San Jose and online.

To train optimized text-to-image models, check out the NeMo Multimodal user guide and GitHub repository. NeMo Multimodal is also available as part of the NeMo container on NGC.

Read More

AI Decoded: Demystifying AI and the Hardware, Software and Tools That Power It

With the 2018 launch of RTX technologies and the first consumer GPU built for AI — GeForce RTX — NVIDIA accelerated the shift to AI computing. Since then, AI on RTX PCs and workstations has grown into a thriving ecosystem with more than 100 million users and 500 AI applications.

Generative AI is now ushering in a new wave of capabilities from PC to cloud. And NVIDIA’s rich history and expertise in AI is helping ensure all users have the performance to handle a wide range of AI features.

Users at home and in the office are already taking advantage of AI on RTX with productivity- and entertainment-enhancing software. Gamers feel the benefits of AI on GeForce RTX GPUs with higher frame rates at stunning resolutions in their favorite titles. Creators can focus on creativity, instead of watching spinning wheels or repeating mundane tasks. And developers can streamline workflows using generative AI for prototyping and to automate debugging.

The field of AI is moving fast. As research advances, AI will tackle more complex tasks. And the demanding performance needs will be handled by RTX.

What Is AI?

In its most fundamental form, artificial intelligence is a smarter type of computing. It’s the capability of a computer program or a machine to think, learn and take actions without being explicitly coded with commands to do so, or a user having to control each command.

AI can be thought of as the ability for a device to perform tasks autonomously, by ingesting and analyzing enormous amounts of data, then recognizing patterns in that data — often referred to as being “trained.”

AI development is always oriented around developing systems that perform tasks that would otherwise require human intelligence, and often significant levels of input, to complete — only at speeds beyond any individual’s or group’s capabilities. For this reason, AI is broadly seen as both disruptive and highly transformational.

A key benefit of AI systems is the ability to learn from experiences or patterns inside data, adjusting conclusions on their own when fed new inputs or data. This self-learning allows AI systems to accomplish a stunning variety of tasks, including image recognition, speech recognition, language translation, medical diagnostics, car navigation, image and video enhancement, and hundreds of other use cases.

The next step in the evolution of AI is content generation — referred to as generative AI. It enables users to quickly create new content, and iterate on it, based on a variety of inputs, which can include text, images, sounds, animation, 3D models or other types of data. It then generates new content in the same or a new form.

Popular language applications, like the cloud-based ChatGPT, allow users to generate long-form copy based on a short text request. Image generators like Stable Diffusion turn descriptive text inputs into the desired image. New applications are turning text into video and 2D images into 3D renderings.

GeForce RTX AI PCs and NVIDIA RTX Workstations

AI PCs are computers with dedicated hardware designed to help AI run faster. It’s the difference between sitting around waiting for a 3D image to load, and seeing it update instantaneously with an AI denoiser.

On RTX GPUs, these specialized AI accelerators are called Tensor Cores. And they dramatically speed up AI performance across the most demanding applications for work and play.

One way that AI performance is measured is in teraops, or trillion operations per second (TOPS). Similar to an engine’s horsepower rating, TOPS can give users a sense of a PC’s AI performance with a single metric. The current generation of GeForce RTX GPUs offers performance options that range from roughly 200 AI TOPS all the way to over 1,300 TOPS, with many options across laptops and desktops in between. Professionals get even higher AI performance with the NVIDIA RTX 6000 Ada Generation GPU.

To put this in perspective, the current generation of AI PCs without GPUs range from 10 to 45 TOPS.

More and more types of AI applications will require the benefits of having a PC capable of performing certain AI tasks locally — meaning on the device rather than running in the cloud. Benefits of running on an AI PC include that computing is always available, even without an internet connection; systems offer low latency for high responsiveness; and increased privacy so that users don’t have to upload sensitive materials to an online database before it becomes usable by an AI.

AI for Everyone

RTX GPUs bring more than just performance. They introduce capabilities only possible with RTX technology. Many of these AI features are accessible — and impactful — to millions, regardless of the individual’s skill level.

From AI upscaling to improved video conferencing to intelligent, personalizable chatbots, there are tools to benefit all types of users.

RTX Video uses AI to upscale streaming video and display it in HDR, bringing lower-resolution, standard-dynamic-range video up to vivid 4K high dynamic range. RTX users can enjoy the feature with one-time, one-click enablement on nearly any video streamed in a Chrome or Edge browser.

NVIDIA Broadcast, a free app for RTX users with a straightforward user interface, has a host of AI features that improve video conferencing and livestreaming. It removes unwanted background sounds like clicky keyboards, vacuum cleaners and screaming children with Noise and Echo Removal. It can replace or blur backgrounds with better edge detection using Virtual Background. It smooths low-quality camera images with Video Noise Removal. And it can stay centered on the screen with eyes looking at the camera no matter where the user moves, using Auto Frame and Eye Contact.

Chat with RTX is a local, personalized AI chatbot demo that’s easy to use and free to download.

The tech demo, originally released in January, will get an update with Google’s Gemma soon.

Users can easily connect local files on a PC to a supported large language model simply by dropping files into a single folder and pointing the demo to the location. It enables queries for quick, contextually relevant answers.

Since Chat with RTX runs locally on Windows with GeForce RTX PCs and NVIDIA RTX workstations, results are fast — and the user’s data stays on the device. Rather than relying on cloud-based services, Chat with RTX lets users process sensitive data on a local PC without the need to share it with a third party or have an internet connection.

AI for Gamers

Over the past six years, game performance has seen the greatest leaps with AI acceleration. Gamers have been turning NVIDIA DLSS on since 2019, boosting frame rates and improving image quality. It’s a technique that uses AI to generate pixels in video games automatically. With ongoing improvements, it now increases frame rates by up to 4x.

And with the introduction of Ray Reconstruction in the latest version, DLSS 3.5, visual quality is further enhanced in some of the world’s top titles, setting a new standard for visually richer and more immersive gameplay.

There are now over 500 games and applications that have revolutionized the ways people play and create with ray tracing, DLSS and AI-powered technologies.

Beyond frames, AI is set to improve the way gamers interact with characters and remaster classic games.

NVIDIA ACE microservices — including generative AI-powered speech and animation models — are enabling developers to add intelligent, dynamic digital avatars to games. Demonstrated at CES, ACE won multiple awards for its ability to bring game characters to life as a glimpse into the future of PC gaming.

NVIDIA RTX Remix, a platform for modders to create stunning RTX remasters of classic games, delivers generative AI tools that can transform basic textures from classic games into modern, 4K-resolution, physically based rendering materials. Several projects have already been released or are in the works, including Half-Life 2 RTX and Portal with RTX.

AI for Creators

AI is unlocking creative potential by reducing or automating tedious tasks, freeing up time for pure creativity. These features run fastest or solely on PCs with NVIDIA RTX or GeForce RTX GPUs.

Adobe Premiere Pro’s AI-powered Enhance Speech tool removes unwanted noise and improves dialogue quality.

Adobe Premiere Pro’s Enhance Speech tool is accelerated by RTX, using AI to remove unwanted noise and improve the quality of dialogue clips so they sound professionally recorded. It’s up to 4.5x faster on RTX vs. Mac. Another Premiere feature, Auto Reframe, uses GPU acceleration to identify and track the most relevant elements in a video and intelligently reframes video content for different aspect ratios.

Another time-saving AI feature for video editors is DaVinci Resolve’s Magic Mask. Previously, if editors needed to adjust the color/brightness of a subject in one shot or remove an unwanted object, they’d have to use a combination of rotoscoping techniques or basic power windows and masks to isolate the subject from the background.

Magic Mask has completely changed that workflow. With it, simply draw a line over the subject and the AI will process for a moment before revealing the selection. And GeForce RTX laptops can run the feature 2.5x faster than the fastest non-RTX laptops.

This is just a sample of the ways that AI is increasing the speed of creativity. There are now more than 125 AI applications accelerated by RTX.

AI for Developers

AI is enhancing the way developers build software applications through scalable environments, hardware and software optimizations, and new APIs.

NVIDIA AI Workbench helps developers quickly create, test and customize pretrained generative AI models and LLMs using PC-class performance and memory footprint. It’s a unified, easy-to-use toolkit that can scale from running locally on RTX PCs to virtually any data center, public cloud or NVIDIA DGX Cloud.

After building AI models for PC use cases, developers can optimize them using NVIDIA TensorRT — the software that helps developers take full advantage of the Tensor Cores in RTX GPUs.

TensorRT acceleration is now available in text-based applications with TensorRT-LLM for Windows. The open-source library increases LLM performance and includes pre-optimized checkpoints for popular models, including Google’s Gemma, Meta Llama 2, Mistral and Microsoft Phi-2.

Developers also have access to a TensorRT-LLM wrapper for the OpenAI Chat API. With just one line of code change, continue.dev — an open-source autopilot for VS Code and JetBrains that taps into an LLM — can use TensorRT-LLM locally on an RTX PC for fast, local LLM inference using this popular tool.

Every week, we’ll demystify AI by making the technology more accessible, and we’ll showcase new hardware, software, tools and accelerations for RTX AI PC users.

The iPhone moment of AI is here, and it’s just the beginning. Welcome to AI Decoded.

Get weekly updates directly in your inbox by subscribing to the AI Decoded newsletter.

Read More

The Magic Behind the Screen: Celebrating the 96th Academy Awards Nominees for Best Visual Effects

The 96th Academy Awards nominees for Best Visual Effects are a testament to the incredible technological advancements pushing the boundaries of what’s possible in film.

Whether showcasing colossal destruction scenes, heart-pumping action sequences or interstellar adventures, each nominee demonstrates unique contributions in visual effects, or VFX — and they all used cutting-edge NVIDIA technologies in their workflows to bring their magic to the screen.

This year’s nominees include:

  • The Creator (20th Century Studios) — Jay Cooper, Ian Comley, Andrew Roberts and Neil Corbould
  • Godzilla: Minus One (Toho) — Takashi Yamazaki, Kiyoko Shibuya, Masaki Takahashi and Tatsuji Nojima
  • Guardians of the Galaxy Vol. 3 (Marvel Studios) — Stephane Ceretti, Alexis Wajsbrot, Guy Williams and Theo Bialek
  • Napoleon (Apple Original Films/Sony Pictures) — Charley Henley, Luc-Ewen Martin-Fenouillet, Simone Coco and Neil Corbould
  • Mission: Impossible – Dead Reckoning Part One (Paramount Pictures) — Alex Wuttke, Simone Coco, Jeff Sutherland and Neil Corbould

Reinventing the Monster Movie

Godzilla: Minus One presented a unique challenge: making a well-known giant monster, or kaijū, feel terrifying anew.

With a budget under $15 million, small by today’s standards, the film’s VFX team relied on rapid iterations with the director to eliminate long review cycles, along with a heavily detailed computer-generated imagery (CGI) model to bring Godzilla to life.

Godzilla was ready for its closeup, the monster’s head alone containing over 200 million polygons. The animators injected nuanced, lifelike behaviors into the creature to round out its performance.

In addition, the film’s destruction scenes used a sophisticated, memory-intensive physics engine, allowing for realistic simulations of crumbling buildings and landscapes under destruction to further immerse audiences in the chaos.

A Cosmic Spectacle

Guardians of the Galaxy Vol. 3 continued the series’s tradition of blending humor with breathtaking cosmic visuals. This installment pushed the envelope with its use of real-time rendering, enabling its artists to visualize complex space environments and characters on set.

The film brought together Wētā FX, Framestore and Sony Pictures Imageworks, among others, to create a whopping 3,000+ VFX shots. The dense, immersive 3D environments allowed for a seamless integration of live-action and CGI elements and characters, resulting in a visually stunning space opera that maintained the series’ signature style while exploring new visual territories.

One of Guardians’s greatest achievements is the hallway fight scene filmed at 120 frames per second and delivered as a single continuous shot with variable speed ramps and nonstop action.

Epic Storytelling Through Detailed VFX

The historical epic Napoleon was brought to life with meticulous attention to detail and scale. The film used various set extensions and practical effects to recreate the vast battlefields and period-specific architecture of early 19th-century Europe.

Advanced crowd simulation was used to depict the massive armies of Napoleon’s time, each soldier animated with individual behaviors to enhance the battle scenes’ realism. These touches, combined with high-resolution textures and dynamic lighting, created a visually compelling narrative grounded in reality.

Exploring AI’s Boundaries

The Creator explored the themes of AI and virtual reality, requiring VFX that could realistically depict advanced technology and digital worlds.

The film made significant use of CG animation and visual effects to create environments both futuristic and plausible. Director Gareth Edwards, also known for Rogue One and Godzilla (2014), has been widely applauded for delivering a film with the look of an expensive summer blockbuster using a fraction of the typical budget.

The portrayal of AI entities involved a combination of motion-capture and procedural animation to create characters that moved and interacted with human-level complexity and fluidity. The VFX team developed custom software to simulate the intricate patterns of digital consciousness, blurring the lines between the virtual and the real.

High-Octane Action Meets Precision VFX

For Mission: Impossible – Dead Reckoning Part One, the visual effects team faced the challenge of enhancing the film’s signature action sequences without detracting from the series’s reputation for practical stunts. To achieve this, they took a hybrid approach, using CGI to seamlessly augment practical effects.

High-speed drone footage integrated with CG elements created breathtaking chase scenes, while advanced compositing techniques added layers of detail and depth to explosions and hand-to-hand combat scenes, elevating the film’s action to new heights.

NVIDIANs at the SciTech Awards

NVIDIA’s Christopher Jon Horvath, joined by Steve LaVietes and Joe Ardent, on stage to accept their award.

The Academy Awards for Scientific and Technical Achievements highlight technical contributions that have significantly affected the way movies are made, as well as the brilliant inventors behind them.

OpenUSD was honored in the science and engineering subcategory for its importance as the first open-source scene description framework that streamlines the entire production workflow. Its innovative layering system and efficient crate file format have established it as the de facto standard for 3D scene interchange, facilitating unparalleled collaboration across the industry. 

The science and engineering subcategory also celebrated other remarkable technologies, including the OpenVDB open-source library, for sparse 3D volumes, which has become an industry standard for visual-effects simulations and renderings of water, fire, smoke and clouds.

Initially created in 2009 by Ken Museth, senior director of physics research at NVIDIA, OpenVDB has been further developed by Museth, Peter Cucka and Mihai Aldén. Learn more about the latest advancements in OpenVDB including NanoVDB and NeuralVDB.

In addition, the Alembic Caching and Interchange system, developed by Lucas Miller, NVIDIA’s Christopher Jon Horvath, Steve LaVietes and Joe Ardent, received recognition for its efficient algorithms in storing and retrieving baked, time-sampled data, facilitating high-efficiency caching and scene sharing across the digital production pipeline.

OpenVDB and Alembic are both interoperable with OpenUSD, enhancing their utility and integration within the industry’s production workflows.

See How Oscar-Nominated VFX Are Created at GTC

Learn more about visual effects, AI, virtual production and animation at NVIDIA GTC, a global AI conference taking place March 18-21 at the San Jose Convention Center and online. Register to hear from industry luminaries creating stunning visuals in film and TV.

Academy Award-winner Ken Museth will present a session, Open-Source Software for Visual Effects: OpenUSD and OpenVDB, on Monday, March 18, at 9 a.m. PT.

And join us for OpenUSD Day to learn how to build generative AI-enabled 3D pipelines and tools using Universal Scene Description. Browse the full list of media and entertainment sessions at GTC.

Featured image courtesy of Toho Co., Ltd.

Read More

Orca-Math: Demonstrating the potential of SLMs with model specialization

Our work on Orca and Orca 2 demonstrated the power of improved training signals and methods to enhance the reasoning abilities of smaller language models, getting closer to the levels found in much larger language models. Orca-Math is another step in this direction, where we explore the capabilities of small language models (SLMs) when specialized in a certain area, in this case solving grade school math problems, which has long been recognized as a complex task for SLMs.

Orca-Math is a 7-billion-parameter model created by fine-tuning the Mistral 7B model. Orca-Math achieves 86.81% on GSM8k pass@1, exceeding the performance of much bigger models, including general models (e.g. LLAMA-2-70, Gemini Pro and GPT-3.5) and math-specific models (e.g. MetaMath-70B and WizardMath-70B). Note that the base model (Mistral-7B) achieves 37.83% on GSM8K.

Bar graph comparing the GSM8K scores of LLAMA-2-70, GPT-3.5, Gemini Pro, WizardMath-70B, MetaMath-70B, and Orca-Math-7B, showing that Orca-Math-7B outperforms the other, larger models.

The state-of-the-art (SOTA) performance of the Orca-Math model can be attributed to two key insights:

  • Training on high-quality synthetic data with 200,000 math problems, created using multi-agent flows (AutoGen). This is smaller than other math datasets, which can have millions of problems. The smaller model and smaller dataset mean faster and cheaper training.
  • In addition to traditional supervised fine-tuning, the model was trained using an iterative learning process, where the model is allowed to practice solving problems and continues to improve based on feedback from a teacher.

Our findings show that smaller models are valuable in specialized settings, where they can match the performance of much larger models while also highlighting the potential of continual learning and using feedback to improve language models. We are making the dataset publicly available, along with a report describing the training procedure to encourage research on the improvement and specialization of smaller language models.

Teaching SLMs math

Solving mathematical word problems has long been recognized as a complex task for SLMs. Models that achieve over 80% accuracy on the GSM8K benchmark (GSM8K, which stands for Grade School Math 8K, is a dataset of 8,500 high-quality grade school mathematical word problems that require multi-step reasoning) typically exceed 30 billion parameters.

To reach higher levels of performance with smaller models, researchers often train SLMs to generate code, or use calculators to help avoid calculation errors. Additionally, they employ a technique called ensembling, in which the model is called up to 100 times, with each call reattempting to solve the problem. Ensembling provides a substantial boost in accuracy, but at a significant increase in compute cost due to the multiple calls to the model.

This research aims to explore how far we can push the native ability of smaller language models when they are specialized to solve math problems, without the use of external tools, verifiers or ensembling. More specifically, we focus on two directions:

AgentInstruct

Previous work on synthetic data creation often uses frontier models to generate similar problems based on a seed problem. Providing paraphrases of the seed with different numbers and attributes can be useful for creating training data for the smaller model. We propose employing multi-agent flows, using AutoGen, to create new problems and solutions, which can not only create more demonstrations of the problem but also increase the diversity and range of difficulty of the problems. 

To generate more challenging problems, we create a setup with a team of agents working collaboratively to create a dataset geared toward a predefined objective. For example, we can use two agents, namely Suggester and Editor. The Suggester examines a problem and proposes several methods for increasing its complexity, while the Editor takes the original word problem and the Suggester’s recommendations to generate an updated, more challenging problem. This iterative process can occur over multiple rounds, with each round further increasing the complexity of the previously generated problem. A third agent can then verify that the problem is solvable and create the solution.

Iterative learning

Using high-quality training data that may elicit richer learning signals (e.g. explanations) has been shown to significantly improve SLMs’ abilities in acquiring skills that had previously emerged only at much larger scale.

This paradigm fits under a teacher-student approach where the large model (the teacher) creates demonstrations for the SLM (the student) to learn from. In this work, we extend the teacher-student paradigm to iterative learning settings as follows:

  • Teaching by demonstration: In this stage, we train the SLM by using AgentInstruct to demonstrate problems and their solutions.
  • Practice and feedback: We let the SLM practice solving problems on its own. For every problem, we allow the SLM to create multiple solutions. We then use the teacher model to provide feedback on the SLM solutions. If the SLM is unable to correctly solve the problem, even after multiple attempts, we use a solution provided by the teacher.
  • Iterative improvement: We use the teacher feedback to create preference data showing the SLM both good and bad solutions to the same problem, and then retrain the SLM.

The practice, feedback, and iterative improvement steps can be repeated multiple times.

Conclusion

Our findings show that smaller models are valuable in specialized settings where they can match the performance of much larger models but with a limited scope. By training Orca-Math on a small dataset of 200,000 math problems, we have achieved performance levels that rival or surpass those of much larger models.

The relatively small size of the dataset also shows the potential of using multi-agent flows to simulate the process of data and feedback generation. The small dataset size has implications for the cost of training and highlights that training data with richer learning signals can improve the efficiency of the learning process. Our findings also highlight the potential of continual learning and the improvement of language models, where the model iteratively improves as it receives more feedback from a person or another model.

The post Orca-Math: Demonstrating the potential of SLMs with model specialization appeared first on Microsoft Research.

Read More

VeCLIP: Improving CLIP Training via Visual-enriched Captions

Paper abstract: Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods utilizing large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for noisy caption rewriting. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed…

Apple Machine Learning Research

Alida gains deeper understanding of customer feedback with Amazon Bedrock

This post is co-written with Sherwin Chu from Alida.

Alida helps the world’s biggest brands create highly engaged research communities to gather feedback that fuels better customer experiences and product innovation.

Alida’s customers receive tens of thousands of engaged responses for a single survey, so the Alida team opted to use machine learning (ML) to serve their customers at scale. However, when using traditional natural language processing (NLP) models, they found that these solutions struggled to fully understand the nuanced feedback found in open-ended survey responses. The models often captured only surface-level topics and sentiment, and missed crucial context that would allow for more accurate and meaningful insights.

In this post, we learn about how Anthropic’s Claude Instant model on Amazon Bedrock enabled the Alida team to quickly build a scalable service that more accurately determines the topic and sentiment within complex survey responses. The new service achieved a 4-6 times improvement in topic assertion by tightly clustering on several dozen key topics vs. hundreds of noisy NLP keywords.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Using Amazon Bedrock allowed Alida to bring their service to market faster than if they had used other machine learning (ML) providers or vendors.

The challenge

Surveys with a combination of multiple-choice and open-ended questions allow market researchers to get a more holistic view by capturing both quantitative and qualitative data points.

Multiple-choice questions are easy to analyze at scale, but lack nuance and depth. Set response options may also lead to biasing or priming participant responses.

Open-ended survey questions allow responders to provide context and unanticipated feedback. These qualitative data points deepen researchers’ understanding beyond what multiple-choice questions can capture alone. The challenge with the free-form text is that it can lead to complex and nuanced answers that are difficult for traditional NLP to fully understand. For example:

“I recently experienced some of life’s hardships and was really down and disappointed. When I went in, the staff were always very kind to me. It’s helped me get through some tough times!”

Traditional NLP methods will identify topics as “hardships,” “disappointed,” “kind staff,” and “get through tough times.” It can’t distinguish between the responder’s overall current negative life experiences and the specific positive store experiences.

Alida’s existing solution automatically processes large volumes of open-ended responses, but they wanted their customers to gain better contextual comprehension and high-level topic inference.

Amazon Bedrock

Prior to the introduction of LLMs, the way forward for Alida to improve upon their existing single-model solution was to work closely with industry experts and develop, train, and refine new models specifically for each of the industry verticals that Alida’s customers operated in. This was both a time- and cost-intensive endeavor.

One of the breakthroughs that make LLMs so powerful is the use of attention mechanisms. LLMs use self-attention mechanisms that analyze the relationships between words in a given prompt. This allows LLMs to better handle the topic and sentiment in the earlier example and presents an exciting new technology that can be used to address the challenge.

With Amazon Bedrock, teams and individuals can immediately start using foundation models without having to worry about provisioning infrastructure or setting up and configuring ML frameworks. You can get started with the following steps:

  1. Verify that your user or role has permission to create or modify Amazon Bedrock resources. For details, see Identity-based policy examples for Amazon Bedrock.
  2. Log in to the Amazon Bedrock console.
  3. On the Model access page, review the EULA and enable the FMs you’d like in your account.
  4. Start interacting with the FMs, for example through the Amazon Bedrock console playgrounds or the API.

Alida’s executive leadership team was eager to be an early adopter of Amazon Bedrock because they recognized its ability to help their teams bring new generative AI-powered solutions to market faster.

Vincy William, the Senior Director of Engineering at Alida who leads the team responsible for building the topic and sentiment analysis service, says,

“LLMs provide a big leap in qualitative analysis and do things (at a scale that is) humanly not possible to do. Amazon Bedrock is a game changer, it allows us to leverage LLMs without the complexity.”

The engineering team experienced the immediate ease of getting started with Amazon Bedrock. They could select from various foundation models and start focusing on prompt engineering instead of spending time on right-sizing, provisioning, deploying, and configuring resources to run the models.

Solution overview

Sherwin Chu, Alida’s Chief Architect, shared Alida’s microservices architecture approach. Alida built the topic and sentiment classification as a service with survey response analysis as its first application. With this approach, common LLM implementation challenges such as the complexity of managing prompts, token limits, request constraints, and retries are abstracted away, and the solution allows for consuming applications to have a simple and stable API to work with. This abstraction layer approach also enables the service owners to continually improve internal implementation details and minimize API-breaking changes. Finally, the service approach allows for a single point to implement any data governance and security policies that evolve as AI governance matures in the organization.

The following diagram illustrates the solution architecture and flow.

Alida microservice architecture

Alida evaluated LLMs from various providers, and found Anthropic’s Claude Instant to be the right balance between cost and performance. Working closely with the prompt engineering team, Chu advocated to implement a prompt chaining strategy as opposed to a single monolith prompt approach.

Prompt chaining enables you to do the following:

  • Break down your objective into smaller, logical steps
  • Build a prompt for each step
  • Provide the prompts sequentially to the LLM

This creates additional points of inspection, which has the following benefits:

  • It’s straightforward to systematically evaluate changes you make to the input prompt
  • You can implement more detailed tracking and monitoring of the accuracy and performance at each step

Key considerations with this strategy include the increase in the number of requests made to the LLM and the resulting increase in the overall time it takes to complete the objective. For Alida’s use case, they chose to batch a collection of open-ended responses in a single prompt to the LLM to offset these effects.
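
The following is a minimal sketch of what such a chained, batched call flow could look like with Claude Instant on Amazon Bedrock using boto3. The prompts, topic list, and parsing here are illustrative assumptions for demonstration, not Alida's actual implementation:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def invoke_claude(prompt, max_tokens=512):
    # Claude Instant on Amazon Bedrock uses the text-completions request format
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": 0,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-instant-v1", body=body
    )
    return json.loads(response["body"].read())["completion"]

# A batch of open-ended responses sent together in one prompt
responses = [
    "I almost exclusively order my drinks through the app...",
    "The app works pretty good, the only complaint I have is...",
]
numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))

# Step 1: assign each response to one of a predefined set of topics
topics = invoke_claude(
    "Assign each survey response below to one of these topics: "
    "Mobile Ordering Convenience, Mobile Order Fulfillment Speed, Other.\n"
    + numbered
)

# Step 2: chain the topic assignments into a follow-up sentiment prompt
sentiments = invoke_claude(
    "For each response and assigned topic below, label the sentiment as "
    "positive, negative, or neutral.\n" + topics
)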

NLP vs. LLM

Alida’s existing NLP solution relies on clustering algorithms and statistical classification to analyze open-ended survey responses. When applied to sample feedback for a coffee shop’s mobile app, it extracted topics based on word patterns but lacked true comprehension. The following table includes some examples comparing NLP responses vs. LLM responses.

Survey Response | Existing Traditional NLP (Topics) | Amazon Bedrock with Claude Instant (Topic / Sentiment)
I almost exclusively order my drinks through the app bc of convenience and it’s less embarrassing to order super customized drinks lol. And I love earning rewards! | [‘app bc convenience’, ‘drink’, ‘reward’] | Mobile Ordering Convenience / positive
The app works pretty good the only complaint I have is that I can’t add Any number of money that I want to my gift card. Why does it specifically have to be $10 to refill?! | [‘complaint’, ‘app’, ‘gift card’, ‘number money’] | Mobile Order Fulfillment Speed / negative

The example results show how the existing solution was able to extract relevant keywords, but wasn’t able to achieve a more generalized topic group assignment.

In contrast, using Amazon Bedrock and Anthropic Claude Instant, the LLM with in-context training is able to assign the responses to pre-defined topics and assign sentiment.

In addition to delivering better answers for Alida’s customers, for this particular use case, pursuing a solution using an LLM over traditional NLP methods saved a vast amount of time and effort in training and maintaining a suitable model. The following table compares training a traditional NLP model vs. in-context training of an LLM.

Approach | Data Requirement | Training Process | Model Adaptability
Training a traditional NLP model | Thousands of human-labeled examples | Combination of automated and manual feature engineering; iterative train and evaluate cycles | Slower turnaround due to the need to retrain the model
In-context training of an LLM | Several examples | Trained on the fly within the prompt; limited by context window size | Faster iterations by modifying the prompt; limited retention due to context window size

Conclusion

Alida’s use of Anthropic’s Claude Instant model on Amazon Bedrock demonstrates the powerful capabilities of LLMs for analyzing open-ended survey responses. Alida was able to build a superior service that was 4-6 times more precise at topic analysis when compared to their NLP-powered service. Additionally, using in-context prompt engineering for LLMs significantly reduced development time, because they didn’t need to curate thousands of human-labeled data points to train a traditional NLP model. This ultimately allows Alida to give their customers richer insights sooner!

If you’re ready to start building your own foundation model innovation with Amazon Bedrock, check out Set up Amazon Bedrock. If you’re interested in reading about other intriguing Amazon Bedrock applications, see the Amazon Bedrock specific section of the AWS Machine Learning Blog.


About the authors

Kinman Lam is an ISV/DNB Solution Architect for AWS. He has 17 years of experience in building and growing technology companies in the smartphone, geolocation, IoT, and open source software space. At AWS, he uses his experience to help companies build robust infrastructure to meet the increasing demands of growing businesses, launch new products and services, enter new markets, and delight their customers.

Sherwin Chu is the Chief Architect at Alida, helping product teams with architectural direction, technology choice, and complex problem-solving. He is an experienced software engineer, architect, and leader with over 20 years in the SaaS space for various industries. He has built and managed numerous B2B and B2C systems on AWS and GCP.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML and generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, AWS’ flagship generative AI offering for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.

Read More

Robo Rendezvous: Robotics Innovators and AI Leaders to Converge at NVIDIA GTC

Bringing together pioneers in robotics and AI, NVIDIA GTC will be a state-of-the-art showcase of applied AI for autonomous machines.

The conference, running March 18-21 at the San Jose Convention Center and online, boasts a star-studded lineup. This includes a fireside chat with Marc Raibert, executive director of The AI Institute, and Dieter Fox, senior director of robotics research at NVIDIA, as well as panels featuring heavyweights like Disney, Google DeepMind and Amazon, alongside insights from NVIDIA stalwarts like Senior Research Scientist Jim Fan.

With over 77 ecosystem partners and more than 25 partner robots, from industrial giants to entertainment bots, GTC is where the future of robotics unfolds.

Attendees will be able to explore the convergence of AI and robotics through dynamic displays in the AI at the Edge pavilion, the Metropolis pavilion and demo areas, featuring the latest robot arms, robotic vision systems and high-accuracy 3D scanning systems.

These demonstrations provide compelling examples of how AI seamlessly enhances human capabilities across diverse industries. Groundbreaking demos using large language models for real-world applications will push the boundaries of human-machine interaction.

Here are a few of the conference’s must-see robotics events:

Plus, a special session with Deepu Talla, vice president of robotics and edge computing, about “AI Robotics: Driving Innovation for the Future of Automation” was just added to the GTC catalog.

This year’s GTC also offers 40 hands-on labs, providing attendees with an immersive experience of the practical applications of these technologies.

A Jetson and Robotics Developer Day will be held on Thursday, March 21, featuring a full day of sessions and panels that dive deep into building next-gen AI-powered robotics and edge applications on the NVIDIA Jetson, Isaac and Metropolis platforms.

Over the past decade, GTC has been where advances in computer graphics, deep learning and generative AI were launched. As industries from agriculture to manufacturing are transformed by these technologies, this year’s event will offer a glimpse into the innovations that will soon define our daily lives.

Register for GTC to secure your spot at the forefront of technology’s next leap. 

Read More