July 2024 – Page 14

Mile-High AI: NVIDIA Research to Present Advancements in Simulation and Gen AI at SIGGRAPH

NVIDIA is taking an array of advancements in rendering, simulation and generative AI to SIGGRAPH 2024, the premier computer graphics conference, which will take place July 28 – Aug. 1 in Denver.

More than 20 papers from NVIDIA Research introduce innovations advancing synthetic data generators and inverse rendering tools that can help train next-generation models. NVIDIA’s AI research is making simulation better by boosting image quality and unlocking new ways to create 3D representations of real or imagined worlds.

The papers focus on diffusion models for visual generative AI, physics-based simulation and increasingly realistic AI-powered rendering. They include two technical Best Paper Award winners and collaborations with universities across the U.S., Canada, China, Israel and Japan as well as researchers at companies including Adobe and Roblox.

These initiatives will help create tools that developers and businesses can use to generate complex virtual objects, characters and environments. Synthetic data generation can then be harnessed to tell powerful visual stories, aid scientists’ understanding of natural phenomena or assist in simulation-based training of robots and autonomous vehicles.

Diffusion Models Improve Texture Painting, Text-to-Image Generation

Diffusion models, a popular tool for transforming text prompts into images, can help artists, designers and other creators rapidly generate visuals for storyboards or production, reducing the time it takes to bring ideas to life.

Two NVIDIA-authored papers are advancing the capabilities of these generative AI models.

ConsiStory, a collaboration between researchers at NVIDIA and Tel Aviv University, makes it easier to generate multiple images with a consistent main character — an essential capability for storytelling use cases such as illustrating a comic strip or developing a storyboard. The researchers’ approach introduces a technique called subject-driven shared attention, which reduces the time it takes to generate consistent imagery from 13 minutes to around 30 seconds.

Panels of multiple AI-generated images featuring the same character — ConsiStory is capable of generating a series of images featuring the same character.

NVIDIA researchers last year won the Best in Show award at SIGGRAPH’s Real-Time Live event for AI models that turn text or image prompts into custom textured materials. This year, they’re presenting a paper that applies 2D generative diffusion models to interactive texture painting on 3D meshes, enabling artists to paint in real time with complex textures based on any reference image.

Kick-Starting Developments in Physics-Based Simulation

Graphics researchers are narrowing the gap between physical objects and their virtual representations with physics-based simulation — a range of techniques to make digital objects and characters move the same way they would in the real world.

Several NVIDIA Research papers feature breakthroughs in the field, including SuperPADL, a project that tackles the challenge of simulating complex human motions based on text prompts (see video at top).

Using a combination of reinforcement learning and supervised learning, the researchers demonstrated how the SuperPADL framework can be trained to reproduce the motion of more than 5,000 skills — and can run in real time on a consumer-grade NVIDIA GPU.

Another NVIDIA paper features a neural physics method that applies AI to learn how objects — whether represented as a 3D mesh, a NeRF or a solid object generated by a text-to-3D model — would behave as they are moved in an environment.

A paper written in collaboration with Carnegie Mellon University researchers develops a new kind of renderer — one that, instead of modeling physical light, can perform thermal analysis, electrostatics and fluid mechanics. Named one of five best papers at SIGGRAPH, the method is easy to parallelize and doesn’t require cumbersome model cleanup, offering new opportunities for speeding up engineering design cycles.

In the example above, the renderer performs a thermal analysis of the Mars Curiosity rover, where keeping temperatures within a specific range is critical to mission success.

Additional simulation papers introduce a more efficient technique for modeling hair strands and a pipeline that accelerates fluid simulation by 10x.

Raising the Bar for Rendering Realism, Diffraction Simulation

Another set of NVIDIA-authored papers present new techniques to model visible light up to 25x faster and simulate diffraction effects — such as those used in radar simulation for training self-driving cars — up to 1,000x faster.

A paper by NVIDIA and University of Waterloo researchers tackles free-space diffraction, an optical phenomenon where light spreads out or bends around the edges of objects. The team’s method can integrate with path-tracing workflows to increase the efficiency of simulating diffraction in complex scenes, offering up to 1,000x acceleration. Beyond rendering visible light, the model could also be used to simulate the longer wavelengths of radar, sound or radio waves.

Urban scene with colors showing simulation of cellular radiation propagation around buildings — Simulation of cellular signal coverage in a city.

Path tracing samples numerous paths — multi-bounce light rays traveling through a scene — to create a photorealistic picture. Two SIGGRAPH papers improve sampling quality for ReSTIR, a path-tracing algorithm first introduced by NVIDIA and Dartmouth College researchers at SIGGRAPH 2020 that has been key to bringing path tracing to games and other real-time rendering products.

One of these papers, a collaboration with the University of Utah, shares a new way to reuse calculated paths that increases effective sample count by up to 25x, significantly boosting image quality. The other improves sample quality by randomly mutating a subset of the light’s path. This helps denoising algorithms perform better, producing fewer visual artifacts in the final render.

Model of a sheep rendering with three different path-tracing techniques — From L to R: Compare the visual quality of previous sampling, the 25x improvement and a reference image. Model courtesy Blender Studio.

Teaching AI to Think in 3D

NVIDIA researchers are also showcasing multipurpose AI tools for 3D representations and design at SIGGRAPH.

One paper introduces fVDB, a GPU-optimized framework for 3D deep learning that matches the scale of the real world. The fVDB framework provides AI infrastructure for the large spatial scale and high resolution of city-scale 3D models and NeRFs, and segmentation and reconstruction of large-scale point clouds.

A Best Technical Paper award winner written in collaboration with Dartmouth College researchers introduces a theory for representing how 3D objects interact with light. The theory unifies a diverse spectrum of appearances into a single model.

And a collaboration with University of Tokyo, University of Toronto and Adobe Research introduces an algorithm that generates smooth, space-filling curves on 3D meshes in real time. While previous methods took hours, this framework runs in seconds and offers users a high degree of control over the output to enable interactive design.

NVIDIA at SIGGRAPH

Learn more about NVIDIA at SIGGRAPH, with special events including a fireside chat between NVIDIA founder and CEO Jensen Huang and Lauren Goode, senior writer at WIRED, on the impact of robotics and AI in industrial digitalization.

NVIDIA researchers will also present OpenUSD Day by NVIDIA, a full-day event showcasing how developers and industry leaders are adopting and evolving OpenUSD to build AI-enabled 3D pipelines.

NVIDIA Research has hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics. See more of their latest work.

Enhancing CTC-based Speech Recognition with Diverse Modeling Units

In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures like transformer. On top of E2E systems, researchers have achieved substantial accuracy improvement by rescoring E2E model’s N-best hypotheses with a phoneme-based model. This raises an interesting question about where the improvements come from other than the system combination effect. We examine the underlying mechanisms driving these gains and propose an efficient joint training approach, where E2E models are trained jointly…Apple Machine Learning Research

On Computationally Efficient Multi-Class Calibration

Consider a multi-class labelling problem, where the labels can take values in [k], and a predictor predicts a distribution over the labels. In this work, we study the following foundational question: Are there notions of multi-class calibration that give strong guarantees of meaningful predictions and can be achieved in time and sample complexities polynomial in k? Prior notions of calibration exhibit a tradeoff between computational efficiency and expressivity: they either suffer from having sample complexity exponential in k, or needing to solve computationally intractable problems, or give…Apple Machine Learning Research

Omnipredictors for Regression and the Approximate Rank of Convex Functions

Consider the supervised learning setting where the goal is to learn to predict labels y given points x from a distribution. An omnipredictor for a class L of loss functions and a class C of hypotheses is a predictor whose predictions incur less expected loss than the best hypothesis in C for every loss in L. Since the work of [GKR+21] that introduced the notion, there has been a large body of work in the setting of binary labels where y∈{0,1}, but much less is known about the regression setting where y∈[0,1] can be continuous. Our main conceptual contribution is the notion of sufficient…Apple Machine Learning Research

Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation

Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs also exhibit the “distraction phenomenon,” where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to…Apple Machine Learning Research

How Smooth Is Attention?

Self-attention and masked self-attention are at the heart of Transformers’ outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties — which are key when it comes to analyzing robustness and expressive power — is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length n in any compact…Apple Machine Learning Research

Careful With That Scalpel: Improving Gradient Surgery With an EMA

Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of the model (e.g. performance on another dataset, robustness, agreement with a prior). Although the simplest approach to incorporating an auxiliary loss is to sum it with the training loss as a regularizer, recent works have shown that one can improve performance by blending the gradients beyond a simple sum; this is known as gradient surgery. We cast the problem as a constrained minimization problem where the auxiliary objective is…Apple Machine Learning Research

Using Agents for Amazon Bedrock to interactively generate infrastructure as code

In the diverse toolkit available for deploying cloud infrastructure, Agents for Amazon Bedrock offers a practical and innovative option for teams looking to enhance their infrastructure as code (IaC) processes. Agents for Amazon Bedrock automates the prompt engineering and orchestration of user-requested tasks. After being configured, an agent builds the prompt and augments it with your company-specific information to provide responses back to the user in natural language.

This solution shows how Amazon Bedrock agents can be configured to accept cloud architecture diagrams, automatically analyze them, and generate Terraform or AWS CloudFormation templates. This solution uses Retrieval Augmented Generation (RAG) to ensure the generated scripts adhere to organizational needs and industry standards. A key feature is the agent’s ability to dynamically interact with users. During the IaC generation process, Amazon Bedrock agents actively probe for additional information by analyzing the provided diagrams and querying the user to fill any gaps. This interaction allows for a more tailored and precise IaC configuration.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this blog post, we explore how Agents for Amazon Bedrock can be used to generate customized, organization standards-compliant IaC scripts directly from uploaded architecture diagrams. This will help accelerate deployments, reduce errors, and ensure adherence to security guidelines.

Solution overview

Before we explore the deployment process, let’s walk through the key steps of the architecture as illustrated in Figure 1.

Figure 1 : High level overview of creating Infrastructure as Code from architecture diagram

Initial Input through the Amazon Bedrock chat console: The user begins by entering the name of their Amazon Simple Storage Service (Amazon S3) bucket and the object (key) name where the architecture diagram is stored into the Amazon Bedrock chat console. For instance, if an architecture diagram is saved as s3://testbucket/architecturediagram.png, the user will enter testbucket as the S3 bucket name and architecturediagram.png as the object name.
Diagram analysis and query generation: The Amazon Bedrock agent forwards the architecture diagram location to an action group that invokes an AWS Lambda. This function retrieves the architecture diagram from the specified S3 bucket, analyzes it using the Amazon Bedrock model, and produces a summary of the diagram. It also generates questions regarding any missing components, dependencies, or parameter values that are needed to create IaC for AWS services. This detailed response is then sent back to the agent.
Interaction and user confirmation: The agent displays the generated questions to the user and records their responses. Next, the agent provides a comprehensive summary of the architecture diagram along with additional inputs provided by the user. Users then have the opportunity to approve this configuration or suggest any necessary adjustments. On receiving confirmation from the user, the agent passes this information to the second action group to generate IaC.
IaC generation and deployment: The second action group invokes a Lambda function that processes the user’s input data along with organization-specific coding guidelines from Knowledge Bases for Amazon Bedrock to create the IaC. After being generated, the IaC is automatically pushed to a designated GitHub repository.

Prerequisites

You should have the following:

Understanding of Agents for Amazon Bedrock, prompt engineering, Knowledge Bases for Amazon Bedrock, Lambda functions, and AWS Identity and Access Management (IAM).
An AWS account with the appropriate IAM permissions to create Amazon Bedrock agents and knowledge bases, Lambda functions, and IAM roles.
Create a service role for Agents for Amazon Bedrock.
A GitHub account with a repository to store the generated Terraform scripts.

Deployment steps

The solution can be used to create IaC (using Terraform or CloudFormation) by inputting the architecture diagram. For the purpose of this blog post, we focus on creating Terraform IaC. There are four steps to deploy the solution.

Step 1: Configure an Amazon Bedrock knowledge base: Configuring a knowledge base (KB) enables you to access information about organization standard Terraform modules. Follow these steps to set up your KB:

Sign in and go to the AWS Management Console for Amazon Bedrock. Go directly to the Knowledge Base section. This is your starting point for creating a new KB.
Enter a clear and descriptive name that reflects the purpose of your KB, such as Terraform KB.
Assign a pre-configured IAM role with the necessary permissions. It’s typically best to let Amazon Bedrock create this role for you to ensure it has the correct permissions.
Define the data sources by uploading a JSON file to an S3 bucket with encryption enabled for security. This file should contain a structured list of AWS services and Terraform modules. For the JSON structure, use the example provided in the repository.
Choose the default embeddings model. For most use cases, the Amazon Bedrock Titan G1 Embeddings – Text model will suffice. It’s pre-configured and ready to use, simplifying the process.
Use the managed vector store to allow Amazon Bedrock to create and manage the vector store for you in Amazon OpenSearch Service.
Select the KB and in the Data source section, choose Sync to begin data ingestion. When data ingestion completes, a green success banner appears if it is successful.
Double-check all entered information for accuracy. Pay special attention to the S3 bucket URI and IAM role details.

Step 2: Configure the Bedrock agent:

Open the Amazon Bedrock console, select Agents in the left navigation panel, then choose Create Agent.
Enter agent details including agent name and description (optional).
Next, grant the agent permissions to AWS services through the IAM service role. This gives your agent access to required services, such as Lambda.
Select a foundation model from Amazon Bedrock (for example, Anthropic Claude 3 Sonnet).
To create Terraform code using Agents for Amazon Bedrock, attach the following instruction to the agent:

“Assist users in creating IaC for provided architecture diagram. Ask user for S3 bucket name and object name where the diagram is stored. Upon receiving the information, run analysis-query action group. Give structured summary and ask user only the questions that are received from action group response. Take the answers from the user and give detailed summary to the user. Take approval from user. When approved, give all that information to final draft along with S3 bucket name, object name as input for the iac-deployment action group and run the action group.”

Step 3: Configuring agent action groups: After initial agent configuration and adding the above instruction to the agent, there are two actions that need to be added to the agent to create Terraform IaC by passing an architecture diagram.

Create an action group linked to a Lambda function (for creating a Lambda function, see Getting started with Lambda) that is designed to analyze the architecture diagram and generates questions related to any missing components, dependencies, or parameter values necessary for IaC creation of AWS services. This group is invoked by the agent following the user’s input of S3 bucket and object details. The responses are then relayed back to the agent, which conducts an interactive session to collect any missing information from the user. See Lambda code and OpenAPI-schema in the repository.
Establish a second action group tied to a different Lambda function responsible for creating the Terraform code and uploading it to a GitHub repository. This group is invoked only after the user has reviewed and approved the infrastructure configuration. See Lambda code and OpenAPI-schema in the repository.

Step 4: Add the action groups to the agent:

Assign a descriptive name to each action group and detail their functions in the description fields. This helps clarify the purpose of each group within the workflow.
For each action group, select the appropriate Lambda functions that you set up previously. These functions run the business logic required when an action is invoked. Make sure to choose the correct version of each Lambda function. For additional details, see the section on Action Group Lambda Functions.
Provide the Amazon S3 URI that links to the API schema for each action group. This schema should include the API’s description, structure, and parameters. The API is crucial for managing the workflow, such as receiving user inputs, invoking Lambda functions to run the process, validating inputs, initiating Terraform module creation, and monitoring the provisioning status. For further guidance, see the section on Action Group OpenAPI Schemas.

The following screenshot shows an example of the user interaction with Agents for Amazon Bedrock

The following screenshot shows an example Terraform output

Clean up

The services used in this demonstration can incur costs. Complete the following steps to clean up your resources:

Delete the Lambda functions if they’re no longer required.
Delete action groups and Amazon Bedrock agent that were created.
Empty and delete the S3 bucket used for storing the architecture diagram.
Remove the generated Terraform scripts from the GitHub repo.
Delete the Amazon Bedrock knowledge base Bedrock if it’s no longer needed.

Conclusion

Agents for Amazon Bedrock uses generative AI to transform architecture diagrams into compliant infrastructure as code (IaC) scripts for AWS deployments, such as Terraform and AWS CloudFormation. This capability is a crucial tool for engineers transitioning to the cloud, speeding up the cloud adoption process while ensuring that deployments adhere to established best practices from the start.

Through the interactive features of Agents for Amazon Bedrock, the automation of IaC generation not only streamlines the initial set up but also significantly improves ongoing operations like infrastructure management. Although this post concentrates on IaC creation, the interactive capabilities of Agents for Amazon Bedrock can be used across various AWS services, providing a dynamic and comprehensive solution for managing and optimizing cloud infrastructure.

Are you ready to streamline your cloud deployment process with the generative AI of Amazon Bedrock? Start by delving into the Amazon Bedrock User Guide to see how it can facilitate your organization’s transition to the cloud. For specialized assistance, consider engaging with AWS Professional Services to maximize the efficiency and benefits of using Amazon Bedrock. Embrace the potential for a swift, secure, and efficient cloud transformation with Amazon Bedrock. Take the first step today and discover how using generative AI can revolutionize your approach to cloud infrastructure.

About the Author

Akhil Raj Yallamelli is a Cloud Infrastructure Architect at AWS, specializing in optimizing cloud infrastructures for enhanced data security and cost efficiency. He skillfully integrates technical solutions with business strategies to create scalable, reliable, and secure cloud environments. Akhil builds technical solutions focusing on customer business outcomes, incorporating generative AI (Gen AI) technologies to drive innovation. With deep expertise in AWS and a strong background in DevOps methodologies throughout the software development life cycle (SDLC), Akhil leads critical implementation and migration projects. He holds an MS degree in Computer Science. Outside of his professional work, Akhil enjoys watching and playing sports.

Ebbey Thomas specializes in strategizing and developing custom AWS Landing Zones with a focus on leveraging Generative AI to enhance cloud infrastructure automation. In his role at AWS Professional Services, Ebbey’s expertise is central to architecting solutions that streamline cloud adoption, ensuring a secure and efficient operational framework for AWS users. He is known for his innovative approach to cloud challenges and his commitment to driving forward the capabilities of cloud services.

Improve RAG accuracy with fine-tuned embedding models on Amazon SageMaker

Retrieval Augmented Generation (RAG) is a popular paradigm that provides additional knowledge to large language models (LLMs) from an external source of data that wasn’t present in their training corpus.

RAG provides additional knowledge to the LLM through its input prompt space and its architecture typically consists of the following components:

Indexing: Prepare a corpus of unstructured text, parse and chunk it, and then, embed each chunk and store it in a vector database.
Retrieval: Retrieve context relevant to answering a question from the vector database using vector similarity. Use prompt engineering to provide this additional context to the LLM along with the original question. The LLM will then use the original question and the context from the vector database to generate an answer based on data that wasn’t part of its training corpus.

Challenges in RAG accuracy

Pre-trained embedding models are typically trained on large, general-purpose datasets like Wikipedia or web-crawl data. While these models capture a broad range of semantic relationships and can generalize well across various tasks, they might struggle to accurately represent domain-specific concepts and nuances. This limitation can lead to suboptimal performance when using these pre-trained embeddings for specialized tasks or domains, such as legal, medical, or technical domains. Furthermore, pre-trained embeddings might not effectively capture the contextual relationships and nuances that are specific to a particular task or domain. For example, in the legal domain, the same term can have different meanings or implications depending on the context, and these nuances might not be adequately represented in a general-purpose embedding model.

To address the limitations of pre-trained embeddings and improve the accuracy of RAG systems for specific domains or tasks, it’s essential to fine tune the embedding model on domain-specific data. By fine tuning the model on data that is representative of the target domain or task, the model can learn to capture the relevant semantics, jargon, and contextual relationships that are crucial for that domain.

Domain-specific embeddings can significantly improve the quality of vector representations, leading to more accurate retrieval of relevant context from the vector database. This, in turn, enhances the performance of the RAG system in terms of generating more accurate and relevant responses.

This post demonstrates how to use Amazon SageMaker to fine tune a Sentence Transformer embedding model and deploy it with an Amazon SageMaker Endpoint. The code from this post and more examples are available in the GitHub repo. For more information about fine tuning Sentence Transformer, see Sentence Transformer training overview.

Fine tuning embedding models using SageMaker

SageMaker is a fully managed machine learning service that simplifies the entire machine learning workflow, from data preparation and model training to deployment and monitoring. It provides a seamless and integrated environment that abstracts away the complexities of infrastructure management, allowing developers and data scientists to focus solely on building and iterating their machine learning models.

One of the key strengths of SageMaker is its native support for popular open source frameworks such as TensorFlow, PyTorch, and Hugging Face transformers. This integration enables seamless model training and deployment using these frameworks, their powerful capabilities and extensive ecosystem of libraries and tools.

SageMaker also offers a range of built-in algorithms for common use cases like computer vision, natural language processing, and tabular data, making it easy to get started with pre-built models for various tasks. SageMaker also supports distributed training and hyperparameter tuning, allowing for efficient and scalable model training.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account set up.
An Amazon SageMaker JupyterLab configured with the python3 kernel
To quickly set up SageMaker Studio, you can create a domain for a single user and launch your JupyterLab.
An AWS Identity and Access Management (IAM) role for the Sagemaker notebook with sufficient permissions to write into an Amazon Simple Storage Service (Amazon S3) bucket, and create a Sagemaker endpoint. If you have administrator access to the account, no additional action is required.

Steps to fine tune embedding models on Amazon SageMaker

In the following sections, we use a SageMaker JupyterLab to walk through the steps of data preparation, creating a training script, training the model, and deploying it as a SageMaker endpoint.

We will fine tune the embedding model sentence-transformers, all-MiniLM-L6-v2, which is an open source Sentence Transformers model fine tuned on a 1B sentence pairs dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. To fine tune it, we will use the Amazon Bedrock FAQs, a dataset of question and answer pairs, using the MultipleNegativesRankingLoss function.

In Losses, you can find the different loss functions that can be used to fine-tune embedding models on training data. The choice of loss function plays a critical role when fine tuning the model. It determines how well our embedding model will work for the specific downstream task.

The MultipleNegativesRankingLoss function is recommended when you only have positive pairs in your training data, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of query and response, or pairs of (source_language and target_language).

In our case, considering that we’re using Amazon Bedrock FAQs as training data, which consists of pairs of questions and answers, the MultipleNegativesRankingLoss function could be a good fit.

The following code snippet demonstrates how to load a training dataset from a JSON file, prepares the data for training, and then fine tunes the pre-trained model. After fine tuning, the updated model is saved.

The EPOCHS variable determines the number of times the model will iterate over the entire training dataset during the fine-tuning process. A higher number of epochs typically leads to better convergence and potentially improved performance but might also increase the risk of overfitting if not properly regularized.

In this example, we have a small training set consisting of only 100 records. As a result, we’re using a high value for the EPOCHS parameter. Typically, in real-world scenarios, you would have a much larger training set. In such cases, the EPOCHS value should be a single- or two-digit number to avoid overfitting the model to the training data.

from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import InformationRetrievalEvaluator
import json

def load_data(path):
    """Load the dataset from a JSON file."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

dataset = load_data("training.json")


# Load the pre-trained model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Convert the dataset to the required format
train_examples = [InputExample(texts=[data["sentence1"], data["sentence2"]]) for data in dataset]

# Create a DataLoader object
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# Define the loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

EPOCHS=100

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=EPOCHS,
    show_progress_bar=True,
)

# Save the fine-tuned model
model.save("opt/ml/model/",safe_serialization=False)

To deploy and serve the fine-tuned embedding model for inference, we create an inference.py Python script that serves as the entry point. This script implements two essential functions: model_fn and predict_fn, as required by SageMaker for deploying and using machine learning models.

The model_fn function is responsible for loading the fine-tuned embedding model and the associated tokenizer. The predict_fn function takes input sentences, tokenizes them using the loaded tokenizer, and computes their sentence embeddings using the fine-tuned model. To obtain a single vector representation for each sentence, it performs mean pooling over the token embeddings followed by normalization of the resulting embedding. Finally, predict_fn returns the normalized embeddings as a list, which can be further processed or stored as required.

%%writefile opt/ml/model/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import os

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir, context=None):
  # Load model from HuggingFace Hub
  tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/model")
  model = AutoModel.from_pretrained(f"{model_dir}/model")
  return model, tokenizer

def predict_fn(data, model_and_tokenizer, context=None):
    # destruct model and tokenizer
    model, tokenizer = model_and_tokenizer
    
    # Tokenize sentences
    sentences = data.pop("inputs", data)
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    
    # return dictonary, which will be json serializable
    return {"vectors": sentence_embeddings[0].tolist()}

After creating the inference.py script, we package it together with the fine-tuned embedding model into a single model.tar.gz file. This compressed file can then be uploaded to an S3 bucket, making it accessible for deployment as a SageMaker endpoint.

import boto3
import tarfile
import os

model_dir = "opt/ml/model"
model_tar_path = "model.tar.gz"

with tarfile.open(model_tar_path, "w:gz") as tar:
    tar.add(model_dir, arcname=os.path.basename(model_dir))
    
s3 = boto3.client('s3')

# Get the region name
session = boto3.Session()
region_name = session.region_name

# Get the account ID from STS (Security Token Service)
sts_client = session.client("sts")
account_id = sts_client.get_caller_identity()["Account"]

model_path = f"s3://sagemaker-{region_name}-{account_id}/model_trained_embedding/model.tar.gz"

bucket_name = f"sagemaker-{region_name}-{account_id}"
s3_key = "model_trained_embedding/model.tar.gz"

with open(model_tar_path, "rb") as f:
    s3.upload_fileobj(f, bucket_name, s3_key)

Finally, we can deploy our fine-tuned model in a SageMaker endpoint.

from sagemaker.huggingface.model import HuggingFaceModel
import sagemaker

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=model_path,  # path to your trained SageMaker model
   role=sagemaker.get_execution_role(),                                            # IAM role with permissions to create an endpoint
   transformers_version="4.26",                           # Transformers version used
   pytorch_version="1.13",                                # PyTorch version used
   py_version='py39',                                    # Python version used
   entry_point="opt/ml/model/inference.py",
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

After the deployment is completed, you can find the deployed SageMaker endpoint in the AWS Management Console for SageMaker by choosing the Inference from the navigation pane, and then choosing Endpoints.

You have multiple options to invoke you endpoint. For example, in your SageMaker JupyterLab, you can invoke it with the following code snippet:

# example request: you always need to define "inputs"
data = {
   "inputs": "Are Agents fully managed?."
}

# request
predictor.predict(data)

It returns the vector containing the embedding of the inputs key:

{'vectors': [0.04694557189941406,
-0.07266131788492203,
-0.058242443948984146,
....,
]}

To illustrate the impact of fine tuning, we can compare the cosine similarity scores between two semantically related sentences using both the original pre-trained model and the fine-tuned model. A higher cosine similarity score indicates that the two sentences are more semantically similar, because their embeddings are closer in the vector space.

Let’s consider the following pair of sentences:

What are agents, and how can they be used?
Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, create an orchestration plan, securely connect to company data through APIs, and generate accurate responses for complex tasks like automating inventory management or processing insurance claims.

These sentences are related to the concept of agents in the context of Amazon Bedrock, although with different levels of detail. By generating embeddings for these sentences using both models and calculating their cosine similarity, we can evaluate how well each model captures the semantic relationship between them.

The original pre-trained model returns a similarity score of only 0.54.

The fine-tuned model returns a similarity score of 0.87.

We can observe how the fine-tuned model was able to identify a much higher semantic similarity between the concepts of agents and Agents for Amazon Bedrock when compared to the pre-trained model. This improvement is attributed to the fine-tuning process, which exposed the model to the domain-specific language and concepts present in the Amazon Bedrock FAQs data, enabling it to better capture the relationship between these terms.

Clean up

To avoid future charges in your account, delete the resources you created in this walkthrough. The SageMaker endpoint and the SageMaker JupyterLab instance will incur charges as long as the instances are active, so when you’re done delete the endpoint and resources that you created while running the walkthrough.

Conclusion

In this blog post, we have explored the importance of fine tuning embedding models to improve the accuracy of RAG systems in specific domains or tasks. We discussed the limitations of pre-trained embeddings, which are trained on general-purpose datasets and might not capture the nuances and domain-specific semantics required for specialized domains or tasks.

We highlighted the need for domain-specific embeddings, which can be obtained by fine tuning the embedding model on data representative of the target domain or task. This process allows the model to capture the relevant semantics, jargon, and contextual relationships that are crucial for accurate vector representations and, consequently, better retrieval performance in RAG systems.

We then demonstrated how to fine tune embedding models on Amazon SageMaker using the popular Sentence Transformers library.

By fine tuning embeddings on domain-specific data using SageMaker, you can unlock the full potential of RAG systems, enabling more accurate and relevant responses tailored to your specific domain or task. This approach can be particularly valuable in domains like legal, medical, or technical fields where capturing domain-specific nuances is crucial for generating high-quality and trustworthy outputs.

This and more examples are available in the GitHub repo. Try it out today using the Set up for single users (Quick setup) on Amazon SageMaker and let us know what you think in the comments.

About the Authors

Ennio Emanuele Pastore is a Senior Architect on the AWS GenAI Labs team. He is an enthusiast of everything related to new technologies that have a positive impact on businesses and general livelihood. He helps organizations in achieving specific business outcomes by using data and AI and accelerating their AWS Cloud adoption journey.

How BRIA AI used distributed training in Amazon SageMaker to train latent diffusion foundation models for commercial use

This post is co-written with Bar Fingerman from BRIA AI.

This post explains how BRIA AI trained BRIA AI 2.0, a high-resolution (1024×1024) text-to-image diffusion model, on a dataset comprising petabytes of licensed images quickly and economically. Amazon SageMaker training jobs and Amazon SageMaker distributed training libraries took on the undifferentiated heavy lifting associated with infrastructure management. SageMaker helps you build, train, and deploy machine learning (ML) models for your use cases with fully managed infrastructure, tools, and workflows.

BRIA AI is a pioneering platform specializing in responsible and open generative artificial intelligence (AI) for developers, offering advanced models exclusively trained on licensed data from partners such as Getty Images, DepositPhotos, and Alamy. BRIA AI caters to major brands, animation and gaming studios, and marketing agencies with its multimodal suite of generative models. Emphasizing ethical sourcing and commercial readiness, BRIA AI’s models are source-available, secure, and optimized for integration with various tech stacks. By addressing foundational challenges in data procurement, continuous model training, and seamless technology integration, BRIA AI aims to be the go-to platform for creative AI application developers.

You can also find the BRIA AI 2.0 model for image generation on AWS Marketplace.

This blog post discusses how BRIA AI worked with AWS to address the following key challenges:

Achieving out-of-the-box operational excellence for large model training
Reducing time-to-train by using data parallelism
Maximizing GPU utilization with efficient data loading
Reducing model training cost (by paying only for net training time)

Importantly, BRIA AI was able to use SageMaker while keeping the initially used HuggingFace Accelerate (Accelerate) software stack intact. Thus, transitioning to SageMaker training didn’t require changes to BRIA AI’s model implementation or training code. Later, BRIA AI was able to seamlessly evolve their software stack on SageMaker along with their model training.

Training pipeline architecture

BRIA AI’s training pipeline consists of two main components:

Data preprocessing:

Data contributors upload licensed raw image files to BRIA AI’s Amazon Simple Storage Service (Amazon S3) bucket.
An image pre-processing pipeline using Amazon Simple Queue Service (Amazon SQS) and AWS Lambda functions generates missing image metadata and packages training data into large webdataset files for later efficient data streaming directly from an S3 bucket, and data sharding across GPUs. See the [Challenge 1] section. Webdataset is a PyTorch implementation therefore it fits well with Accelerate.

Model training:

SageMaker distributes training jobs for managing the training cluster and runs the training itself.
Streaming data from S3 to the training instances using SageMaker’s FastFile mode.

Pre-training challenges and solutions

Pre-training foundation models is a challenging task. Challenges include cost, performance, orchestration, monitoring, and the engineering expertise needed throughout the weeks-long training process.

The four challenges we faced were:

Challenge 1: Achieving out-of-the-box operational excellence for large model training

To orchestrate the training cluster and recover from failures, BRIA AI relies on SageMaker Training Jobs’ resiliency features. These include cluster health checks, built-in retries, and job resiliency. Before your job starts, SageMaker runs GPU health checks and verifies NVIDIA Collective Communications Library (NCCL) communication on GPU instances, replacing faulty instances (if necessary) to make sure your training script starts running on a healthy cluster of instances. You can also configure SageMaker to automatically retry training jobs that fail with a SageMaker internal server error (ISE). As part of retrying a job, SageMaker will replace instances that encountered unrecoverable GPU errors with fresh instances, reboot the healthy instances, and start the job again. This results in faster restarts and workload completion. By using AWS Deep Learning Containers, the BRIA AI workload benefited from the SageMaker SDK automatically setting the necessary environment variables to tune NVIDIA NCCL AWS Elastic Fabric Adapter (EFA) networking based on well-known best practices. This helps maximize the workload throughput.

To monitor the training cluster, BRIA AI used the built-in SageMaker integration to Amazon CloudWatch logs (applicative logs), and CloudWatch metrics (CPU, GPU, and networking metrics).

Challenge 2: Reducing time-to-train by using data parallelism

BRIA AI needed to train a stable-diffusion 2.0 model from scratch on petabytes-scale licensed image dataset. Training on a single GPU could take few month to complete. To meet deadline requirements, BRIA AI used data parallelism by using a SageMaker training with 16 p4de.24xlarge instances, reducing the total training time to under two weeks. Distributed data parallel training allows for much faster training of large models by splitting data across many devices that train in parallel, while syncing gradients regularly to keep a consistent shared model. It uses the combined computing power of many devices. BRIA AI used a cluster of four p4de.24xlarge instances (8xA100 80GB NVIDIA GPUs) to achieve a throughput of 1.8 it per second for an effective batch size of 2048 (batch=8, bf16, accumulate=2).

p4de.24xlarge instances include 600 GB per second peer-to-peer GPU communication with NVIDIA NVSwitch. 400 gigabits per second (Gbps) instance networking with support for EFA and NVIDIA GPUDirect RDMA (remote direct memory access).

Note: Currently you can use p5.48xlarge instances (8XH100 80GB GPUs) with 3200 Gbps networking between instances using EFA 2.0 (not used in this pre-training by BRIA AI).

Accelerate is a library that enables the same PyTorch code to be run across a distributed configuration with minimal code adjustments.

BRIA AI used Accelerate for small scale training off the cloud. When it was time to scale out training in the cloud, BRIA AI was able to continue using Accelerate, thanks to its built-in integration with SageMaker and Amazon SageMaker distributed data parallel library (SMDDP). SMDDP is purpose built to the AWS infrastructure, reducing communications overhead in two ways:

The library performs AllReduce, a key operation during distributed training that’s responsible for a large portion of communication overhead (optimal GPU usage with efficient AllReduce overlapping with a backward pass).
The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and Amazon Elastic Compute Cloud (Amazon EC2) instance topology (optimal bandwidth use with balanced fusion buffer).

Note that SageMaker training supports many open source distributed training libraries, for example Fully Sharded Data Parallel (FSDP), and DeepSpeed. BRIA AI used FSDP in SageMaker in other training workloads. In this case, by using the ShardingStrategy.SHARD_GRAD_OP feature, BRIA AI was able to achieve an optimal batch size and accelerate their training process.

Challenge 3: Achieving efficient data loading

The BRIA AI dataset included hundreds of millions of images that needed to be delivered from storage onto GPUs for processing. Efficiently accessing this large amount of data across a training cluster presents several challenges:

The data might not fit into the storage of a single instance.
Downloading the multi-terabyte dataset to each training instance is time consuming while the GPUs sit idle.
Copying millions of small image files from Amazon S3 can become a bottleneck because of accumulated roundtrip time of fetching objects from S3.
The data needs to be split correctly between instances.

BRIA AI addressed these challenges by using SageMaker fast file input mode, which provided the following out-of-the-box features:

Streaming Instead of copying data when training starts, or using an additional distributed file system, we chose to stream data directly from Amazon S3 to the training instances using SageMaker fast file mode. This allows training to start immediately without waiting for downloads. Streaming also reduces the need to fit datasets into instance storage.
Data distribution: Fast file mode was configured to shard the dataset files between multiple instances using S3DataDistributionType=ShardedByS3Key.
Local file access: Fast file mode provides a local POSIX filesystem interface to data in Amazon S3. This allowed BRIA AI’s data loader to access remote data as if it was local.
Packaging files to large containers: Using millions of small image and metadata files is an overhead when streaming data from object storage like Amazon S3. To reduce this overhead, BRIA AI compacted multiple files into large TAR file containers (2–5 GB), which can be efficiently streamed from S3 using fast file mode to the instances. Specifically, BRIA AI used WebDataset for efficient local data loading and used a policy wherein there is no data loading synchronization between instances and each GPU loads random batches through a fixed seed. This policy helps eliminate bottlenecks and maintains fast and deterministic data loading performance.

For more on data loading considerations, see Choose the best data source for your Amazon SageMaker training job blog post.

Challenge 4: Paying only for net training time

Pre-training large language models is not continuous. The model training often requires intermittent stops for evaluation and adjustments. For instance, the model might stop converging and need adjustments, or you might want to pause training to test the model, refine data, or troubleshoot issues. These pauses result in extended periods where the GPU cluster is idle. With SageMaker training jobs, BRIA AI was able to only pay for the duration of their active training time. This allowed BRIA AI to train models at a lower cost and with greater efficiency.

BRIA AI training strategy is composed of three steps for resolution for optimal model convergence:

Initial training on a 256×256 – 32 GPUs cluster
Progressive refinement to a 512×512 – 64 GPUs cluster
Final training on a 1024×1024 – 128 GPUs cluster

In each step, the computing required was different due to applied tradeoffs, such as the batch size per resolution and the upper limit of the GPU and gradient accumulation. The tradeoff is between cost-saving and model coverage.

BRIA AI’s cost calculations were facilitated by maintaining a consistent iteration per second rate, which allowed for accurate estimation of training time. This enabled precise determination of the required number of iterations and calculation of the training compute cost per hour.

BRIA AI training GPU utilization and average batch size time:

GPU utilization: Average is over 98 percent, signifying maximization of GPUs for the whole training cycle and that our data loader is efficiently streaming data at a high rate.
Iterations per second : Training strategy is composed of three steps—Initial training on 256×256, progressive refinement to 512×512, and final training on 1024×1024 resolution for optimal model convergence. For each step, the amount of computing varies because there are tradeoffs that we can apply with different batch sizes per resolution while considering the upper limit of the GPU and gradient accumulation, where the tension is cost-saving against model coverage.

Result examples

Prompts used for generating the images
Prompt 1, upper left image: A stylish man sitting casually on outdoor steps, wearing a green hoodie, matching green pants, black shoes, and sunglasses. He is smiling and has neatly groomed hair and a short beard. A brown leather bag is placed beside him. The background features a brick wall and a window with white frames.

Prompt 2, upper right image: A vibrant Indian wedding ceremony. The smiling bride in a magenta saree with gold embroidery and henna-adorned hands sits adorned in traditional gold jewelry. The groom, sitting in front of her, in a golden sherwani and white dhoti, pours water into a ceremonial vessel. They are surrounded by flowers, candles, and leaves in a colorful, festive atmosphere filled with traditional objects.

Prompt 3, lower left image: A wooden tray filled with a variety of delicious pastries. The tray includes a croissant dusted with powdered sugar, a chocolate-filled croissant, a partially eaten croissant, a Danish pastry and a muffin next to a small jar of chocolate sauce, and a bowl of coffee beans, all arranged on a beige cloth.

Prompt 4, lower right image: A panda pouring milk into a white cup on a table with coffee beans, flowers, and a coffee press. The background features a black-and-white picture and a decorative wall piece.

Conclusion

In this post, we saw how Amazon SageMaker enabled BRIA AI to train a diffusion model efficiently, without needing to manually provision and configure infrastructure. By using SageMaker training, BRIA AI was able to reduce costs and accelerate iteration speed, reducing training time with distributed training while maintaining 98 percent GPU utilization, and maximize value per cost. By taking on the undifferentiated heavy lifting, SageMaker empowered BRIA AI’s team to be more productive and deliver innovations faster. The ease of use and automation offered by SageMaker training jobs makes it an attractive option for any team looking to efficiently train large, state-of-the-art models.

To learn more about how SageMaker can help you train large AI models efficiently and cost-effectively, explore the Amazon SageMaker page. You can also reach out to your AWS account team to discover how to unlock the full potential of your large-scale AI initiatives.

About the Authors

Bar Fingerman, Head Of Engineering AI/ML at BRIA AI.

Doron Bleiberg, Senior Startup Solutions Architect.

Gili Nachum, Principal Gen AI/ML Specialist Solutions Architect.

Erez Zarum, Startup Solutions Architect,

Diffusion Models Improve Texture Painting, Text-to-Image Generation

Kick-Starting Developments in Physics-Based Simulation

Raising the Bar for Rendering Realism, Diffraction Simulation

Teaching AI to Think in 3D

NVIDIA at SIGGRAPH

Solution overview

Prerequisites

Deployment steps

Clean up

Conclusion

About the Author

Challenges in RAG accuracy

Fine tuning embedding models using SageMaker

Prerequisites

Steps to fine tune embedding models on Amazon SageMaker

Clean up

Conclusion

About the Authors

Training pipeline architecture

Pre-training challenges and solutions

Challenge 1: Achieving out-of-the-box operational excellence for large model training

Challenge 2: Reducing time-to-train by using data parallelism

Challenge 3: Achieving efficient data loading

Challenge 4: Paying only for net training time

Result examples

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.