Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart

Text vectors, or embeddings, are numerical vector representations of text that are generated by large language models (LLMs). After LLMs are pre-trained on large datasets or fine-tuned on tasks such as text completion, question answering, and translation, the embeddings they produce capture the semantic information of the input text. Text embeddings enable a range of downstream applications, including similarity search, information retrieval, recommendations and personalization, multilingual translation, and more.

Before intelligent applications can be built from embeddings, enterprises and organizations must first embed their existing documents, which can be expensive and technically complicated. Amazon SageMaker JumpStart is a machine learning (ML) hub that helps accelerate this journey. With SageMaker JumpStart, you can access pre-trained, cutting-edge text embedding models from various model providers, including Hugging Face, AI21 Labs, Cohere, and Meta AI. You can seamlessly deploy these models into production with the SageMaker JumpStart user interface or SDK. In addition, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave your own VPC, you can trust that your data remains private and confidential.

In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after they are converted into embeddings by the LLM, which is a foundational step for applications like Retrieval Augmented Generation (RAG). We demonstrate how to do the following:

  • Run inference on a text embedding model deployed from SageMaker JumpStart
  • Find the nearest neighbors for an input sentence with your own dataset
  • Run the batch transform on large documents to minimize costs

All the code is available on GitHub.

Deploy a text embedding model via SageMaker JumpStart

To host a model on Amazon SageMaker, the first step is to set up and authenticate the use of AWS services. In Amazon SageMaker Studio, we use the execution role associated with the notebook instance. See the following code:

import sagemaker, boto3, json
from sagemaker.session import Session
sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

Hugging Face hosts the Massive Text Embedding Benchmark (MTEB) as a leaderboard for diverse text embedding tasks. At the time of writing, it provides 129 benchmarking datasets across 8 different tasks in 113 languages. The top text embedding models from the MTEB leaderboard are made available in SageMaker JumpStart, including bge, gte, e5, and more. In this post, we use huggingface-sentencesimilarity-bge-large-en as an example. We can use the SageMaker SDK to deploy this state-of-the-art text embedding model:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
predictor = text_embedding_model.deploy()

Text embedding model query

Let’s look at the text embedding model query in more detail.

Text to embedding

If you have already deployed a SageMaker endpoint before, the predictor can be restored as follows:

from sagemaker.predictor import Predictor
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import IdentitySerializer

predictor = Predictor(
    endpoint_name=<YOUR_ENDPOINT_NAME>,
    deserializer=JSONDeserializer(),
    serializer=IdentitySerializer(),
)
predictor.content_type = "application/x-text"

After the model is successfully deployed, you can query the endpoint with a batch of input texts within a JSON payload:

sentences = [
    # Pets
    "Your dog is so cute.",
    "How cute your dog is!",
    "You have such a cute dog!",
    # Cities
    "Sydney is the place where I work.",
    "I work in Sydney.",
    # Color
    "What colour do you like the most?",
    "What is your favourite colour?",
]

predictor.predict(json.dumps(sentences).encode('utf-8'))

The correlation of the embeddings of these sentences is plotted in the following figure.

Correlation heat map of the sentence embeddings

As shown in the preceding figure, sentences within the same subject (Pets, Cities, Color) are highly correlated with one another, whereas sentences from different subjects are much less similar. This indicates that the embeddings generated by the LLM (in this case, bge) accurately represent the semantic information of the input text.
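
As a minimal sketch of how such a correlation matrix can be computed from the endpoint response, the following assumes the model returns a JSON object with an embedding entry holding one vector per input sentence (the exact key can vary by model) and uses scikit-learn to compute pairwise cosine similarity:

import json

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query the endpoint with the batch of sentences defined earlier
response = predictor.predict(json.dumps(sentences).encode("utf-8"))

# Assumption: the response is a dict with an "embedding" key holding one
# vector per input sentence; adjust the key to match your model's output.
embeddings = np.array(response["embedding"])

# Pairwise cosine similarity between all sentence embeddings
similarity_matrix = cosine_similarity(embeddings)
print(np.round(similarity_matrix, 2))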

For this post, we used the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. Latency is the amount of time from the moment that a user sends a request until the time that the application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same batch of input texts on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model | g5.2xlarge Average Latency (ms) | c6i.xlarge Average Latency (ms) | Language Support
all-MiniLM-L6-v2 | 19.5 | 27.9 | English
BGE Base En | 21.2 | 114 | English
BGE Small En | 28.3 | 45.6 | English
BGE Large En | 34.7 | 337 | English
Multilingual E5 Base | 22.1 | 118 | Multilingual
Multilingual E5 Large | 39.8 | 360 | Multilingual
E5 Base | 25.6 | 117 | English
E5 Base V2 | 25.2 | 123 | English
E5 Large | 32.2 | 339 | English
E5 Large V2 | 32.5 | 331 | English
GTE Base | 22.2 | 112 | English
GTE Small | 19.7 | 46 | English
GTE Large | 39.7 | 347 | English

Get the nearest neighbors

The deployed model from SageMaker JumpStart can also facilitate the process of identifying the nearest neighbors to queries within the corpus. When provided with queries and a corpus, the model will produce the corpus_id, which denotes the position of the relevant corpus entry in the input corpus list, and a score indicating the degree of proximity to the query. It uses the following parameters:

  • corpus – Provides the list of inputs from which to find the nearest neighbor
  • queries – Provides the list of inputs for which to find the nearest neighbor from the corpus
  • top_k – The number of nearest neighbors to find from the corpus
  • mode – Set as nn_corpus for getting the nearest neighbors to input queries within the corpus

See the following code:

corpus = [
    "Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.",
    "Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.",
    "Amazon SageMaker provides a full end-to-end workflow, but you can continue to use your existing tools with SageMaker. You can easily transfer the results of each stage in and out of SageMaker as your business requirements dictate."
]
queries = [
    "What is Amazon SageMaker?",
    "How does Amazon SageMaker secure my code?",
    "What if I have my own notebook, training, or hosting environment in my own business environment?"
]

payload_nearest_neighbor = {"corpus": corpus, "queries": queries, "top_k": 3, "mode": "nn_corpus"}
query_response = predictor.predict(payload_nearest_neighbor)

We get the following output:

[
    [
        {'corpus_id': 0, 'score': 0.8992230892181396},
        {'corpus_id': 2, 'score': 0.8664969205856323},
        {'corpus_id': 1, 'score': 0.8456423282623291}
    ],
    [
        {'corpus_id': 1, 'score': 0.8919335603713989},
        {'corpus_id': 0, 'score': 0.840064525604248},
        {'corpus_id': 2, 'score': 0.8145401477813721}
    ],
    [
        {'corpus_id': 2, 'score': 0.7712811231613159},
        {'corpus_id': 1, 'score': 0.7564010620117188},
        {'corpus_id': 0, 'score': 0.7525666356086731}
    ]
]

This result means the first query is most similar to the first corpus entry, the second query is closest to the second entry, and so on. These are the correct matches in this example.

We also took the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. The numbers in the following table represent the average latency for a total of 100 requests using the same payload on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model | g5.2xlarge Average Latency (ms) | c6i.xlarge Average Latency (ms) | Language Support
all-MiniLM-L6-v2 | 21.7 | 69.1 | English
BGE Base En | 29.1 | 372 | English
BGE Small En | 29.2 | 124 | English
BGE Large En | 47.2 | 1240 | English
Multilingual E5 Base | 30 | 389 | Multilingual
Multilingual E5 Large | 47.1 | 1380 | Multilingual
E5 Base | 30.4 | 373 | English
E5 Base V2 | 31 | 409 | English
E5 Large | 45.9 | 1230 | English
E5 Large V2 | 49.6 | 1220 | English
GTE Base | 30.3 | 375 | English
GTE Small | 28.5 | 129 | English
GTE Large | 46.6 | 1320 | English

Get the nearest neighbors on a large dataset

When making requests to the SageMaker invoke endpoint, payloads are restricted to approximately 5 MB, and the request timeout is set to 1 minute. If the corpus size exceeds these limits, you can use a SageMaker training job, which generates embeddings for your large dataset and persists them alongside the model inside the SageMaker endpoint, so they don’t have to be passed as part of the invocation payload. The process of finding the nearest neighbors is carried out using SentenceTransformer and its utility functions. The nearest neighbor is based on the cosine similarity between the embedding of the input sentence and the sentence embeddings precomputed during the training job.
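
Conceptually, the lookup that the endpoint performs resembles the following sketch, which uses the sentence-transformers cosine-similarity search utility; the model name and example data are illustrative assumptions, not the exact artifacts produced by the training job.

from sentence_transformers import SentenceTransformer, util

# Illustrative model; the JumpStart endpoint hosts its own copy of the
# selected embedding model (for example, bge-large-en).
model = SentenceTransformer("BAAI/bge-large-en")

corpus = ["Amazon SageMaker is a fully managed ML service.",
          "SageMaker stores code in ML storage volumes."]
queries = ["What is Amazon SageMaker?"]

# Precompute corpus embeddings once (this is what the training job persists)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# Cosine-similarity search returns the top_k nearest corpus entries per query
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=1)
print(hits)  # [[{'corpus_id': 0, 'score': ...}]]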

In the following example, we fetch and prepare the Amazon_SageMaker_FAQs dataset to use it in finding the nearest neighbor to an input question:

!aws s3 cp s3://jumpstart-cache-prod-us-west-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv Amazon_SageMaker_FAQs.csv

import pandas as pd

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])
data["id"] = data.index
data_req = data[["id", "Answers"]]
data_req.to_csv("data.csv", index=False, header=False)

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ss-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_dataset_s3_path = f"s3://{output_bucket}/{output_prefix}/data/data.csv"

!aws s3 cp data.csv {training_dataset_s3_path}

Algorithm-specific training hyperparameters can be fetched from the SageMaker SDK and overwritten:

from sagemaker import hyperparameters

hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version = "*")
hyperparameters["batch_size"] = "64"
print(hyperparameters)
>>> {'max_seq_length': 'None', 'batch_size': '64', 'store_text_with_embedding': 'True'}

The SageMaker training consists of two steps: create the estimator object and launch the training job. The output is a model prepackaged with embeddings of your large dataset used as training data, which can be deployed for inference to get the nearest neighbor for any input sentence. See the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=hyperparameters,
    output_path=s3_output_location
)

estimator.fit(
    {"training": f"s3://{output_bucket}/{output_prefix}/data"}
)
predictor = estimator.deploy()

The query syntax to convert text into embeddings is the same as before. The code to get the nearest neighbor, however, can be simplified as follows:

payload_nearest_neighbour = {
    "queries": ["Is R supported with Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_train_data",
}

response = predictor.predict(payload_nearest_neighbour)
>>> [[{'id': '9', 'score': 0.9240573048591614}]]

data["Answers"].iloc[int(response[0][0]["id"])]
>>> "Yes, R is supported with Amazon SageMaker. You can use R within SageMaker notebook instances, which include a preinstalled R kernel and the reticulate library. Reticulate offers an R interface for the Amazon SageMaker Python SDK, enabling ML practitioners to build, train, tune, and deploy R models."

We can also query the endpoint with the questions in the Amazon_SageMaker_FAQs dataset and compare how many of the correct corresponding answers are returned. In the following example, we measure the top-3 accuracy, given that there could be similar question-answer pairs. This means that if the correct answer is returned as one of the top three results, the query is treated as correct.

total_correct_answers = 0

for i in range(len(data)):
    question = data["Questions"].iloc[i]
    payload_nearest_neighbor = {
        "queries": [question],
        "top_k": 3,
        "mode": "nn_train_data",
    }
    response = predictor.predict(payload_nearest_neighbor)
    response_ids = [int(res["id"]) for res in response[0]]

    if i in response_ids:
        total_correct_answers += 1
    else:
        # Keep the predicted answers for inspection when the correct answer is not in the top 3
        pred_answer = [data["Answers"].iloc[response_id] for response_id in response_ids]

print(total_correct_answers*100/len(data))
>>>
81.16883116883118

Run a batch transform to get embeddings on large datasets

If your organization has a large volume of historical documents that exceeds the memory of a single endpoint instance, you can use SageMaker batch transform to save cost. When you start a batch transform job, SageMaker launches the necessary compute resources to process the data. During the job, SageMaker automatically provisions and manages the compute resources. When the batch transform job is complete, those resources are automatically cleaned up, which minimizes costs. By dividing a large dataset into smaller chunks and using more instances, you can scale out the compute for faster inference at a similar cost, without managing infrastructure. The maximum payload for batch transform is 100 MB, and the timeout is 1 hour.

The input format for our batch transform job is a JSONL file, where each line is a JSON object that consists of id and text_inputs. See the following code:

test_data_file_name = "test.jsonl"
test_data = []

for i in range(len(data)):
    answer = data.loc[i, "Answers"]
    payload = {"id": i, "text_inputs": answer}
    test_data.append(payload)

with open(test_data_file_name, "w") as outfile:
    for entry in test_data:
        outfile.write(f"{json.dumps(entry)}\n")

s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/test.jsonl")

When the data is ready in Amazon Simple Storage Service (Amazon S3), you can create the batch transform object from the SageMaker JumpStart model, which triggers the transform job:

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"

batch_transformer = text_embedding_model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)

batch_transformer.transform(
    s3_input_data_path,
    content_type="application/jsonlines",
    split_type="Line"
)

batch_transformer.wait()

After the batch transform job is complete, you can download the result from Amazon S3:

s3 = boto3.client("s3")
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "test.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)

Conclusion

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language foundation models for text embedding and semantic search. With the user interface or just a few lines of code, you can deploy a highly accurate text embedding model and find semantic matches across large datasets, at scale and cost-efficiently. SageMaker JumpStart removes the barriers to implement semantic search by providing instant access to cutting-edge models like the ones benchmarked on the MTEB leaderboard. Businesses and developers can build intelligent search and recommendation systems faster.

This post demonstrated how to find semantically similar questions and answers, which could be applied to RAG use cases, recommendations and personalization, multilingual translations, and more. With continued advances in language models and the simplicity of SageMaker JumpStart, more organizations can infuse generative AI capabilities into their products. As the next step, you can try text-embedding models from SageMaker JumpStart on your own dataset to test and benchmark the results for your RAG use cases.


About the Authors

Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his masters from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Open sourcing Project Guideline: A platform for computer vision accessibility technology

Two years ago we announced Project Guideline, a collaboration between Google Research and Guiding Eyes for the Blind that enabled people with visual impairments (e.g., blindness and low-vision) to walk, jog, and run independently. Using only a Google Pixel phone and headphones, Project Guideline leverages on-device machine learning (ML) to navigate users along outdoor paths marked with a painted line. The technology has been tested all over the world and even demonstrated during the opening ceremony at the Tokyo 2020 Paralympic Games.

Since the original announcement, we set out to improve Project Guideline by embedding new features, such as obstacle detection and advanced path planning, to safely and reliably navigate users through more complex scenarios (such as sharp turns and nearby pedestrians). The early version featured a simple frame-by-frame image segmentation that detected the position of the path line relative to the image frame. This was sufficient for orienting the user to the line, but provided limited information about the surrounding environment. Improving the navigation signals, such as alerts for obstacles and upcoming turns, required a much better understanding and mapping of the users’ environment. To solve these challenges, we built a platform that can be utilized for a variety of spatially-aware applications in the accessibility space and beyond.

Today, we announce the open source release of Project Guideline, making it available for anyone to use to improve upon and build new accessibility experiences. The release includes source code for the core platform, an Android application, pre-trained ML models, and a 3D simulation framework.

System design

The primary use case is an Android application; however, we wanted to be able to run, test, and debug the core logic in a variety of environments in a reproducible way. This led us to design and build the system using C++ for close integration with MediaPipe and other core libraries, while still being able to integrate with Android using the Android NDK.

Under the hood, Project Guideline uses ARCore to estimate the position and orientation of the user as they navigate the course. A segmentation model, built on the DeepLabV3+ framework, processes each camera frame to generate a binary mask of the guideline (see the previous blog post for more details). Points on the segmented guideline are then projected from image-space coordinates onto a world-space ground plane using the camera pose and lens parameters (intrinsics) provided by ARCore. Since each frame contributes a different view of the line, the world-space points are aggregated over multiple frames to build a virtual mapping of the real-world guideline. The system performs piecewise curve approximation of the guideline world-space coordinates to build a spatio-temporally consistent trajectory. This allows refinement of the estimated line as the user progresses along the path.
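
Although Project Guideline itself is implemented in C++, the projection step can be illustrated with a short Python sketch: a pixel on the segmented line is back-projected into a ray using the camera intrinsics, rotated into world coordinates with the camera pose, and intersected with the ground plane. The variable names and numbers below are illustrative, not taken from the Project Guideline codebase.

import numpy as np

def pixel_to_ground_plane(u, v, K, R_wc, t_wc, ground_y=0.0):
    """Project an image point (u, v) onto the world ground plane y = ground_y.

    K    -- 3x3 camera intrinsics matrix
    R_wc -- 3x3 rotation from camera to world coordinates
    t_wc -- 3-vector camera position in world coordinates
    """
    # Back-project the pixel into a ray in camera coordinates
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate the ray into world coordinates
    ray_world = R_wc @ ray_cam
    # Intersect the ray t_wc + s * ray_world with the plane y = ground_y
    # (assumes the ray is not parallel to the ground plane)
    s = (ground_y - t_wc[1]) / ray_world[1]
    return t_wc + s * ray_world

# Example: simple pinhole intrinsics, camera 1.5 m above the ground, looking along +z
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R_wc = np.diag([1.0, -1.0, 1.0])   # camera y (image down) maps to world -y (up)
t_wc = np.array([0.0, 1.5, 0.0])
print(pixel_to_ground_plane(320, 400, K, R_wc, t_wc))  # a point on the path ahead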

Project Guideline builds a 2D map of the guideline, aggregating detected points in each frame (red) to build a stateful representation (blue) as the runner progresses along the path.

A control system dynamically selects a target point on the line some distance ahead based on the user’s current position, velocity, and direction. An audio feedback signal is then given to the user to adjust their heading to coincide with the upcoming line segment. By using the runner’s velocity vector instead of camera orientation to compute the navigation signal, we eliminate noise caused by irregular camera movements common during running. We can even navigate the user back to the line while it’s out of camera view, for example if the user overshot a turn. This is possible because ARCore continues to track the pose of the camera, which can be compared to the stateful line map inferred from previous camera images.
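
As a simplified illustration of that idea (not Project Guideline's actual control law), the navigation signal can be thought of as the signed angle between the runner's velocity vector and the direction to the target point in a 2D ground-plane frame:

import numpy as np

def steering_angle(user_position, user_velocity, target_point):
    """Signed angle (radians, in [-pi, pi)) between the direction of travel and
    the direction to the target point; positive means the target is to the left
    in this coordinate convention."""
    to_target = np.asarray(target_point) - np.asarray(user_position)
    heading = np.asarray(user_velocity)
    angle = np.arctan2(to_target[1], to_target[0]) - np.arctan2(heading[1], heading[0])
    # Wrap the difference into [-pi, pi)
    return (angle + np.pi) % (2 * np.pi) - np.pi

# Example: runner moving along +x, target point slightly off to one side
print(steering_angle([0.0, 0.0], [1.0, 0.0], [5.0, 0.5]))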

Project Guideline also includes obstacle detection and avoidance features. An ML model is used to estimate depth from single images. To train this monocular depth model, we used SANPO, a large dataset of outdoor imagery from urban, park, and suburban environments that was curated in-house. The model is capable of detecting the depth of various obstacles, including people, vehicles, posts, and more. The depth maps are converted into 3D point clouds, similar to the line segmentation process, and used to detect the presence of obstacles along the user’s path and then alert the user through an audio signal.

Using a monocular depth ML model, Project Guideline constructs a 3D point cloud of the environment to detect and alert the user of potential obstacles along the path.

A low-latency audio system based on the AAudio API was implemented to provide the navigational sounds and cues to the user. Several sound packs are available in Project Guideline, including a spatial sound implementation using the Resonance Audio API. The sound packs were developed by a team of sound researchers and engineers at Google who designed and tested many different sound models. The sounds use a combination of panning, pitch, and spatialization to guide the user along the line. For example, a user veering to the right may hear a beeping sound in the left ear to indicate the line is to the left, with increasing frequency for a larger course correction. If the user veers further, a high-pitched warning sound may be heard to indicate the edge of the path is approaching. In addition, a clear “stop” audio cue is always available in the event the user veers too far from the line, an anomaly is detected, or the system fails to provide a navigational signal.

Project Guideline has been built specifically for Google Pixel phones with the Google Tensor chip. The Google Tensor chip enables the optimized ML models to run on-device with higher performance and lower power consumption. This is critical for providing real-time navigation instructions to the user with minimal delay. On a Pixel 8 there is a 28x latency improvement when running the depth model on the Tensor Processing Unit (TPU) instead of CPU, and 9x improvement compared to GPU.

Testing and simulation

Project Guideline includes a simulator that enables rapid testing and prototyping of the system in a virtual environment. Everything from the ML models to the audio feedback system runs natively within the simulator, giving the full Project Guideline experience without needing all the hardware and physical environment set up.

Screenshot of Project Guideline simulator.

Future direction

To launch the technology forward, WearWorks has become an early adopter and teamed up with Project Guideline to integrate their patented haptic navigation experience, utilizing haptic feedback in addition to sound to guide runners. WearWorks has been developing haptics for over 8 years, and previously empowered the first blind marathon runner to complete the NYC Marathon without sighted assistance. We hope that integrations like these will lead to new innovations and make the world a more accessible place.

The Project Guideline team is also working towards removing the painted line completely, using the latest advancements in mobile ML technology, such as the ARCore Scene Semantics API, which can identify sidewalks, buildings, and other objects in outdoor scenes. We invite the accessibility community to build upon and improve this technology while exploring new use cases in other fields.

Acknowledgements

Many people were involved in the development of Project Guideline and the technologies behind it. We’d like to thank Project Guideline team members: Dror Avalon, Phil Bayer, Ryan Burke, Lori Dooley, Song Chun Fan, Matt Hall, Amélie Jean-aimée, Dave Hawkey, Amit Pitaru, Alvin Shi, Mikhail Sirotenko, Sagar Waghmare, John Watkinson, Kimberly Wilber, Matthew Willson, Xuan Yang, Mark Zarich, Steven Clark, Jim Coursey, Josh Ellis, Tom Hoddes, Dick Lyon, Chris Mitchell, Satoru Arao, Yoojin Chung, Joe Fry, Kazuto Furuichi, Ikumi Kobayashi, Kathy Maruyama, Minh Nguyen, Alto Okamura, Yosuke Suzuki, and Bryan Tanaka. Thanks to ARCore contributors: Ryan DuToit, Abhishek Kar, and Eric Turner. Thanks to Alec Go, Jing Li, Liviu Panait, Stefano Pellegrini, Abdullah Rashwan, Lu Wang, Qifei Wang, and Fan Yang for providing ML platform support. We’d also like to thank Hartwig Adam, Tomas Izo, Rahul Sukthankar, Blaise Aguera y Arcas, and Huisheng Wang for their leadership support. Special thanks to our partners Guiding Eyes for the Blind and Achilles International.

Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Layout is a new feature that allows customers to automatically extract layout elements such as paragraphs, titles, subtitles, headers, footers, and more from documents. Layout extends Amazon Textract’s word and line detection by automatically grouping the text into these layout elements and sequencing them according to human reading patterns (that is, reading order from left to right and top to bottom).

Building document processing and understanding solutions for financial and research reports, medical transcriptions, contracts, media articles, and so on requires extraction of the information present in titles, headers, paragraphs, and other layout elements. For example, when cataloging financial reports in a document database, extracting and storing the title as a catalog index enables easy retrieval. Prior to the introduction of this feature, customers had to construct these elements using post-processing code and the words and lines response from Amazon Textract.

The complexity of implementing this code is amplified with documents with multiple columns and complex layouts. With this announcement, extraction of commonly occurring layout elements from documents becomes easier and allows customers to build efficient document processing solutions faster with less code.

In Sept 2023, Amazon Textract launched the Layout feature that automatically extracts layout elements such as paragraphs, titles, lists, headers, and footers and orders the text and elements as a human would read. We also released the updated version of the open source postprocessing toolkit, purpose-built for Amazon Textract, known as Amazon Textract Textractor.

In this post, we discuss how customers can take advantage of this feature for document processing workloads. We also discuss a qualitative study demonstrating how Layout improves generative artificial intelligence (AI) task accuracy for both abstractive and extractive tasks for document processing workloads involving large language models (LLMs).

Layout elements

Central to the Layout feature of Amazon Textract are the new Layout elements. The LAYOUT feature of the AnalyzeDocument API can now detect up to ten different layout elements on a document’s page. These layout elements are represented as block types in the response JSON and contain the confidence, geometry (that is, bounding box and polygon information), and Relationships, which is a list of IDs corresponding to the LINE block type.

  • Title – The main title of the document. Returned as LAYOUT_TITLE block type.
  • Header – Text located in the top margin of the document. Returned as LAYOUT_HEADER block type.
  • Footer – Text located in the bottom margin of the document. Returned as LAYOUT_FOOTER block type.
  • Section Title – The titles below the main title that represent sections in the document. Returned as LAYOUT_SECTION_HEADER block type.
  • Page Number – The page number of the documents. Returned as LAYOUT_PAGE_NUMBER block type.
  • List – Any information grouped together in list form. Returned as LAYOUT_LIST block type.
  • Figure – Indicates the location of an image in a document. Returned as LAYOUT_FIGURE block type.
  • Table – Indicates the location of a table in the document. Returned as LAYOUT_TABLE block type.
  • Key Value – Indicates the location of form key-value pairs in a document. Returned as LAYOUT_KEY_VALUE block type.
  • Text – Text that is present typically as a part of paragraphs in documents. It is a catch all for text that is not present in other elements. Returned as LAYOUT_TEXT block type.

Amazon Textract Layout Elements

Each layout element may contain one or more LINE relationships, and these lines constitute the actual textual content of the layout element (for example, LAYOUT_TEXT is typically a paragraph of text containing multiple LINEs). Layout elements appear in the API response in the same reading order as in the document, which makes it easy to construct the layout text from the API’s JSON response.
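
To make this structure concrete, the following is a minimal sketch (not part of the Textractor toolkit) that walks the raw AnalyzeDocument JSON response and collects the text of each layout element from its child LINE blocks, in reading order; it assumes the standard Blocks/Relationships layout of the Textract response.

def layout_elements_with_text(response):
    """Group child LINE text under each LAYOUT_* block, in reading order."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    elements = []
    for block in response["Blocks"]:
        if not block["BlockType"].startswith("LAYOUT_"):
            continue
        lines = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "CHILD":
                for child_id in relationship["Ids"]:
                    child = blocks_by_id[child_id]
                    if child["BlockType"] == "LINE":
                        lines.append(child["Text"])
        elements.append((block["BlockType"], " ".join(lines)))
    return elements

# Example usage with a response from AnalyzeDocument called with the LAYOUT feature
# for block_type, text in layout_elements_with_text(response):
#     print(f"{block_type}: {text[:80]}")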

Use cases of layout-aware extraction

Following are some of the common use cases for the new AnalyzeDocument LAYOUT feature.

  1. Extracting layout elements for search indexing and cataloging purposes. The contents of the LAYOUT_TITLE or LAYOUT_SECTION_HEADER, along with the reading order, can be used to appropriately tag or enrich metadata. This improves the context of a document in a document repository to improve search capabilities or organize documents.
  2. Summarize the entire document or parts of a document by extracting text in proper reading order and using the layout elements.
  3. Extracting specific parts of the document. For example, a document may contain a mix of images with text within it and other plaintext sections or paragraphs. You can now isolate the text sections using the LAYOUT_TEXT element.
  4. Better performance and accurate answers for in-context document Q&A and entity extractions using an LLM.

There are other possible document automation use cases where Layout can be useful. However, in this post we explain how to extract layout elements in order to help understand how to use the feature for traditional document automation solutions. We discuss the benefits of using Layout for a document Q&A use case with LLMs using a common method known as Retrieval Augmented Generation (RAG), and for an entity extraction use case. For the outcomes of both of these use cases, we present comparative scores that help differentiate the benefits of layout-aware text as opposed to plaintext.

To highlight the benefits, we ran tests to compare how plaintext extracted using raster scans with DetectDocumentText and layout-aware linearized text extracted using AnalyzeDocument with the LAYOUT feature impact the outcome of in-context Q&A outputs by an LLM. For this test, we used Anthropic’s Claude Instant model with Amazon Bedrock. For complex document layouts, generating text in the proper reading order and subsequently chunking it appropriately can be challenging. In the following sections, we discuss how to extract layout elements and linearize the text to build an LLM-based application. Specifically, we discuss the comparative evaluation of the responses generated by the LLM for a document Q&A application using raster scan–based plaintext and layout-aware linearized text.

Extracting layout elements from a page

The Amazon Textract Textractor toolkit can process a document through the AnalyzeDocument API with the LAYOUT feature and then exposes the detected layout elements through the page’s PAGE_LAYOUT property and its subproperties TITLES, HEADERS, FOOTERS, TABLES, KEY_VALUES, PAGE_NUMBERS, LISTS, and FIGURES. Each element has its own visualization function, allowing you to see exactly what was detected. To get started, install Textractor using

pip install amazon-textract-textractor

As demonstrated in the following code snippet, the document news_article.pdf is processed with the AnalyzeDocument API with LAYOUT feature. The response results in a variable document that contains each of the detected Layout blocks from the properties.

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")

input_document = "./news_article.pdf"

document = extractor.analyze_document(
                   file_source=input_document,
                   features=[TextractFeatures.LAYOUT],
                   save_image=True)

document.pages[0].visualize()
document.pages[0].page_layout.titles.visualize()
document.pages[0].page_layout.headers.visualize()

document.pages[0].page_layout.section_headers.visualize()
document.pages[0].page_layout.footers.visualize()
document.pages[0].page_layout.tables.visualize()
document.pages[0].page_layout.key_values.visualize()
document.pages[0].page_layout.page_numbers.visualize()
document.pages[0].page_layout.lists.visualize()
document.pages[0].page_layout.figures.visualize()

Layout visualization with Amazon Textract Textractor

See a more in-depth example in the official Textractor documentation.

Linearizing text from the layout response

To use the layout capabilities, Amazon Textract Textractor was extensively reworked for the 1.4 release to provide linearization with over 40 configuration options, allowing you to tailor the linearized text output to your downstream use case with little effort. The new linearizer supports all currently available AnalyzeDocument APIs, including forms and signatures, which lets you add selection items to the resulting text without making any code changes.

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

config = TextLinearizationConfig(
                         hide_figure_layout=True,
                         title_prefix="# ",
                         section_header_prefix="## ")

document = extractor.analyze_document(
                                 file_source=input_document,
                                 features=[TextractFeatures.LAYOUT],
                                 save_image=True)

print(document.get_text(config=config))

See this example and more in the official Textractor documentation.

We have also added a layout pretty printer to the library that allows you to call a single function by passing in the layout API response in JSON format and get the linearized text (by page) in return.

python -m pip install -q amazon-textract-prettyprinter

You have the option to format the text in markdown format, exclude text from within figures in the document, and exclude page header, footer, and page number extractions from the linearized output. You can also store the linearized output in plaintext format in your local file system or in an Amazon S3 location by passing the save_txt_path parameter. The following code snippet demonstrates sample usage:

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

textract_json = call_textract(
    input_document=input_document,
    features=[Textract_Features.LAYOUT, Textract_Features.TABLES],
)
layout = get_text_from_layout_json(
    textract_json=textract_json,
    exclude_figure_text=True,            # optional
    exclude_page_header=True,            # optional
    exclude_page_footer=True,            # optional
    exclude_page_number=True,            # optional
    save_txt_path="s3://bucket/prefix",  # optional
)

full_text = layout[1]
print(full_text)

Evaluating LLM performance on abstractive and extractive tasks

Layout-aware text is found to improve the performance and quality of text generated by LLMs. In particular, we evaluate two types of LLM tasks—abstractive and extractive tasks.

Abstractive tasks refer to assignments that require the AI to generate new text that is not directly found in the source material. Some examples of abstractive tasks include summarization and question answering. For these tasks, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate the performance of an LLM on question-answering tasks with respect to a set of ground truth data.

Extractive tasks refer to activities where the model identifies and extracts specific portions of the input text to construct a response. In these tasks, the model is focused on selecting relevant segments (such as sentences, phrases, or keywords) from the source material rather than generating new content. Some examples are named entity recognition (NER) and keyword extraction. For these tasks, we use Average Normalized Levenshtein Similarity (ANLS) on named entity recognition tasks based on the layout-linearized text extracted by Amazon Textract.

ROUGE score analysis on abstractive question-answering task

Our test is set up to perform in-context Q&A on a multicolumn document by extracting the text and then performing RAG to get answer responses from the LLM. We perform Q&A on a set of questions using the raster scan–based raw text and layout-aware linearized text. We then evaluate ROUGE metrics for each question by comparing the machine-generated response to the corresponding ground truth answer. In this case, the ground truth is the same set of questions answered by a human, which is considered as a control group.

In-context Q&A with RAG requires extracting text from the document, creating smaller chunks of the text, generating vector embeddings of the chunks, and subsequently storing them in a vector database. This is done so that the system can perform a relevance search with the question on the vector database to return chunks of text that are most relevant to the question being asked. These relevant chunks are then used to build the overall context and provided to the LLM so that it can accurately answer the question.

The following document, taken from the DocUNet: Document Image Unwarping via a Stacked U-Net dataset, is used for the test. This document is a multicolumn document with headers, titles, paragraphs, and images. We also defined a set of 20 questions answered by a human as a control group or ground truth. The same set of 20 questions was then used to generate responses from the LLM.

Sample document from DocUNet dataset

In the next step, we extract the text from this document using DetectDocumentText API and AnalyzeDocument API with LAYOUT feature. Since most LLMs have a limited token context window, we kept the chunk size small, about 250 characters with a chunk overlap of 50 characters, using LangChain’s RecursiveCharacterTextSplitter. This resulted in two separate sets of document chunks—one generated using the raw text and the other using the layout-aware linearized text. Both sets of chunks were stored in a vector database by generating vector embeddings using the Amazon Titan Embeddings G1 Text embedding model.

Chunking and embedding with Amazon Titan Embeddings G1 Text

The following code snippet generates the raw text from the document.

import textractcaller as tc
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string

plain_textract_json = call_textract(input_document = input_document)
plain_text = get_lines_string(textract_json = plain_textract_json)

print(plain_text)

The output (trimmed for brevity) looks like the following. The text reading order is incorrect due to the lack of layout awareness of the API, and the extracted text spans the text columns.

PHOTONICS FOR A BETTER WORLD
UNESCO ENDORSES
INTERNATIONAL DAY OF LIGHT
First celebration in 2018 will become an annual
reminder of photonics-enabled technologies
T he executive board of the United Nations Educational,
in areas such as science, culture, education, sustainable development,
Scientific, and Cultural Organization (UNESCO) has endorsed
medicine, communications, and energy.
a proposal to establish an annual International Day of Light
The final report of IYL 2015 was delivered to UNESCO in Paris
(IDL) as an extension of the highly successful International Year of
during a special meeting in October 2016. At this event, SPIE member
Light and Light-based Technologies (IYL 2015).
...

The visual of the reading order for raw text extracted by DetectDocumentText can be seen in the following image.

Visualization of raster scan reading order

The following code snippet generates the layout-linearized text from the document. You can use either method to generate the linearized text from the document using the latest version of Amazon Textract Textractor Python library.

import textractcaller as tc
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

layout_textract_json = call_textract(input_document = input_document,
                                     features = [Textract_Features.LAYOUT])
layout_text = get_text_from_layout_json(textract_json = layout_textract_json)[1]
print(layout_text)

The output (trimmed for brevity) looks like the following. The text reading order is preserved since we used the LAYOUT feature, and the text makes more sense.

PHOTONICS FOR A BETTER WORLD

UNESCO ENDORSES INTERNATIONAL DAY OF LIGHT

First celebration in 2018 will become an annual
reminder of photonics-enabled technologies

T he executive board of the United Nations Educational,
Scientific, and Cultural Organization (UNESCO) has endorsed
a proposal to establish an annual International Day of Light
(IDL) as an extension of the highly successful International Year of
Light and Light-based Technologies (IYL 2015).
The endorsement for a Day of Light has been
embraced by SPIE and other founding partners of
IYL 2015.
...

The visual of the reading order for raw text extracted by AnalyzeDocument with LAYOUT feature can be seen in the following image.

Visualization of layout aware reading order

We performed chunking on both the extracted text separately, with a chunk size of 250 and an overlap of 50.

Next, we generate vector embeddings for the chunks and load them into a vector database in two separate collections. We used open source ChromaDB as our in-memory vector database and a topK value of 3 for the relevance search. This means that for every question, our relevance search query with ChromaDB returns the 3 most relevant chunks of text of size 250 each. These three chunks are then used to build the context for the LLM. We intentionally chose a smaller chunk size and smaller topK to build the context for the following specific reasons.

  1. Shorten the overall size of our context since research suggests that LLMs tend to perform better with shorter context, even though the model supports longer context (through a larger token context window).
  2. Smaller overall prompt size results in lower overall text generation model latency. The larger the overall prompt size (which includes the context), the longer it may take the model to generate a response.
  3. Comply with the model’s limited token context window, as is the case with most LLMs.
  4. Cost efficiency since using fewer tokens means lower cost per question for input and output tokens combined.

Note that Anthropic Claude Instant v1 does support a 100,000 token context window via Amazon Bedrock. We intentionally limited ourselves to a smaller chunk size since that also makes the test relevant to models with fewer parameters and overall shorter context windows.
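
The retrieval flow described above can be sketched as follows, assuming the layout-linearized text generated earlier is available in layout_text; it uses LangChain’s RecursiveCharacterTextSplitter, the Amazon Titan Embeddings G1 Text model through the Bedrock runtime, and an in-memory ChromaDB collection. The model ID, collection name, and sample question are illustrative assumptions, not the exact code used for this study.

import json

import boto3
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter

bedrock = boto3.client("bedrock-runtime")

def titan_embed(text):
    # Assumption: Titan Embeddings G1 - Text model ID; the response carries an "embedding" list
    body = json.dumps({"inputText": text})
    response = bedrock.invoke_model(modelId="amazon.titan-embed-text-v1", body=body)
    return json.loads(response["body"].read())["embedding"]

# 1. Chunk the layout-linearized text
splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)
chunks = splitter.split_text(layout_text)

# 2. Embed the chunks and store them in an in-memory vector database
collection = chromadb.Client().create_collection("layout_chunks")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=[titan_embed(chunk) for chunk in chunks],
)

# 3. Retrieve the top-3 most relevant chunks for a question to build the LLM context
question = "What did the UNESCO executive board endorse?"
results = collection.query(query_embeddings=[titan_embed(question)], n_results=3)
context = "\n".join(results["documents"][0])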

We used ROUGE metrics to evaluate machine-generated text against a reference text (or ground truth), measuring various aspects like the overlap of n-grams, word sequences, and word pairs between the two texts. We chose three ROUGE metrics for evaluation.

  1. ROUGE-1: Compares the overlap of unigrams (single words) between the generated text and a reference text.
  2. ROUGE-2: Compares the overlap of bigrams (two-word sequences) between the generated text and a reference text.
  3. ROUGE-L: Measures the longest common subsequence (LCS) between the generated text and a reference text, focusing on the longest sequence of words that appear in both texts, albeit not necessarily consecutively.

ROUGE Score calculations
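
As a small illustration of how these metrics can be computed (not the exact evaluation harness used for this study), the open source rouge-score package scores a generated answer against a reference:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Illustrative reference (ground truth) and machine-generated answer
reference = "UNESCO endorsed an annual International Day of Light starting in 2018."
generated = "The International Day of Light, first celebrated in 2018, was endorsed by UNESCO."

scores = scorer.score(reference, generated)
for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")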

For our 20 sample questions relevant to the document, we ran Q&A with the raw text and the linearized text, respectively, and then ran the ROUGE score analysis. We noticed an almost 50 percent average improvement in precision overall, and there was a significant improvement in F1 scores when layout-linearized text was compared to ground truth as opposed to when raw text was compared to ground truth.

This suggests that the model became better at generating correct responses with the help of linearized text and smaller chunking. This led to an increase in precision, and the balance between precision and recall shifted favorably toward precision, which in turn increased the F1 score. It’s essential to consider the practical implications of these metric changes. For instance, in a scenario where false positives are costly, the increase in precision is highly beneficial.

ROUGE plot on Q&A task result with Layout

ANLS score analysis on extractive tasks over academic datasets

We measure the ANLS or the Average Normalized Levenshtein Similarity, which is an edit distance metric that was introduced by the paper Scene Text Visual Question Answering and aims to softly penalize minor OCR imperfections while considering the model’s reasoning abilities at the same time. This metric is a derivative version of traditional Levenshtein distance, which is a measure of the difference between two sequences (such as strings). It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
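
To make the metric concrete, the following is a minimal from-scratch sketch of ANLS; the 0.5 threshold below which a prediction scores zero follows the formulation in the Scene Text Visual Question Answering paper, and the example strings are illustrative.

def levenshtein(a, b):
    """Minimum number of single-character edits to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def anls(prediction, ground_truths, threshold=0.5):
    """ANLS: max over ground truths of (1 - NL), zeroed when NL >= threshold."""
    best = 0.0
    for truth in ground_truths:
        pred, gt = prediction.lower().strip(), truth.lower().strip()
        nl = levenshtein(pred, gt) / max(len(pred), len(gt), 1)
        score = 1.0 - nl if nl < threshold else 0.0
        best = max(best, score)
    return best

# Minor OCR imperfections are penalized softly rather than scored as wrong
print(anls("Internationl Day of Light", ["International Day of Light"]))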

For our ANLS tests, we performed an NER task where the LLM was prompted to extract the exact value from the OCR-extracted text. The two academic datasets used for the tests are DocVQA and InfographicVQA. We used zero-shot prompting to attempt extraction of key entities. The prompt used for the LLMs is of the following structure.

template = """You are asked to answer a question using only the provided Document.

The answer to the question should be taken as-is from the document and as short as possible.

Document:\n{document}

Question: {question}

Extract the answer from the document with as few words as possible."""

Accuracy improvements were observed in all document question-answering datasets tested with the open source FlanT5-XL model when using layout-aware linearized text, as opposed to raw text (raster scan), in response to zero-shot prompts. In the InfographicVQA dataset, using layout-aware linearized text enables the smaller 3B parameter FlanT5-XL model to match the performance of the larger FlanT5-XXL model (on raw text), which has nearly four times as many parameters (11B).

Dataset | Model | Not Layout-aware (Raster) ANLS* | Layout-aware ANLS* | Δ
DocVQA | FlanT5-XL (3B) | 66.03% | 68.46% | 1.43%
DocVQA | FlanT5-XXL (11B) | 70.71% | 72.05% | 1.34%
InfographicsVQA | FlanT5-XL (3B) | 29.47% | 35.76% | 6.29%
InfographicsVQA | FlanT5-XXL (11B) | 37.82% | 45.61% | 7.79%

* ANLS is measured on text extracted by Amazon Textract, not the provided document transcription

Conclusion

The launch of Layout marks a significant advancement in using Amazon Textract to build document automation solutions. As discussed in this post, Layout uses traditional and generative AI methods to improve efficiencies when building a wide variety of document automation solutions such as document search, contextual Q&A, summarization, key-entities extraction, and more. As we continue to embrace the power of AI in building document processing and understanding systems, these enhancements will no doubt pave the way for more streamlined workflows, higher productivity, and more insightful data analysis.

For more information on the Layout feature and how to take advantage of the feature for document automation solutions, refer to AnalyzeDocument, Layout analysis, and Text linearization for generative AI applications documentation.


About the Authors

Anjan Biswas is a Senior AI Services Solutions Architect who focuses on computer vision, NLP, and generative AI. Anjan is part of the worldwide AI services specialist team and works with customers to help them understand and develop solutions to business problems with AWS AI Services and generative AI.

Lalita Reddi is a Senior Technical Product Manager with the Amazon Textract team. She is focused on building machine learning–based services for AWS customers. In her spare time, Lalita likes to play board games and go on hikes.

Edouard Belval is a Research Engineer in the computer vision team at AWS. He is the main contributor behind the Amazon Textract Textractor library.

NVIDIA Collaborates With Genentech to Accelerate Drug Discovery Using Generative AI

Genentech, a member of the Roche Group, is pioneering the use of generative AI to discover and develop new therapeutics and deliver treatments to patients more efficiently.

A new collaboration between Genentech, the biotechnology pioneer, and NVIDIA aims to transform the discovery and development of new medicines by bringing together experts from each company to optimize and accelerate Genentech’s proprietary algorithms.

NVIDIA will work with Genentech to accelerate these models on NVIDIA DGX Cloud, which provides dedicated instances of AI supercomputing and software hosted by NVIDIA cloud service provider partners.

Genentech plans to use NVIDIA BioNeMo, which enables biotech companies to customize models at scale, and integrate BioNeMo cloud application programming interfaces directly into computational drug discovery workflows.

BioNeMo, now generally available as a training service, is a domain-specific platform that simplifies, accelerates and scales generative AI applications for computational drug discovery. It allows researchers to pretrain or fine-tune state-of-the-art models on DGX Cloud.

The collaboration will initially focus on optimizing Genentech’s drug discovery AI models in its “lab in a loop” framework. The goal: To allow its researchers to understand complex biomolecular patterns and relationships to truly disrupt drug development and improve the success rate of R&D, and to empower scientists to deliver multiplicative, rather than linear or additive, benefits for patients and the broader healthcare ecosystem.

“Our collaboration with NVIDIA builds on our long history of successfully inventing and deploying technology in ways that were not initially apparent to others,” said Aviv Regev, executive vice president and head of Genentech Research & Early Development (gRED). “We were the first biotech company to leverage molecular biology for drug discovery and development, which changed the world. We pioneered antibody therapeutics that became the paradigm of treatment. And now, we have brought AI, the lab and the clinic together to uncover otherwise inaccessible patterns in vast quantities of data, and to design experiments to test those patterns. Collaborating with NVIDIA, and introducing generative AI, has the power to turbocharge the discovery and design of therapeutics that will improve the lives of patients across the world.”

Streamlining Drug Discovery With Computation  

Drug discovery and development is currently a lengthy, complicated and costly process. Drug targets for novel medicines are difficult to predict, as is successfully developing a molecule as a potential therapeutic. AI can play a transformational role because generative and other AI models can help scientists rapidly identify potential drug molecules and interactions by training on large-scale datasets.

For Genentech, using AI helps bridge the gap between lab experiments and computational algorithms.

The company’s R&D group, gRED, has already done significant work using AI — across multiple modalities — to discover and develop novel therapeutics while learning more about the building blocks of biology and diseases.

Teams from Genentech and NVIDIA will now work together to optimize Genentech’s custom-developed models to shorten this time-consuming process of drug discovery and development and lead to greater success.

Putting AI in a Loop

Genentech’s “lab in a loop” is an iterative framework for generating and exploring molecular designs with predicted properties. It aims to use experimental data to inform generative computational models and better optimize future molecular designs. NVIDIA will help Genentech optimize its framework by accelerating training and inference of Genentech’s drug discovery models.

Through this collaboration, NVIDIA AI experts will gain insights into AI-related challenges in drug discovery and development. NVIDIA plans to use these insights to improve its BioNeMo platform and others to further accommodate the requirements of models used by the biotech industry.

“AI can play a transformational role in accelerating drug discovery and development — as it has across many parts of healthcare and life sciences,” said Kimberly Powell, vice president of healthcare at NVIDIA. “Together, NVIDIA and Genentech are unlocking scientific innovation by developing and implementing AI models and algorithms that enable us to rapidly iterate and unearth insights.”

Subscribe to NVIDIA healthcare news.

3D Artist Cooks Up Stunningly Photorealistic Food Renders This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. 

It’s the season of gratitude: that time of year to give thanks for the people and small moments that make life so special.

This week’s featured In the NVIDIA Studio artist, Ravissen Carpenen, is serving up a feast of mouthwateringly photorealistic 3D food renders to the dinner table.

His delectable time-lapse videos are featured on his YouTube channel, CG Realism — presented with a side of upbeat music and a pinch of style.

Carpenen was one of several contributors to the food-themed Studio Standout video contest, alongside Roger Roque (@rogerroqueid), Nicole Morena (@nicky.blender), Heloise Cart (@isoheell) and Kris Theroin (@kristheorin).

Finally, livestreamers using OBS Studio — a free, open-source software for video recording and livestreaming — can download the latest update with HDR10 capture support, WHIP and WebRTC output and more. Learn more details.

All About That Baste

Carpenen’s wife, a pastry chef, inspired his photorealistic, food-centered works.

“My aim, one day, is to be able to create ultra-realistic renders that will be used in film and movies,” he said.

His projects begin with online research and reference gathering, mainly on Pinterest and Behance, which he then compiles into mood boards using the stand-alone image tracking program PureRef.

Before any modeling takes place, Carpenen lights the scene — but without textures.

“This is to tell the story of the artwork, as light intends to give artwork an emotional flow, alongside color and materials,” he said.

Carpenen initially sculpts his models in ZBrush using customizable brushes to shape, texture and paint his virtual clay in a real-time environment with instant feedback.

 

He then browses the Quixel Megascans library for models that can further add realism, such as garlic cloves and rosemary garnishes for his turkey project.

Rare-in for More

Carpenen uses Marmoset Toolbag’s ambient occlusion, curvature, normal and thickness features to bake the ZBrush meshes from high-poly to low-poly models as 32-bit textures.

The process saves memory space, minimizing lag time while allowing greater flexibility in the modeling stage.

Bake ZBrush meshes from high-poly to low-poly models as 32-bit textures in Marmoset Toolbag.

Carpenen’s GeForce RTX 3070 GPU-powered system with RTX acceleration instantly optimized his meshes. RTX-accelerated ray tracing and OptiX AI-powered denoising also enabled smoother viewport movement.

Baking a Berry Good Pie

Once the renders are ready, Carpenen imports them into Adobe Substance 3D Painter to apply custom colors and textures.

There, Carpenen uses RTX-accelerated light and ambient occlusion baking — though not in the oven — to optimize his assets, such as this berry pie, in mere seconds.

 

He also had the option to set up a live link connection between 3D Painter and NVIDIA Omniverse, a development platform for connecting and building Universal Scene Description (OpenUSD)-based tools and applications, via the USD Composer foundation app.

The connection would allow Carpenen’s texture work in Substance 3D Painter to directly translate to USD Composer — eliminating the need for numerous file imports, exports and reformatting.

Donut Hole in One

Carpenen uses Blender to bring his scenes together with advanced model sculpting, animations and further lighting refinement.

RTX GPUs allow smoother movement in the viewport thanks to Blender Cycles’ RTX-accelerated OptiX ray tracing.

Beautifully rendered donuts make us go nuts.

And for exporting final files, RTX-accelerated OptiX ray tracing in Blender Cycles delivers the fastest final-frame render.

It doesn’t get any sweeter than this.

Thanks to AI, This Work Is Toast

Carpenen uses Adobe Photoshop Lightroom to put the finishing touches on his food scenes.

GPU-accelerated image processing enables dramatically more responsive adjustments on his 4K-resolution display.

Carpenen had even more RTX-accelerated AI tools at his disposal in Lightroom. The “Raw Details” feature refines the fine color details of high-resolution RAW images. And “Super Resolution” uses AI to upscale images with higher quality than traditional methods.

According to Carpenen, putting in the work is key.

“It’s equivalent to practicing football — if you don’t get enough time daily to practice, you can’t hone skills,” he said. “It’s important to know how to tackle obstacles — and that knowledge can only be gained by experience.”

Digital 3D artist Ravissen Carpenen’s logo and signature work.

Check out Carpenen’s YouTube channel, CG Realism, and his ArtStation to see more food projects.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

Use Amazon SageMaker Studio to build a RAG question answering solution with Llama 2, LangChain, and Pinecone for fast experimentation

Use Amazon SageMaker Studio to build a RAG question answering solution with Llama 2, LangChain, and Pinecone for fast experimentation

Retrieval Augmented Generation (RAG) allows you to provide a large language model (LLM) with access to data from external knowledge sources such as repositories, databases, and APIs without the need to fine-tune it. When using generative AI for question answering, RAG enables LLMs to answer questions with the most relevant, up-to-date information and optionally cite their data sources for verification.

A typical RAG solution for knowledge retrieval from documents uses an embeddings model to convert the data from the data sources to embeddings and stores these embeddings in a vector database. When a user asks a question, it searches the vector database and retrieves documents that are most similar to the user’s query. Next, it combines the retrieved documents and the user’s query in an augmented prompt that is sent to the LLM for text generation. There are two models in this implementation: the embeddings model and the LLM that generates the final response.

In this post, we demonstrate how to use Amazon SageMaker Studio to build a RAG question answering solution.

Using notebooks for RAG-based question answering

Implementing RAG typically entails experimenting with various embedding models, vector databases, text generation models, and prompts, while also debugging your code until you achieve a functional prototype. Amazon SageMaker offers managed Jupyter notebooks equipped with GPU instances, enabling you to rapidly experiment during this initial phase without spinning up additional infrastructure. There are two options for using notebooks in SageMaker. The first option is fast launch notebooks available through SageMaker Studio. In SageMaker Studio, the integrated development environment (IDE) purpose-built for ML, you can launch notebooks that run on different instance types and with different configurations, collaborate with colleagues, and access additional purpose-built features for machine learning (ML). The second option is using a SageMaker notebook instance, which is a fully managed ML compute instance running the Jupyter Notebook app.

In this post, we present a RAG solution that augments the model’s knowledge with additional data from external knowledge sources to provide more accurate responses specific to a custom domain. We use a single SageMaker Studio notebook running on an ml.g5.2xlarge instance (1 A10G GPU) and Llama 2 7b chat hf, the fine-tuned version of Llama 2 7b, which is optimized for dialog use cases from Hugging Face Hub. We use two AWS Media & Entertainment Blog posts as the sample external data, which we convert into embeddings with the BAAI/bge-small-en-v1.5 embeddings. We store the embeddings in Pinecone, a vector-based database that offers high-performance search and similarity matching. We also discuss how to transition from experimenting in the notebook to deploying your models to SageMaker endpoints for real-time inference when you complete your prototyping. The same approach can be used with different models and vector databases.

Solution overview

The following diagram illustrates the solution architecture.

Implementing the solution consists of two high-level steps: developing the solution using SageMaker Studio notebooks, and deploying the models for inference.

Develop the solution using SageMaker Studio notebooks

Complete the following steps to start developing the solution:

  1. Load the Llama-2 7b chat model from Hugging Face Hub in the notebook.
  2. Create a PromptTemplate with LangChain and use it to create prompts for your use case.
  3. For 1–2 example prompts, add relevant static text from external documents as prompt context and assess if the quality of the responses improves.
  4. Assuming that the quality improves, implement the RAG question answering workflow:
    • Gather the external documents that can help the model better answer the questions in your use case.
    • Load the BGE embeddings model and use it to generate embeddings of these documents.
    • Store these embeddings in a Pinecone index.
    • When a user asks a question, perform a similarity search in Pinecone and add the content from the most similar documents to the prompt’s context.

Deploy the models to SageMaker for inference at scale

When you hit your performance goals, you can deploy the models to SageMaker to be used by generative AI applications:

  1. Deploy the Llama-2 7b chat model to a SageMaker real-time endpoint.
  2. Deploy the BAAI/bge-small-en-v1.5 embeddings model to a SageMaker real-time endpoint.
  3. Use the deployed models in your question answering generative AI applications.

In the following sections, we walk you through the steps of implementing this solution in SageMaker Studio notebooks.

Prerequisites

To follow the steps in this post, you need to have an AWS account and an AWS Identity and Access Management (IAM) role with permissions to create and access the solution resources. If you are new to AWS, see Create a standalone AWS account.

To use SageMaker Studio notebooks in your AWS account, you need a SageMaker domain with a user profile that has permissions to launch the SageMaker Studio app. If you are new to SageMaker Studio, the Quick Studio setup is the fastest way to get started. With a single click, SageMaker provisions the SageMaker domain with default presets, including setting up the user profile, IAM role, IAM authentication, and public internet access. The notebook for this post assumes an ml.g5.2xlarge instance type. To review or increase your quota, open the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for Studio KernelGateway apps running on ml.g5.2xlarge instances.

After confirming your quota limit, you need to complete the dependencies to use Llama 2 7b chat.

Llama 2 7b chat is available under the Llama 2 license. To access Llama 2 on Hugging Face, you need to complete a few steps first:

  1. Create a Hugging Face account if you don’t have one already.
  2. Complete the form “Request access to the next version of Llama” on the Meta website.
  3. Request access to Llama 2 7b chat on Hugging Face.

After you have been granted access, you can create a new access token to access models. To create an access token, navigate to the Settings page on the Hugging Face website.
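The token is later passed to from_pretrained as hf_access_token. As a minimal sketch (the getpass prompt and the optional huggingface_hub login are our additions, not steps from the original notebook), you can capture the token at the top of the notebook:

from getpass import getpass

# Paste the access token created on the Hugging Face Settings page.
hf_access_token = getpass("Hugging Face access token: ")

# Optionally, authenticate the whole session so other Hub calls also pick up the token.
# from huggingface_hub import login
# login(token=hf_access_token)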

You need to have an account with Pinecone to use it as a vector database. Pinecone is available on AWS via the AWS Marketplace. The Pinecone website also offers the option to create a free account that comes with permissions to create a single index, which is sufficient for the purposes of this post. To retrieve your Pinecone keys, open the Pinecone console and choose API Keys.
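The Pinecone initialization code later in this post reads the key and environment from the PINECONE_API_KEY and PINECONE_ENV environment variables. The following is only a sketch of setting them in the notebook; the placeholder values are assumptions you replace with your own keys:

import os

os.environ["PINECONE_API_KEY"] = "<your-pinecone-api-key>"      # from the Pinecone console, API Keys page
os.environ["PINECONE_ENV"] = "<your-pinecone-environment>"      # the environment shown next to your key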

Set up the notebook and environment

To follow the code in this post, open SageMaker Studio and clone the following GitHub repository. Next, open the notebook studio-local-gen-ai/rag/RAG-with-Llama-2-on-Studio.ipynb and choose the PyTorch 2.0.0 Python 3.10 GPU Optimized image, Python 3 kernel, and ml.g5.2xlarge as the instance type. If this is your first time using SageMaker Studio notebooks, refer to Create or Open an Amazon SageMaker Studio Notebook.

To set up the development environment, you need to install the necessary Python libraries, as demonstrated in the following code:

%%writefile requirements.txt
sagemaker>=2.175.0
transformers==4.33.0
accelerate==0.21.0
datasets==2.13.0
langchain==0.0.297
pypdf>=3.16.3
pinecone-client
sentence_transformers
safetensors>=0.3.3
!pip install -U -r requirements.txt

Load the pre-trained model and tokenizer

After you have imported the required libraries, you can load the Llama-2 7b chat model along with its corresponding tokenizers from Hugging Face. These loaded model artifacts are stored in the local directory within SageMaker Studio. This enables you to swiftly reload them into memory whenever you need to resume your work at a different time.

import torch

from transformers import (
	AutoTokenizer,
	LlamaTokenizer,
	LlamaForCausalLM,
	GenerationConfig,
	AutoModelForCausalLM
)
import transformers

tg_model_id = "meta-llama/Llama-2-7b-chat-hf" #the model id in Hugging Face
tg_model_path = f"./tg_model/{tg_model_id}" #the local directory where the model will be saved

tg_model = AutoModelForCausalLM.from_pretrained(tg_model_id, token=hf_access_token, do_sample=True, use_safetensors=True, device_map="auto", torch_dtype=torch.float16)
tg_tokenizer = AutoTokenizer.from_pretrained(tg_model_id, token=hf_access_token)

tg_model.save_pretrained(save_directory=tg_model_path, from_pt=True)
tg_tokenizer.save_pretrained(save_directory=tg_model_path, from_pt=True)
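Because the artifacts are saved under tg_model_path, a later session can reload them from local disk instead of downloading from the Hub again. A minimal sketch, assuming the directory produced by the save calls above:

# Reload the model and tokenizer from the local SageMaker Studio directory.
tg_model = AutoModelForCausalLM.from_pretrained(tg_model_path, device_map="auto", torch_dtype=torch.float16, use_safetensors=True)
tg_tokenizer = AutoTokenizer.from_pretrained(tg_model_path)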

Ask a question that requires up-to-date information

You can now start using the model and ask questions. Llama-2 chat models expect the prompt to adhere to the following format:

<s>[INST] <<SYS>>
system_prompt
<</SYS>>
{{ user_message }} [/INST]

You can use the PromptTemplate from LangChain to create a recipe based on the prompt format, so that you can easily create prompts going forward:

from langchain import PromptTemplate

template = """<s>[INST] <<SYS>>nYou are an assistant for question-answering tasks. You are helpful and friendly. Use the following pieces of retrieved context to answer the query. If you don't know the answer, you just say I don't know. Use three sentences maximum and keep the answer concise.
<<SYS>>n
{context}n
{question} [/INST]
"""
prompt_template = PromptTemplate( template=template, input_variables=['context','question'] )

Let’s ask the model a question that needs recent information from 2023. You can use LangChain and specifically the LLMChain type of chain and pass as parameters the LLM, the prompt template you created earlier, and the question:

question = "When can I visit the AWS M&E Customer Experience Center in New York City?"

tg_tokenizer.add_special_tokens( {"pad_token": "[PAD]"} )
tg_tokenizer.padding_side = "left"

tg_pipe = transformers.pipeline(task='text-generation', model=tg_model, tokenizer=tg_tokenizer, num_return_sequences=1, eos_token_id=tg_tokenizer.eos_token_id, pad_token_id=tg_tokenizer.eos_token_id, max_new_tokens=400, temperature=0.7)

from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=tg_pipe, model_kwargs={'temperature':0.7})
llm_chain = LLMChain(llm=llm, prompt=prompt_template)
no_context_response = llm_chain.predict(context="", question=question)
print(no_context_response)

We get the following generated answer:

Thank you for reaching out! The AWS M&E Customer Experience Center in New York City is currently closed for visits due to the COVID-19 pandemic. However, you can check their official website or social media accounts for any updates on when the center will reopen. In the meantime, you can explore their virtual tours and resources available online.

Improve the answer by adding context to the prompt

The answer we generated is not entirely true. Let’s see if we can improve it by providing some context. You can add an extract from the post AWS announces new M&E Customer Experience Center in New York, which includes updates on the topic from 2023:

context = """Media and entertainment (M&E) customers continue to face challenges in creating more content, more quickly, and distributing it to more endpoints than ever before in their quest to delight viewers globally. Amazon Web Services (AWS), along with AWS Partners, have showcased the rapid evolution of M&E solutions for years at industry events like the National Association of Broadcasters (NAB) Show and the International Broadcast Convention (IBC). Until now, AWS for M&E technology demonstrations were accessible in this way just a few weeks out of the year. Customers are more engaged than ever before; they want to have higher quality conversations regarding user experience and media tooling. These conversations are best supported by having an interconnected solution architecture for reference. Scheduling a visit of the M&E Customer Experience Center will be available starting November 13th, please send an email to AWS-MediaEnt-CXC@amazon.com."""

Use the LLMChain again and pass the preceding text as context:

context_response = llm_chain.predict(context=context, question=question)
print(context_response)

The new response answers the question with up-to-date information:

You can visit the AWS M&E Customer Experience Center in New York City starting from November 13th. Please send an email to AWS-MediaEnt-CXC@amazon.com to schedule a visit.

We have confirmed that by adding the right context, the model’s performance is improved. Now you can focus your efforts on finding and adding the right context for the question asked. In other words, implement RAG.

Implement RAG question answering with BGE embeddings and Pinecone

At this juncture, you must decide on the sources of information to enhance the model’s knowledge. These sources could be internal webpages or documents within your organization, or publicly available data sources. For the purposes of this post and for the sake of simplicity, we have chosen two AWS Blog posts published in 2023:

  • AWS announces new M&E Customer Experience Center in New York City
  • AWS Media Services awarded industry accolades

These posts are already available as PDF documents in the data project directory in SageMaker Studio for quick access. To divide the documents into manageable chunks, you can employ the RecursiveCharacterTextSplitter method from LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()

text_splitter=RecursiveCharacterTextSplitter(
     chunk_size=1000,
     chunk_overlap=5
)
docs = text_splitter.split_documents(documents)

Next, use the BGE embeddings model bge-small-en created by the Beijing Academy of Artificial Intelligence (BAAI) that is available on Hugging Face to generate the embeddings of these chunks. Download and save the model in the local directory in Studio. We use fp32 so that it can run on the instance’s CPU.

em_model_name = "BAAI/bge-small-en"
em_model_path = f"./em-model"

from transformers import AutoModel
# Load model from HuggingFace Hub
em_model = AutoModel.from_pretrained(em_model_name,torch_dtype=torch.float32)
em_tokenizer = AutoTokenizer.from_pretrained(em_model_name)

# save model to disk
em_tokenizer.save_pretrained(save_directory=f"{em_model_path}/model",from_pt=True)
em_model.save_pretrained(save_directory=f"{em_model_path}/model",from_pt=True)
em_model.eval()

Use the following code to create an embedding_generator function, which takes the document chunks as input and generates the embeddings using the BGE model:

# Tokenize sentences
def tokenize_text(_input, device):
    return em_tokenizer(
        [_input], 
        padding=True, 
        truncation=True, 
        return_tensors='pt'
    ).to(device)

# Run embedding task as a function with model and text sentences as input
def embedding_generator(_input, normalize=True):
    # Compute token embeddings
    with torch.no_grad():
        embedded_output = em_model(
            **tokenize_text(
                _input, 
                em_model.device
            )
        )
        sentence_embeddings = embedded_output[0][:, 0]
        # normalize embeddings
        if normalize:
            sentence_embeddings = torch.nn.functional.normalize(
                sentence_embeddings, 
                p=2, 
                dim=1
            )
    
    return sentence_embeddings[0, :].tolist()
    
sample_sentence_embedding = embedding_generator(docs[0].page_content)
print(f"Embedding size of the document --->", len(sample_sentence_embedding))
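Because the embeddings are L2-normalized, the cosine similarity between two chunks reduces to a dot product. The following quick check is our own illustration (not part of the original notebook) for sanity-checking the embedding function:

import numpy as np

emb_a = np.array(embedding_generator(docs[0].page_content))
emb_b = np.array(embedding_generator(docs[1].page_content))

# The embeddings are already normalized, so the dot product equals the cosine similarity.
cosine_similarity = float(np.dot(emb_a, emb_b))
print(f"Cosine similarity between the first two chunks: {cosine_similarity:.4f}")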

In this post, we demonstrate a RAG workflow using Pinecone, a managed, cloud-native vector database that also offers an API for similarity search. You are free to rewrite the following code to use your preferred vector database.

We initialize a Pinecone Python client and create a new vector search index using the embedding model’s output length. We use LangChain’s built-in Pinecone class to ingest the embeddings we created in the previous step. It needs three parameters: the documents to ingest, the embeddings generator function, and the name of the Pinecone index.

import os
import pinecone
pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)
#check if index already exists, if not we create it
index_name = "rag-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(sample_sentence_embedding), ## 384 for bge-small-en 
        metric='cosine'
    )

#insert the embeddings
from langchain.vectorstores import Pinecone
vector_store = Pinecone.from_documents(
    docs,
    embedding_generator,
    index_name=index_name
)

With the Llama-2 7B chat model loaded into memory and the embeddings integrated into the Pinecone index, you can now combine these elements to enhance Llama 2’s responses for our question-answering use case. To achieve this, you can employ the LangChain RetrievalQA, which augments the initial prompt with the most similar documents from the vector store. By setting return_source_documents=True, you gain visibility into the exact documents used to generate the answer as part of the response, allowing you to verify the accuracy of the answer.

from langchain.chains import RetrievalQA
import textwrap

#helper method to improve the readability of the response
def print_response(llm_response):
    temp = [textwrap.fill(line, width=100) for line in llm_response['result'].split('\n')]
    response = '\n'.join(temp)
    print(f"{llm_response['query']}\n\n{response}\n\nSource Documents:")
    for source in llm_response["source_documents"]:
        print(source.metadata)

llm_qa_chain = RetrievalQA.from_chain_type(
    llm=llm, #the Llama-2 7b chat model
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}), # perform similarity search in Pinecone
    return_source_documents=True, #show the documents that were used to answer the question
    chain_type_kwargs={"prompt": prompt_template}
)
print_response(llm_qa_chain(question))

We get the following answer:

Q: When can I visit the AWS M&E Customer Experience Center in New York City?

A: I’m happy to help! According to the context, the AWS M&E Customer Experience Center in New York City will be available for visits starting on November 13th. You can send an email to AWS-MediaEnt-CXC@amazon.com to schedule a visit.’

Source Documents:

{‘page’: 4.0, ‘source’: ‘data/AWS announces new M&E Customer Experience Center in New York City _ AWS for M&E Blog.pdf’}

{‘page’: 2.0, ‘source’: ‘data/AWS announces new M&E Customer Experience Center in New York City _ AWS for M&E Blog.pdf’}

Let’s try a different question:

question2=" How many awards have AWS Media Services won in 2023?"
print_response(llm_qa_chain(question2))

We get the following answer:

Q: How many awards have AWS Media Services won in 2023?

A: According to the blog post, AWS Media Services have won five industry awards in 2023.’

Source Documents:

{‘page’: 0.0, ‘source’: ‘data/AWS Media Services awarded industry accolades _ AWS for M&E Blog.pdf’}

{‘page’: 1.0, ‘source’: ‘data/AWS Media Services awarded industry accolades _ AWS for M&E Blog.pdf’}

After you have established a sufficient level of confidence, you can deploy the models to SageMaker endpoints for real-time inference. These endpoints are fully managed and offer support for auto scaling.

SageMaker offers large model inference using Large Model Inference containers (LMIs), which we can utilize to deploy our models. These containers come equipped with pre-installed open source libraries like DeepSpeed, facilitating the implementation of performance-enhancing techniques such as tensor parallelism during inference. Additionally, they use DJLServing as a pre-built integrated model server. DJLServing is a high-performance, universal model-serving solution that offers support for dynamic batching and worker auto scaling, thereby increasing throughput.

In our approach, we use the SageMaker LMI with DJLServing and DeepSpeed Inference to deploy the Llama-2-chat 7b and BGE models to SageMaker endpoints running on ml.g5.2xlarge instances, enabling real-time inference. If you want to follow these steps yourself, refer to the accompanying notebook for detailed instructions.
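The accompanying notebook contains the exact deployment code. As a rough sketch only, the general pattern with the SageMaker SDK looks like the following; the LMI container version, the S3 location of the model artifacts (a tarball that typically includes a serving.properties file), and the bucket name are assumptions, not the notebook's actual values:

import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

role = sagemaker.get_execution_role()
region = sagemaker.Session().boto_region_name

# Look up a DJLServing + DeepSpeed large model inference (LMI) container image.
# The version here is an assumption; check the currently available LMI versions.
lmi_image_uri = image_uris.retrieve(framework="djl-deepspeed", region=region, version="0.23.0")

# model_data points to a tarball with serving.properties (and optionally the weights); the path is a placeholder.
tg_sm_model = Model(
    image_uri=lmi_image_uri,
    model_data="s3://<your-bucket>/llama-2-7b-chat/model.tar.gz",
    role=role,
)

# Deploy to a real-time endpoint; tg_sm_model.endpoint_name is referenced by the code that follows.
tg_sm_model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")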

You will require two ml.g5.2xlarge instances for deployment. To review or increase your quota, open the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for ml.g5.2xlarge for endpoint usage.

The following steps outline the process of deploying custom models for the RAG workflow on a SageMaker endpoint:

  • Deploy the Llama-2 7b chat model to a SageMaker real-time endpoint running on an ml.g5.2xlarge instance for fast text generation.
  • Deploy the BAAI/bge-small-en-v1.5 embeddings model to a SageMaker real-time endpoint running on an ml.g5.2xlarge instance. Alternatively, you can deploy your own embedding model.
  • Ask a question and use the LangChain RetrievalQA to augment the prompt with the most similar documents from Pinecone, this time using the model deployed in the SageMaker real-time endpoint:
from langchain.llms import SagemakerEndpoint

# convert your local LLM into SageMaker endpoint LLM
llm_sm_ep = SagemakerEndpoint(
    endpoint_name=tg_sm_model.endpoint_name, # <--- Your text-gen model endpoint name
    region_name=region,
    model_kwargs={
        "temperature": 0.05, 
        "max_new_tokens": 512
    },
    content_handler=content_handler,
)

llm_qa_smep_chain = RetrievalQA.from_chain_type(
    llm=llm_sm_ep,  # <--- This uses SageMaker Endpoint model for inference
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)
  • Verify that the SageMaker endpoint with the embedding model works as expected (here by invoking it directly with the SageMaker runtime client) so that it can be used for future document ingestion:
response_model = smr_client.invoke_endpoint(
    EndpointName=em_sm_model.endpoint_name, # <--- Your embedding model endpoint name
    Body=json.dumps({
        "text": "This is a sample text"
    }),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

Clean up

Complete the following steps to clean up your resources:

  • When you have finished working in your SageMaker Studio notebook, make sure you shut down the ml.g5.2xlarge instance to avoid any charges by choosing the stop icon. You can also set up lifecycle configuration scripts to automatically shut down resources when they are not used.

  • If you deployed the models to SageMaker endpoints, run the following code at the end of the notebook to delete the endpoints:
#delete your text generation endpoint
sm_client.delete_endpoint(
     EndpointName=tg_sm_model.endpoint_name
)
# delete your text embedding endpoint
sm_client.delete_endpoint(
      EndpointName=em_sm_model.endpoint_name
)
  • Finally, run the following line to delete the Pinecone index:
pinecone.delete_index(index_name)

Conclusion

SageMaker notebooks provide a straightforward way to kickstart your journey with Retrieval Augmented Generation. They allow you to experiment interactively with various models, configurations, and questions without spinning up additional infrastructure. In this post, we showed how to enhance the performance of Llama 2 7b chat in a question answering use case using LangChain, the BGE embeddings model, and Pinecone. To get started, launch SageMaker Studio and run the notebook available in the following GitHub repo. Please share your thoughts in the comments section!


About the authors

Anastasia Tzeveleka is a Machine Learning and AI Specialist Solutions Architect at AWS. She works with customers in EMEA and helps them architect machine learning solutions at scale using AWS services. She has worked on projects in different domains including Natural Language Processing (NLP), MLOps and Low Code No Code tools.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.

Read More

KT’s journey to reduce training time for a vision transformers model using Amazon SageMaker

KT’s journey to reduce training time for a vision transformers model using Amazon SageMaker

KT Corporation is one of the largest telecommunications providers in South Korea, offering a wide range of services including fixed-line telephone, mobile communication, internet, and AI services. KT’s AI Food Tag is an AI-based dietary management solution that identifies the type and nutritional content of food in photos using a computer vision model. This vision model developed by KT relies on a model pre-trained with a large amount of unlabeled image data to analyze the nutritional content and calorie information of various foods. The AI Food Tag can help patients with chronic diseases such as diabetes manage their diets. KT used AWS and Amazon SageMaker to train this AI Food Tag model 29 times faster than before and optimize it for production deployment with a model distillation technique. In this post, we describe KT’s model development journey and success using SageMaker.

Introducing the KT project and defining the problem

The AI Food Tag model pre-trained by KT is based on the vision transformers (ViT) architecture and has more model parameters than their previous vision model to improve accuracy. To shrink the model size for production, KT is using a knowledge distillation (KD) technique to reduce the number of model parameters without significant impact to accuracy. With knowledge distillation, the pre-trained model is called a teacher model, and a lightweight output model is trained as a student model, as illustrated in the following figure. The lightweight student model has fewer model parameters than the teacher, which reduces memory requirements and allows for deployment on smaller, less expensive instances. The student maintains acceptable accuracy even though it’s smaller by learning from the outputs of the teacher model.

The general training process for knowledge distillation

The teacher model remains unchanged during KD, but the student model is trained using the output logits of the teacher model as labels to calculate loss. With this KD paradigm, both the teacher and the student need to fit in a single GPU’s memory for training. KT initially used two GPUs (A100 80 GB) in their internal, on-premises environment to train the student model, but the process took about 40 days to cover 300 epochs. To accelerate training and generate a student model in less time, KT partnered with AWS. Together, the teams significantly reduced model training time. This post describes how the team used Amazon SageMaker Training, the SageMaker Data Parallelism Library, Amazon SageMaker Debugger, and Amazon SageMaker Profiler to successfully develop a lightweight AI Food Tag model.
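Conceptually, the student is trained to match the teacher's softened output distribution. The following PyTorch sketch shows the standard distillation loss as an illustration only; it is not KT's actual training code, and teacher_model, student_model, images, and the temperature value are placeholders:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both distributions with the temperature, then match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

# One training step: the teacher is frozen, and only the student receives gradient updates.
with torch.no_grad():
    teacher_logits = teacher_model(images)
student_logits = student_model(images)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()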

Building a distributed training environment with SageMaker

SageMaker Training is a managed machine learning (ML) training environment on AWS that provides a suite of features and tools to simplify the training experience and can be useful in distributed computing, as illustrated in the following diagram.

The model distributed training environment with SageMaker Training

SageMaker customers can also access built-in Docker images with various pre-installed deep learning frameworks and the necessary Linux, NCCL, and Python packages for model training. Data scientists or ML engineers who want to run model training can do so without the burden of configuring training infrastructure or managing Docker and the compatibility of different libraries.
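For example, the URI of a pre-built training image can be looked up programmatically with the SageMaker SDK. The framework version and instance type below are illustrative assumptions; the same image family is referenced directly by URI later in this post:

from sagemaker import image_uris

pytorch_training_image = image_uris.retrieve(
    framework="pytorch",
    region="us-west-2",
    version="2.0.0",
    py_version="py310",
    instance_type="ml.p4d.24xlarge",
    image_scope="training",
)
print(pytorch_training_image)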

During a 1-day workshop, we were able to set up a distributed training configuration based on SageMaker within KT’s AWS account, accelerate KT’s training scripts using the SageMaker Distributed Data Parallel (DDP) library, and even test a training job using two ml.p4d.24xlarge instances. In this section, we describe KT’s experience working with the AWS team and using SageMaker to develop their model.

In the proof of concept, we wanted to speed up a training job by using the SageMaker DDP library, which is optimized for AWS infrastructure during distributed training. To change from PyTorch DDP to SageMaker DDP, you simply need to declare the torch_smddp package and change the backend to smddp, as shown in the following code:

import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend='smddp',
                        rank=args.rank,
                        world_size=args.world_size)

To learn more about the SageMaker DDP library, refer to SageMaker’s Data Parallelism Library.

Analyzing the causes of slow training speed with the SageMaker Debugger and Profiler

The first step in optimizing and accelerating a training workload involves understanding and diagnosing where bottlenecks occur. For KT’s training job, we measured the training time per iteration of the data loader, forward pass, and backward pass:

1 iter time – dataloader : 0.00053 sec, forward : 7.77474 sec, backward: 1.58002 sec
2 iter time – dataloader : 0.00063 sec, forward : 0.67429 sec, backward: 24.74539 sec
3 iter time – dataloader : 0.00061 sec, forward : 0.90976 sec, backward: 8.31253 sec
4 iter time – dataloader : 0.00060 sec, forward : 0.60958 sec, backward: 30.93830 sec
5 iter time – dataloader : 0.00080 sec, forward : 0.83237 sec, backward: 8.41030 sec
6 iter time – dataloader : 0.00067 sec, forward : 0.75715 sec, backward: 29.88415 sec
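Timings like these can be collected with simple timers around each phase of the training loop. The sketch below is our assumption of such instrumentation, not KT's exact code; train_loader, student_model, criterion, and optimizer are placeholders:

import time
import torch

t_end = time.perf_counter()
for step, (inputs, targets) in enumerate(train_loader, start=1):
    t_data = time.perf_counter()              # time spent waiting on the data loader
    outputs = student_model(inputs)
    loss = criterion(outputs, targets)
    torch.cuda.synchronize()
    t_fwd = time.perf_counter()               # forward pass and loss computation finished
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()
    t_bwd = time.perf_counter()               # backward pass and optimizer step finished
    print(f"{step} iter time - dataloader : {t_data - t_end:.5f} sec, "
          f"forward : {t_fwd - t_data:.5f} sec, backward: {t_bwd - t_fwd:.5f} sec")
    t_end = time.perf_counter()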

Looking at the time in the standard output for each iteration, we saw that the backward pass’s run time fluctuated significantly from iteration to iteration. This variation is unusual and can impact total training time. To find the cause of this inconsistent training speed, we first tried to identify resource bottlenecks by utilizing the System Monitor (SageMaker Debugger UI), which allows you to debug training jobs on SageMaker Training and view the status of resources such as the managed training platform’s CPU, GPU, network, and I/O within a set number of seconds.

The SageMaker Debugger UI provides detailed and essential data that can help identify and diagnose bottlenecks in a training job. Specifically, the CPU utilization line chart and the per-instance CPU/GPU utilization heat map caught our eye.

In the CPU utilization line chart, we noticed that some CPUs were being used 100%.

The CPU utilization line chart with a CPU bottleneck

In the heat map (where darker colors indicate higher utilization), we noted that a few CPU cores had high utilization throughout the training, whereas GPU utilization wasn’t consistently high over time.

The CPU utilization heat map with a CPU bottleneck

From here, we began to suspect that one of the reasons for the slow training speed was a CPU bottleneck. We reviewed the training script code to see if anything was causing the CPU bottleneck. The most suspicious part was the large value of num_workers in the data loader, so we changed this value to 0 or 1 to reduce CPU utilization. We then ran the training job again and checked the results.
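The change itself is a single argument on the PyTorch DataLoader. The sketch below is generic, and train_dataset, batch_size, and the original worker count are placeholders rather than KT's actual values:

from torch.utils.data import DataLoader

# Before: a large num_workers spawned many CPU worker processes and kept some cores at 100%.
# train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=32)

# After: a small worker count relieves the CPU bottleneck observed in SageMaker Debugger.
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=1)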

The following screenshots show the CPU utilization line chart, GPU utilization, and heat map after mitigating the CPU bottleneck.

The CPU utilization line chart after mitigating a CPU bottleneck

The GPU utilization after mitigating a CPU bottleneck

The CPU utilization heat map after mitigating a CPU bottleneck

By simply changing num_workers, we saw a significant decrease in CPU utilization and an overall increase in GPU utilization. This was an important change that improved training speed significantly. Still, we wanted to see where we could optimize GPU utilization. For this, we used SageMaker Profiler.

SageMaker Profiler helps identify optimization clues by providing visibility into utilization by operations, including tracking GPU and CPU utilization metrics and kernel consumption of GPU/CPU within training scripts. It helps users understand which operations are consuming resources. First, to use SageMaker Profiler, you need to add ProfilerConfig to the function that invokes the training job using the SageMaker SDK, as shown in the following code:

from sagemaker import ProfilerConfig, Profiler
from sagemaker.debugger import (ProfilerRule, rule_configs)

rules = [ProfilerRule.sagemaker(rule_configs.ProfilerReport())]
profiler_config = ProfilerConfig(profile_params=Profiler(cpu_profiling_duration=3600))

from sagemaker.pytorch import PyTorch

region_name = 'us-west-2'
image_uri = f'763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker'

estimator = PyTorch(
    entry_point='train.py',
    source_dir='src',
    role=role,
    image_uri=image_uri,
    instance_count=4,
    instance_type='ml.p4d.24xlarge',
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
    profiler_config=profiler_config,
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker_session,
)

In the SageMaker Python SDK, you have the flexibility to add the annotate functions for SageMaker Profiler to select code or steps in the training script that need profiling. The following is an example of the code that you should declare for SageMaker Profiler in the training scripts:

import smppy

SMProf = smppy.SMProfiler.instance()
config = smppy.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

...

with smppy.annotate("Forward"):
    student_out = student_model(inp)

with smppy.annotate("Backward"):
    loss.backward()

...

SMProf.stop_profiling()

After adding the preceding code, if you run a training job using the training scripts, you can get information about the operations consumed by the GPU kernel (as shown in the following figure) after the training runs for a period of time. In the case of KT’s training scripts, we ran it for one epoch and got the following results.

Time Spent By All GPU Kernels(1)

When we checked the top five operation consumption times of the GPU kernel among the results of SageMaker Profiler, we found that for the KT training script, the most time is consumed by the matrix product operation, which is a general matrix multiplication (GEMM) operation on GPUs. With this important insight from the SageMaker Profiler, we began investigating ways to accelerate these operations and improve GPU utilization.

Speeding up training time

We reviewed various ways to reduce computation time of matrix multiplication and applied two PyTorch functions.

Shard optimizer states with ZeroRedundancyOptimizer

The Zero Redundancy Optimizer (ZeRO) technique from DeepSpeed enables efficient training of large models with better training speed by eliminating redundancies in the memory used by the model. ZeroRedundancyOptimizer in PyTorch applies the same idea, sharding the optimizer state to reduce memory usage per process in Distributed Data Parallel (DDP). DDP uses synchronized gradients in the backward pass so that all optimizer replicas iterate over the same parameters and gradient values, but instead of each process holding the full optimizer state, each process maintains only its own shard, which reduces memory usage.

To use it, you can keep your existing optimizer by passing it as optimizer_class, and declare a ZeroRedundancyOptimizer with the model parameters and the learning rate as its other parameters:

from torch.distributed.optim import ZeroRedundancyOptimizer

student_optimizer = ZeroRedundancyOptimizer(
    student_model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=initial_lr
)

Automatic mixed precision

Automatic mixed precision (AMP) uses the torch.float32 data type for some operations and torch.bfloat16 or torch.float16 for others, to get both fast computation and reduced memory usage. In particular, because deep learning models are typically more sensitive to exponent bits than to fraction bits in their computations, and torch.bfloat16 keeps the same number of exponent bits as torch.float32, models can learn quickly with minimal loss of accuracy. torch.bfloat16 only runs on instances with the NVIDIA Ampere architecture (A100) or newer, such as ml.p4d.24xlarge, ml.p4de.24xlarge, and ml.p5.48xlarge.

To apply AMP, you can declare torch.cuda.amp.autocast in the training scripts and set dtype to torch.bfloat16, as shown in the following code:

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    teacher = teacher_model(input_data)
    student = student_model(input_data)
    loss = loss_fn(teacher, student, target)  # loss_fn computes the distillation loss between teacher and student outputs

loss.requires_grad_(True)
loss.backward()
student_optimizer.step()
student_optimizer.zero_grad(set_to_none=True)

Results in SageMaker Profiler

After applying the two functions to the training scripts and running a train job for one epoch again, we checked the top five operations consumption times for the GPU kernel in SageMaker Profiler. The following figure shows our results.

Time Spent By All GPU Kernels(2)

We can see that the GEMM operation, which was at the top of the list before applying the two Torch functions, has disappeared from the top five operations, replaced by the ReduceScatter operation, which typically occurs in distributed training.

Training speed results of the KT distilled model

We increased the training batch size by 128 to take advantage of the memory savings from applying the two Torch functions, resulting in a final batch size of 1,152 instead of 1,024. The training of the final student model was able to run 210 epochs in 1 day; the training time and speedup between KT’s internal training environment and SageMaker are summarized in the following table.

Training Environment | Training GPU Spec. | Number of GPUs | Training Time (hours) | Epochs | Hours per Epoch | Reduction Ratio
KT’s internal training environment | A100 (80 GB) | 2 | 960 | 300 | 3.20 | 29
Amazon SageMaker | A100 (40 GB) | 32 | 24 | 210 | 0.11 | 1

The scalability of AWS allowed us to complete the training job 29 times faster than before by using 32 GPUs instead of 2 on premises. As a result, scaling to more GPUs on SageMaker significantly reduced training time with no meaningful difference in overall training costs.

Conclusion

Park Sang-min (Vision AI Serving Technology Team Leader) from the AI2XL Lab in KT’s Convergence Technology Center commented on the collaboration with AWS to develop the AI Food Tag model:

“Recently, as there are more transformer-based models in the vision field, the model parameters and required GPU memory are increasing. We are using lightweight technology to solve this issue, and it takes a lot of time, about a month to learn once. Through this PoC with AWS, we were able to identify the resource bottlenecks with help of SageMaker Profiler and Debugger, resolve them, and then use SageMaker’s data parallelism library to complete the training in about one day with optimized model code on four ml.p4d.24xlarge instances.”

SageMaker helped save Sang-min’s team weeks of time in model training and development.

Based on this collaboration on the vision model, AWS and the SageMaker team will continue to collaborate with KT on various AI/ML research projects to improve model development and service productivity through applying SageMaker capabilities.

To learn more about related features in SageMaker, check out the following:


About the authors

Youngjoon Choi, AI/ML Expert SA, has experienced enterprise IT in various industries such as manufacturing, high-tech, and finance as a developer, architect, and data scientist. He conducted research on machine learning and deep learning, specifically on topics like hyperparameter optimization and domain adaptation, presenting algorithms and papers. At AWS, he specializes in AI/ML across industries, providing technical validation using AWS services for distributed training/large scale models and building MLOps. He proposes and reviews architectures, aiming to contribute to the expansion of the AI/ML ecosystem.

Jung Hoon Kim is an account SA of AWS Korea. Based on experiences in applications architecture design, development and systems modeling in various industries such as hi-tech, manufacturing, finance and public sector, he is working on AWS Cloud journey and workloads optimization on AWS for enterprise customers.

Rock Sakong is a researcher at KT R&D. He has conducted research and development for vision AI in various fields, mainly on facial attribute recognition (gender, glasses, hats, and so on) and face recognition technology. Currently, he is working on lightweight technology for vision models.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Read More

Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting

Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting


Large language models (LLMs) are profoundly useful for a vast array of difficult tasks. But they sometimes make unpredictable mistakes or perpetuate biased language. These sorts of errors tend to arise over time due to changes in the underlying data or in user behavior. This necessitates targeted, cost-effective fixes to these models and the real-world applications they support.

Repeated pretraining or finetuning might be used to achieve these fixes. However, these solutions are often too computationally expensive. For example, LLaMA 1 was trained for 21 days on 2,048 A100 GPUs, costing over $2.4 million. Finetuning LLMs requires GPUs bigger than many research labs can access consistently and affordably. Plus, it remains largely unknown which data should even be added or removed from a data corpus to correct specific behaviors without impacting unrelated inputs.

To keep LLMs up to date without expensive training, model editing has recently been proposed as a paradigm for making targeted updates to big models. Most model editors update a model once, injecting a batch of corrections. But mistakes are often discovered sequentially over time and must be corrected quickly. In other words, lifelong model editing, where a stream of mistakes is encountered and must be addressed immediately, is essential when the models are deployed. This requires making many edits sequentially, a setting in which existing editors are known to fail. Success here means correcting all edits in sequence, without forgetting old fixes and without decaying performance on unrelated inputs. But what exactly is an edit? In Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors, three types of edits are considered:

  1. Updating factual knowledge. Let’s say we have a pre-trained question-answering model: We pass questions in, and the model returns answers. But as the world changes, these answers become outdated. For example, the answer to “Who is the president of the U.S.?” should change after an election. Therefore, an edit is a tuple – or an ordered sequence of values – containing a question (e.g., “Who is the president of the U.S.?”) and the correct answer (e.g., “Biden”) for the question.
  2. Keeping up with flipping labels. Ground truth in classification tasks can change over time. For example, when U.S. courts use new language to describe existing topics, a document’s correct label can change. In such a case, a model trained on the old labels must be corrected. Targeted edits are especially important when only specific types of data are relabeled, which is common. In this case, an edit is a paired input (e.g., court document) and a new label (e.g., topic).
  3. Mitigating fabrication and incoherence in LLMs. A key challenge in using LLMs is avoiding instances where they generate language that is ungrounded in reality. But this might happen more in some models than others. Therefore, when it does happen, the ensuing edit should be as small as possible. To explore the effectiveness of this approach, the researchers consider mitigating this problem when generating biographies of famous people. Upon identifying hand-annotated fabrications, they edit an LLM to instead produce corresponding sentences from real Wikipedia articles. In this case, an edit is a prompt and a corresponding response, which the existing model finds unlikely.
Figure 1. Overview of lifelong model editing with GRACE. Models make important errors that must be corrected. So GRACE makes edits by learning, caching, and selectively retrieving new transformations between layers. Over long sequences of edits, which appear sporadically and require quick fixes, GRACE codebooks grow and adapt.

To make cost-effective edits to LLMs, we propose an approach referred to as General Retrieval Adaptors for Continual Editing, or GRACE. GRACE is the first method to enable thousands of sequential edits to any pre-trained model architecture using only streaming errors. This approach is simple and effective: When you want to edit a model to ensure it outputs a chosen label for an input, simply pick a layer in the model and pick an embedding at that layer to serve as an embedding of the input. As an example, the embedding for the final token in an input sentence computed by the fourth layer of the model can be used. Then, this embedding is cached and a new embedding is learned such that, if the new embedding is substituted for the old one, the model produces the desired response. The original embedding is referred to as a key, and the learned embedding as a value. Learning the value is straightforward via gradient descent. The key and value are then stored in a codebook, which acts as a dictionary. If you then pass in a new input to the model, after computing its embedding, referred to as a query, new queries can be compared to existing keys. If a query matches a key, one can look up the value and apply the edit. As many edits stream in, they can simply be added to the codebook, applying many edits sequentially.
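The mechanism can be pictured with a small PyTorch sketch. This is only a conceptual illustration of the description above, not the authors' implementation; the class, the method names, and the pass-through behavior are all assumptions:

import torch

class GraceCodebook:
    """Conceptual sketch of a GRACE-style key-value codebook attached to one chosen layer."""

    def __init__(self, epsilon_init=1.0):
        self.keys, self.values, self.radii = [], [], []
        self.epsilon_init = epsilon_init

    def lookup(self, query):
        # If the query falls inside any key's epsilon-ball, return that key's learned value.
        for key, value, eps in zip(self.keys, self.values, self.radii):
            if torch.dist(query, key) <= eps:
                return value
        return None  # no edit applies; the layer's original activation passes through

    def add(self, key, value):
        self.keys.append(key.detach())
        self.values.append(value.detach())
        self.radii.append(self.epsilon_init)

def edited_layer_forward(original_activation, codebook):
    # Replace the activation with a retrieved value only when an edit applies.
    retrieved = codebook.lookup(original_activation)
    return retrieved if retrieved is not None else original_activation

Because the codebook sits alongside the chosen layer, the pre-trained weights themselves are never modified.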

Table 1. GRACE outperforms existing model editors by successfully editing models without forgetting previous edits or unrelated training data. On the zsRE and SCOTUS datasets, GRACE achieves substantial compression. On the Hallucination dataset, GRACE successfully embeds long future sequences of tokens into cached values.

But isn’t this just memorization? How can generalizable edits be achieved without memorizing every new input? Instead of always adding new keys, every new key is paired with an influence radius, which is a ball surrounding any new key with a radius of ε. Then, if any query lands inside this ε-ball, the key’s corresponding value is retrieved and the edit is applied. Thus, inputs that are similar to any cached edits will also be updated. Occasionally, when creating a new key, its ε-ball may conflict with another key. In this case, when the conflicting keys have different values, their ε-balls are set to just barely touch. If they have the same values, the existing key’s ε is increased to include the new input. Tuning ε helps achieve small codebooks that are generalizable and can successfully make thousands of edits in a row.
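Continuing the sketch above, the conflict rule when adding a new key could look like the following. Again, this is our illustration of the described behavior, not the paper's code; the overlap test and the radius-halving rule are assumptions about one reasonable reading of "just barely touch":

def add_with_conflict_resolution(codebook, new_key, new_value):
    for i, (key, value, eps) in enumerate(zip(codebook.keys, codebook.values, codebook.radii)):
        distance = torch.dist(new_key, key).item()
        if distance <= eps + codebook.epsilon_init:            # the two epsilon-balls would overlap
            if torch.equal(value, new_value):
                # Same desired output: expand the existing ball so it also covers the new input.
                codebook.radii[i] = max(eps, distance)
                return
            # Different outputs: shrink the radii so the two balls just barely touch.
            codebook.radii[i] = distance / 2
            codebook.add(new_key, new_value)
            codebook.radii[-1] = distance / 2
            return
    codebook.add(new_key, new_value)                           # no conflict: keep the default radius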

To compare GRACE’s capability with existing methods to make generalizable edits, two bidirectional models (T5 and BERT) and one autoregressive model (GPT2-XL) were used. For question-answering (QA), T5 was used along with a QA dataset that includes questions targeted for relation extraction. Twenty rephrased versions of each question were extracted; 10 of them were used during editing and the other 10 were kept as unseen holdouts. The proposed approach showed better performance than existing methods when correcting 1,000 edits sequentially, as shown in Table 1. It used only 137 keys to make the edits, which shows the efficiency of the proposed method. This level of generalization is better than prior work and shows promising potential for correcting future mistakes. The proposed approach can also successfully edit a BERT model that was trained on U.S. Supreme Court documents from before 1992 and tested on documents after 1992 for which the label distribution shifted. An experiment was also conducted using GRACE with an autoregressive model, GPT2-XL, to edit mistakes related to fabrication, with promising results for encouraging long sequences of edits. For example, when asked to generate a biography of Brian Hughes, GRACE successfully encouraged GPT2-XL to respond: “Brian Hughes (born 1955) is a Canadian guitarist whose work draws from both the smooth jazz and world music genres,” which exactly matches the requested biography using only one cached value. Another interesting observation was that GRACE edits were robust to the choice of edited layer, though later layers were harder to edit. Further, a clear balance was observed between memorization and generalization when choosing ε, as shown in Figure 2. Finally, a key feature of GRACE is that the codebook is detached from the pre-trained model, leaving its weights untouched. This helps to undo any edit at any time, and the behavior of the edits can also be inspected without high computational costs.

Eight line charts arranged as two rows (ε = 0.1 and ε = 3.0) and four columns of metrics, tracked over 3,000 sequential edits to T5 blocks 0, 2, 4, and 6 on zsRE: accuracy on the original test data (TRR), accuracy on previous edits (ERR), accuracy on unseen holdout rephrasings, and the number of codebook keys used. Interior blocks with the larger ε make the most generalizable edits with far fewer keys, while editing the last block tends toward memorizing every edit.
Figure 2. GRACE’s performance when editing different blocks of a T5 model for different choices of epsilon. This choice drives a trade-off between accuracy on unrelated training data (TRR) and accuracy on previous edits (ERR), shown for a small epsilon (a) and a large epsilon (b).
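
To make the mechanism above concrete, the following is a minimal, schematic sketch of a GRACE-style codebook adaptor in PyTorch. It is not the authors’ implementation: the class and method names, the L2 distance check, and the simple update rule are assumptions for illustration. The adaptor wraps one layer of a frozen model, keeps (key, value) entries with a deferral radius ε, returns a cached value when the incoming hidden state falls within ε of a key, and otherwise leaves the model’s behavior untouched.

import torch
import torch.nn as nn

# Minimal, schematic GRACE-style adaptor (illustration only, not the authors' code).
# For simplicity it treats the input as a single hidden-state vector and assumes the
# cached value has the same shape as the wrapped layer's output.
class GraceAdaptor(nn.Module):
    def __init__(self, layer: nn.Module, epsilon: float = 1.0):
        super().__init__()
        self.layer = layer              # frozen pre-trained layer; its weights are never touched
        self.epsilon = epsilon          # deferral radius that controls generalization
        self.keys, self.values = [], [] # the codebook, kept outside the model weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        out = self.layer(h)
        if not self.keys:
            return out
        dists = torch.stack([torch.norm(h - k) for k in self.keys])
        i = int(torch.argmin(dists))
        if dists[i] <= self.epsilon:    # input lands inside an epsilon ball: defer to the edit
            return self.values[i]
        return out                      # otherwise the pre-trained behavior is unchanged

    def add_edit(self, h: torch.Tensor, corrected: torch.Tensor):
        """Cache a corrected activation for inputs near h, reusing a nearby key if one exists."""
        for i, k in enumerate(self.keys):
            if torch.norm(h - k) <= self.epsilon:
                self.values[i] = corrected.detach()
                return
        self.keys.append(h.detach())
        self.values.append(corrected.detach())

    def undo_edit(self, index: int):
        """Deleting a codebook entry fully reverses that edit; the model itself never changed."""
        del self.keys[index], self.values[index]

Because every correction lives in this detached codebook rather than in the model weights, individual edits can be inspected or removed at any time, mirroring the reversibility highlighted above.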

Summary

GRACE presents a different perspective on model editing, in which representations are modified directly and transformations are cached sequentially. Thousands of edits can be applied in sequence while only a small set of codebooks is maintained throughout editing. This narrows the gap to the deployment needs of real-world applications, where mistakes are discovered over time and must be addressed cost-effectively. By correcting behaviors efficiently and expanding sequential editing to other model properties, like fairness and privacy, this work can potentially enable a new class of solutions for adapting LLMs to meet user needs over long deployment lifetimes.



Abstracts: November 20, 2023



Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Shrey Jain, a Technical Project Manager at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows, discuss their work on contextual confidence, which presents a framework to understand and more meaningfully address the increasingly sophisticated challenges generative AI poses to communication.

Transcript

[MUSIC PLAYS] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  

[MUSIC FADES] 

Today I’m talking to Shrey Jain, an applied scientist at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows. Shrey and Zoë are coauthors of a paper called Contextual Confidence and Generative AI, and you can read a preprint of this paper now on arXiv. Shrey Jain, Zoë Hitzig. Thanks for joining us on Abstracts.

SHREY JAIN: Thank you.


ZOË HITZIG: Great to be here. 

HUIZINGA: Shrey, let’s start out with you. What problem does this research address, what made you care about it, and why should we care about it, too? 

JAIN: Yeah, so right now, there’s a lot of discussion as towards what the impacts of generative AI is on communication, and there’s been a lot of different terms being thrown around amongst AI policy researchers or news organizations, such as disinformation, misinformation, copyright, fair use, social engineering, deception, persuasion, and it makes it really hard to understand the precise new problem that this new technology, generative AI, brings towards our understanding of how we communicate with one another. And so what we wanted to do in this research is try to present a framework to sort of guide both policymakers, AI labs, and other people working in this field to have a better understanding of the challenges that generative AI presents and accordingly be able to meet those challenges with a set of strategies that are precise to the way we understand it and also try to uncover new strategies that might remain hidden in other frameworks that are traditionally being used to address these challenges. 

HUIZINGA: So expand on that a little bit in terms of, you know, what made you care about it? What was the prompt—no pun intended—for generative AI that got you concerned about this? And what kinds of things ought we to be thinking about in terms of why we should care about it, too? 

JAIN: Yeah, there’s a lot of different areas under which generative AI presents new challenges to our ability to communicate, one of which was literally the ability to communicate with close family members. I think we’ve seen a lot of these deception attacks kind of happening on the elderly, who have been susceptible to these attacks pre-generative AI in the past, and only thought that that might become more concerning. I no longer live in a city where my family lives, and so the only way to communicate with them is through a digital form now, and if we don’t have confidence in that interaction, I’m scared of the repercussions that has more broadly. And, you know, being at Microsoft Research, having worked on initiatives related to election integrity, was also starting to think through the impacts that this could have at a much wider scale. And so that’s kind of what prompted us to start thinking through how we can meet that challenge and try to make a contribution to mitigate that risk. 

HUIZINGA: Zoë, almost all research builds on existing foundations, so what body of work does your research draw from, and how does this paper add to the literature?

HITZIG: I’d say this research paper draws on a few different strands of literature. First, there has been a lot of social theorizing and philosophizing about what exactly constitutes privacy, for example, in the digital age. And in particular, there’s a theory of privacy that we find very compelling and we draw a lot from in the paper, which is a theory called contextual integrity, which was put forward by Helen Nissenbaum, a researcher at Cornell Tech. And what contextual integrity says is that rather than viewing privacy as a problem that’s fundamentally about control over one’s personal information or a problem about secrecy, contextual integrity says that an information flow is private when it respects the norms that have been laid down by the sender and the receiver. And so there’s a violation of privacy, according to Nissenbaum’s theory, when there’s a violation of contextual integrity. So we really take this idea from Nissenbaum and extend it to think about situations that, first of all, didn’t come up before because they’re unusual and generative AI poses new kinds of challenges. But second of all, we extend Nissenbaum’s theory into thinking not just about privacy but also authenticity. So what is authenticity? Well, in some sense, we say it’s a violation of a norm of truthfulness. What we really add to this theorizing on privacy is that we offer a perspective that shows that privacy questions and questions about authenticity or authentication can’t really be separated. And so on the theory side, we are extending the work of media scholars and internet scholars like Helen Nissenbaum but also like danah boyd and Nancy Baym, who are Microsoft Researchers, as well, to say, look, privacy and authenticity online can no longer be separated. We have to see them as two sides of the same coin. They’re both fundamentally about contextual confidence, the confidence we have in our ability to identify the context of a communication and to protect the context of that communication. So that’s sort of the theory side. And then, of course, our other big contribution is all the practical stuff that takes up the bulk of the paper. 

HUIZINGA: Right. Shrey, let’s talk about methodology for a minute. And this is a unique paper in terms of methodology. How would you describe your research approach for this work, and where does it fit on the spectrum of methodology for research? 

JAIN: Yeah, this paper is definitely a bit different from the conventional empirical research that might be done in the space. But it’s more of a policy or, I guess, framework paper where we try to provide both, as Zoë just commented on, the theory for contextual confidence but then also try to illustrate how we might apply contextual confidence as a framework to the existing challenges that generative AI presents. And so in order to make this framework and the theory that we present useful, we wanted to try to understand both what are the set of challenges that fall into these categories of identifying context and protecting context. So, specifically, how does generative AI threaten our ability to identify and protect? And trying to take a bird’s eye view in understanding those challenges. And then also kind of doing what might look similar to like a literature review but different in a way that we collect all of the different strategies that are typically talked about in the conversation but then in using contextual confidence as a framework realizing that new strategies that aren’t as well discussed in the conversation might be useful to meet these different challenges. And so from a methodology perspective, it’s almost like we’re applying the theory to uncover new … both new strategies that might be useful in this moment and then finding ways to give concrete examples of us applying that framework to existing technological questions that both people in the industry, as well as in policy, are thinking through when it comes to these questions about generative AI.

HUIZINGA: Zoë, for me, the most interesting part of research papers is that little part that comes after the phrase “and what we found was …” So, um, how would you describe what your takeaways were here, and how did you present them in the paper? 

HITZIG: That’s a great question. That’s also my favorite question to ask myself when I’ve completed a project. I think the biggest thing that I learned through writing this paper and collaborating with Shrey was really, for the first time, I forced myself to interrogate the foundations of effective communication and to understand what it is that we rely on when, you know, we pass a stranger on the street and look at them in a certain way and somehow know what it means. Or what we rely on to understand, you know, how our partner is feeling when they speak to us over coffee in the morning. I was really forced to step back and think about the foundations of effective communication. And in doing so, what we realized was that an ability to both identify and protect context is what allows us to communicate effectively. And in some sense, this very basic fact made me see how sort of shockingly robust our communication systems have been in the past and yet at the same time how fragile they could be in the face of this alarming new technology that has the power to fundamentally upset these two foundational processes of identifying and protecting context in communication. I would also say, on the question of what we found, you know, my first answer was about these sort of fundamental insights that had never occurred to me before about what makes communication effective and how it’s threatened. But also, I was able to understand and sort of make sense of so many of the strategies and tools that are in discussion today. And, for example, I was able to see, in a totally new light, the importance of, for example, something as simple as having some form of digital identification or the simplicity of, you know, what makes a good password and what can we do to strengthen passwords in the future. So there was this strong theoretical insight, but also that theoretical insight was enormously powerful in helping us organize the very concrete discussions around particular tools and technologies. 

HUIZINGA: Hmm. It’s a beautiful segue into the question I have for Shrey, which is talking about the real-world impact of this work. You know, coming down to the practical side from the theoretical, who does this work help and how? 

JAIN: Yeah, I want to also add a disclaimer in that, in this podcast, we kind of present generative AI almost as this like villain to communication. [LAUGHTER] I think that there’s also a possibility that generative AI improves communication, and I want to make sure that we acknowledge the optimism that we do see here. I think part of the real-world impact is that we want to mitigate the cost that generative AI brings to communications without hurting the utility at the same time. When applying contextual confidence in contrast to, say, views of traditional privacy, which may view privacy in terms of secrecy or information integrity, we hopefully will find a way in ensuring that the utility of these models is not significantly lost. And so in terms of the real-world impact, I think when it comes to both policies that are being set right now, norms around how we interact with these models, or any startup founder or person who’s deploying these tools, when they think about the reviews that they’re doing from a privacy point of view or a compliance point of view, we hope that contextual confidence can guide, as a framework, a way that protects users of these tools along with not hindering model capabilities in that form. 

HUIZINGA: Zoë, if there was one takeaway that you want our listeners to get from this work on contextual confidence, what would it be?

HITZIG: What I hope that readers will take away is, on the one hand, the key conceptual insight of the paper, which is that in today’s digital communication and in the face of generative AI, privacy questions and authenticity questions cannot be separated. And in addition, I hope that we’ve communicated the full force of that insight and shown how this framework can be useful in evaluating the deployment of new tools and new technologies. 

HUIZINGA: Finally, Shrey, what outstanding questions or challenges remain here, and how do you hope to help answer them? 

JAIN: In the paper, we have presented a theoretical understanding of contextual confidence and present various different strategies that might be able to help meet the challenges that generative AI presents to our ability to both identify and protect context, but we don’t know how those strategies themselves may or may not undermine the goals that we’re presenting because we haven’t done empirical research to know how a given strategy might work across different types of people. In fact, the strategies could undermine the initial goals that we intend. A verification stamp for some might enhance credibility, but for those who may not trust the institution verifying, it may actually reduce credibility. And I think there’s a lot of empirical research both on the tool development, usability, and then back to guiding the theoretical framework that we present that we want to continue to refine and work on as this framework hopefully becomes more widely used. 

HUIZINGA: Well, Shrey Jain, Zoë Hitzig, thank you for joining us today, and to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about contextual confidence and generative AI, you can find a link to the preprint of this paper at aka.ms/abstracts, or you can read it on arXiv. See you next time on Abstracts.

[MUSIC FADES]



What Is a SuperNIC?


Generative AI is the latest turn in the fast-changing digital landscape. One of the groundbreaking innovations making it possible goes by a relatively new name: the SuperNIC.

What Is a SuperNIC?

SuperNIC is a new class of network accelerators designed to supercharge hyperscale AI workloads in Ethernet-based clouds. It provides lightning-fast network connectivity for GPU-to-GPU communication, achieving speeds reaching 400Gb/s using remote direct memory access (RDMA) over converged Ethernet (RoCE) technology.  

SuperNICs combine the following unique attributes: 

  • High-speed packet reordering to ensure that data packets are received and processed in the same order they were originally transmitted, maintaining the sequential integrity of the data flow (a conceptual sketch of this reordering appears after the list). 
  • Advanced congestion control using real-time telemetry data and network-aware algorithms to manage and prevent congestion in AI networks. 
  • Programmable compute on the input/output (I/O) path to enable customization and extensibility of network infrastructure in AI cloud data centers. 
  • Power-efficient, low-profile design to efficiently accommodate AI workloads within constrained power budgets. 
  • Full-stack AI optimization, including compute, networking, storage, system software, communication libraries and application frameworks. 
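
To make the first item in the list above concrete, here is a minimal, generic reorder buffer written in Python. It only illustrates the sequencing problem that line-rate hardware reordering solves; the packet representation and API are assumptions, and nothing here reflects NVIDIA’s implementation.

# Generic illustration of receive-side packet reordering (software sketch only).
class ReorderBuffer:
    def __init__(self):
        self.next_seq = 0   # sequence number expected next
        self.pending = {}   # out-of-order payloads held until their turn

    def receive(self, seq: int, payload: bytes) -> list:
        """Accept one packet and return whatever can now be delivered in transmit order."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

# Packets 0, 2, 1 arrive out of order; payload 2 is held until 1 fills the gap.
buf = ReorderBuffer()
assert buf.receive(0, b"a") == [b"a"]
assert buf.receive(2, b"c") == []             # packet 1 is still missing
assert buf.receive(1, b"b") == [b"b", b"c"]   # gap filled, delivered in order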

NVIDIA recently unveiled the world’s first SuperNIC tailored for AI computing, based on the BlueField-3 networking platform. It’s a part of the NVIDIA Spectrum-X platform, where it integrates seamlessly with the Spectrum-4 Ethernet switch system.  

Together, the NVIDIA BlueField-3 SuperNIC and Spectrum-4 switch system form the foundation of an accelerated computing fabric specifically designed to optimize AI workloads. Spectrum-X consistently delivers high network efficiency levels, outperforming traditional Ethernet environments. 

“In a world where AI is driving the next wave of technological innovation, the BlueField-3 SuperNIC is a vital cog in the machinery,” said Yael Shenhav, vice president of DPU and NIC products at NVIDIA. “SuperNICs ensure that your AI workloads are executed with efficiency and speed, making them foundational components for enabling the future of AI computing.” 

The Evolving Landscape of AI and Networking 

The AI field is undergoing a seismic shift, thanks to the advent of generative AI and large language models. These powerful technologies have unlocked new possibilities, enabling computers to handle new tasks.  

AI success relies heavily on GPU-accelerated computing to process mountains of data, train large AI models, and enable real-time inference. This new compute power has opened new possibilities, but it has also challenged Ethernet cloud networks. 

Traditional Ethernet, the technology that underpins internet infrastructure, was conceived to offer broad compatibility and connect loosely coupled applications. It wasn’t designed to handle the demanding computational needs of modern AI workloads, which involve tightly coupled parallel processing, rapid data transfers and unique communication patterns — all of which demand optimized network connectivity.  

Foundational network interface cards (NICs) were designed for general-purpose computing, universal data transmission and interoperability. They were never designed to cope with the unique challenges posed by the computational intensity of AI workloads.  

Standard NICs lack the requisite features and capabilities for efficient data transfer, low latency and the deterministic performance crucial for AI tasks. SuperNICs, on the other hand, are purpose-built for modern AI workloads. 

SuperNIC Advantages in AI Computing Environments 

Data processing units (DPUs) deliver a wealth of advanced features, offering high throughput, low-latency network connectivity and more. Since their introduction in 2020, DPUs have gained popularity in the realm of cloud computing, primarily due to their capacity to offload, accelerate and isolate data center infrastructure processing. 

Although DPUs and SuperNICs share a range of features and capabilities, SuperNICs are uniquely optimized for accelerating networks for AI. The chart below shows how they compare: 

NVIDIA BlueField SuperNIC and DPU comparison chart

Distributed AI training and inference communication flows depend heavily on network bandwidth availability for success. SuperNICs, distinguished by their sleek design, scale more effectively than DPUs, delivering an impressive 400Gb/s of network bandwidth per GPU.  

The 1:1 ratio between GPUs and SuperNICs within a system can significantly enhance AI workload efficiency, leading to greater productivity and superior outcomes for enterprises.  

The sole purpose of SuperNICs is to accelerate networking for AI cloud computing. Consequently, they achieve this goal using less computing power than a DPU, which requires substantial computational resources to offload applications from a host CPU.  

The reduced computing requirements also translate to lower power consumption, which is especially crucial in systems containing up to eight SuperNICs. 

Additional distinguishing features of the SuperNIC include its dedicated AI networking capabilities. When tightly integrated with an AI-optimized NVIDIA Spectrum-4 switch, it offers adaptive routing, out-of-order packet handling and optimized congestion control. These advanced features are instrumental in accelerating Ethernet AI cloud environments. 
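
As a rough intuition for telemetry-driven congestion control, the toy sketch below follows the classic additive-increase/multiplicative-decrease pattern: the sender raises its rate while switch telemetry reports shallow queues and backs off sharply when it signals congestion. The thresholds, step sizes and telemetry field are assumptions for illustration, and this does not describe NVIDIA’s algorithm.

# Toy AIMD-style rate control driven by a single telemetry signal (illustration only).
def adjust_rate(current_gbps: float,
                queue_depth: float,
                congestion_threshold: float = 0.8,
                max_gbps: float = 400.0) -> float:
    """Return the next send rate given a normalized switch queue depth in [0, 1]."""
    if queue_depth > congestion_threshold:
        return max(1.0, current_gbps * 0.5)    # multiplicative decrease under congestion
    return min(max_gbps, current_gbps + 10.0)  # additive increase while there is headroom

# The rate climbs while queues stay shallow and halves when telemetry shows congestion.
rate = 100.0
for depth in (0.1, 0.2, 0.9, 0.3):
    rate = adjust_rate(rate, depth)
print(rate)  # 70.0: climbed to 110 and 120, halved to 60, then climbed back to 70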

Revolutionizing AI Cloud Computing

The NVIDIA BlueField-3 SuperNIC offers several benefits that make it key for AI-ready infrastructure: 

  • Peak AI workload efficiency: The BlueField-3 SuperNIC is purpose-built for network-intensive, massively parallel computing, making it ideal for AI workloads. It ensures that AI tasks run efficiently without bottlenecks. 
  • Consistent and predictable performance: In multi-tenant data centers where numerous tasks are processed simultaneously, the BlueField-3 SuperNIC ensures that each job and tenant’s performance is isolated, predictable and unaffected by other network activities. 
  • Secure multi-tenant cloud infrastructure: Security is a top priority, especially in data centers handling sensitive information. The BlueField-3 SuperNIC maintains high security levels, enabling multiple tenants to coexist while keeping data and processing isolated. 
  • Extensible network infrastructure: The BlueField-3 SuperNIC isn’t limited in scope; it’s highly flexible and adaptable to a myriad of other network infrastructure needs. 
  • Broad server manufacturer support: The BlueField-3 SuperNIC fits seamlessly into most enterprise-class servers without excessive power consumption in data centers.

Learn more about NVIDIA BlueField-3 SuperNICs, including how they integrate across NVIDIA’s data center platforms, in the whitepaper: Next-Generation Networking for the Next Wave of AI. 
