Evaluate the text summarization capabilities of LLMs for enhanced decision-making on AWS

Organizations across industries are using automatic text summarization to more efficiently handle vast amounts of information and make better decisions. In the financial sector, investment banks condense earnings reports down to key takeaways to rapidly analyze quarterly performance. Media companies use summarization to monitor news and social media so journalists can quickly write stories on developing issues. Government agencies summarize lengthy policy documents and reports to help policymakers strategize and prioritize goals.

By creating condensed versions of long, complex documents, summarization technology enables users to focus on the most salient content. This leads to better comprehension and retention of critical information. The time savings allow stakeholders to review more material in less time, gaining a broader perspective. With enhanced understanding and more synthesized insights, organizations can make better informed strategic decisions, accelerate research, improve productivity, and increase their impact. The transformative power of advanced summarization capabilities will only continue growing as more industries adopt artificial intelligence (AI) to harness overflowing information streams.

In this post, we explore leading approaches for evaluating summarization accuracy objectively, including ROUGE metrics, METEOR, and BERTScore. Understanding the strengths and weaknesses of these techniques can help guide selection and improvement efforts. The overall goal of this post is to demystify summarization evaluation to help teams better benchmark performance on this critical capability as they seek to maximize value.

Types of summarization

Summarization can generally be divided into two main types: extractive summarization and abstractive summarization. Both approaches aim to condense long pieces of text into shorter forms, capturing the most critical information or essence of the original content, but they do so in fundamentally different ways.

Extractive summarization involves identifying and extracting key phrases, sentences, or segments from the original text without altering them. The system selects parts of the text deemed most informative or representative of the whole. Extractive summarization is useful if accuracy is critical and the summary needs to reflect the exact information from the original text. These could be use cases like highlighting specific legal terms, obligations, and rights outlined in the terms of use. The most common techniques used for extractive summarization are term frequency-inverse document frequency (TF-IDF), sentence scoring, the TextRank algorithm, and supervised machine learning (ML).

Abstractive summarization goes a step further by generating new phrases and sentences that were not in the original text, essentially paraphrasing and condensing the original content. This approach requires a deeper understanding of the text, because the AI needs to interpret the meaning and then express it in a new, concise form. Large language models (LLMs) are best suited for abstractive summarization because the transformer models use attention mechanisms to focus on relevant parts of the input text when generating summaries. The attention mechanism allows the model to assign different weights to different words or tokens in the input sequence, enabling it to capture long-range dependencies and contextually relevant information.

In addition to these two primary types, there are hybrid approaches that combine extractive and abstractive methods. These approaches might start with extractive summarization to identify the most important content and then use abstractive techniques to rewrite or condense that content into a fluent summary.

The challenge

Finding the optimal method to evaluate summary quality remains an open challenge. As organizations increasingly rely on automatic text summarization to distill key information from documents, the need grows for standardized techniques to measure summarization accuracy. Ideally, these evaluation metrics would quantify how well machine-generated summaries extract the most salient content from source texts and present coherent summaries reflecting the original meaning and context.

However, developing robust evaluation methodologies for text summarization presents difficulties:

  • Human-authored reference summaries used for comparison often exhibit high variability based on subjective determinations of importance
  • Nuanced aspects of summary quality like fluency, readability, and coherence prove difficult to quantify programmatically
  • Wide variation exists across summarization methods from statistical algorithms to neural networks, complicating direct comparisons

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

ROUGE metrics, such as ROUGE-N and ROUGE-L, play a crucial role in evaluating the quality of machine-generated summaries compared to human-written reference summaries. These metrics focus on assessing the overlap between the content of machine-generated and human-crafted summaries by analyzing n-grams, which are groups of words or tokens. For instance, ROUGE-1 evaluates the match of individual words (unigrams), whereas ROUGE-2 considers pairs of words (bigrams). Additionally, ROUGE-L assesses the longest common subsequence of words between the two texts, which requires matched words to appear in the same order but not necessarily next to each other.

To illustrate this, consider the following examples:

  • ROUGE-1 metric – ROUGE-1 evaluates the overlap of unigrams (single words) between a generated summary and a reference summary. For example, if a reference summary contains “The quick brown fox jumps,” and the generated summary is “The brown fox jumps quickly,” the ROUGE-1 metric would consider “the,” “brown,” “fox,” and “jumps” as overlapping unigrams. ROUGE-1 focuses on the presence of individual words in the summaries, measuring how well the generated summary captures the key words from the reference summary.
  • ROUGE-2 metric – ROUGE-2 assesses the overlap of bigrams (pairs of adjacent words) between a generated summary and a reference summary. For instance, if the reference summary has “The cat is sleeping,” and the generated summary reads “A cat is sleeping,” ROUGE-2 would identify “cat is” and “is sleeping” as overlapping bigrams. ROUGE-2 provides insight into how well the generated summary maintains the sequence and context of word pairs compared to the reference summary.
  • ROUGE-N metric – ROUGE-N is a generalized form where N represents any number, allowing evaluation based on n-grams (sequences of N words). Considering N=3, if the reference summary states “The sun is shining brightly,” and the generated summary is “Sun is shining brightly,” ROUGE-3 would recognize “sun is shining” and “is shining brightly” as matching trigrams. ROUGE-N offers flexibility to evaluate summaries based on different lengths of word sequences, providing a more comprehensive assessment of content overlap.

These examples illustrate how ROUGE-1, ROUGE-2, and ROUGE-N metrics function in evaluating automatic summarization or machine translation tasks by comparing generated summaries with reference summaries based on different levels of word sequences.
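ROUGE-L, which is based on the longest common subsequence (LCS) rather than fixed-length n-grams, can be sketched in a few lines. The following is a minimal illustration, assuming lowercased whitespace tokenization and a plain F1 score; the function names lcs_length and rouge_l are chosen here for illustration and are not taken from any particular library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over the two sequences, keeping one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated, reference):
    """Return (precision, recall, F1) based on the LCS of the two summaries."""
    # Assumption: naive tokenization (lowercase, split on whitespace).
    gen = generated.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(gen, ref)
    precision = lcs / len(gen) if gen else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For the pair “The quick brown fox jumps” (reference) and “The brown fox jumps quickly” (generated), the longest common subsequence is “the brown fox jumps,” so four of the five tokens on each side are covered even though “quick” sits between them in the reference.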

Calculate a ROUGE-N score

You can use the following steps to calculate a ROUGE-N score:

  1. Tokenize the generated summary and the reference summary into individual words or tokens using basic tokenization methods like splitting by whitespace or natural language processing (NLP) libraries.
  2. Generate n-grams (contiguous sequences of N words) from both the generated summary and the reference summary.
  3. Count the number of overlapping n-grams between the generated summary and the reference summary.
  4. Calculate precision, recall, and F1 score:
    • Precision – The number of overlapping n-grams divided by the total number of n-grams in the generated summary.
    • Recall – The number of overlapping n-grams divided by the total number of n-grams in the reference summary.
    • F1 score – The harmonic mean of precision and recall, calculated as (2 * precision * recall) / (precision + recall).
  5. The aggregate F1 score obtained from calculating precision, recall, and F1 score for each row in the dataset is considered as the ROUGE-N score.
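The steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes lowercased whitespace tokenization, counts overlapping n-grams with clipping (each n-gram is matched at most as many times as it appears in the reference), and aggregates by taking the mean F1 across rows. The function names are chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(generated, reference, n=1):
    """Return (precision, recall, F1) for one generated/reference pair."""
    # Step 1: naive tokenization (an assumption; NLP tokenizers also work).
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    # Step 2: build n-gram counts for both summaries.
    gen_counts = Counter(ngrams(gen_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    # Step 3: count overlapping n-grams, clipped to the smaller count.
    overlap = sum((gen_counts & ref_counts).values())
    # Step 4: precision, recall, and F1.
    gen_total = sum(gen_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / gen_total if gen_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def corpus_rouge_n(pairs, n=1):
    """Step 5: mean F1 over (generated, reference) pairs in a dataset."""
    scores = [rouge_n(gen, ref, n)[2] for gen, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

With the reference “The quick brown fox jumps” and the generated summary “The brown fox jumps quickly,” four of the five lowercased unigrams overlap, so ROUGE-1 precision and recall are both 0.8, while only two of the four bigrams overlap, so ROUGE-2 precision and recall are both 0.5.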

Limitations

ROUGE has the following limitations:

  • Narrow focus on lexical overlap – The core idea behind ROUGE is to compare the system-generated summary to a set of reference or human-created summaries, and measure the lexical overlap between them. This means ROUGE has a very narrow focus on word-level similarity. It doesn’t actually evaluate semantic meaning, coherence, or readability of the summary. A system could achieve high ROUGE scores by simply extracting sentences word-for-word from the original text, without generating a coherent or concise summary.
  • Insensitivity to paraphrasing – Because ROUGE relies on lexical matching, it can’t detect semantic equivalence between words and phrases. Therefore, paraphrasing and use of synonyms will often lead to lower ROUGE scores, even if the meaning is preserved. This disadvantages systems that paraphrase or summarize in an abstractive way.
  • Lack of semantic understanding – ROUGE doesn’t evaluate whether the system truly understood the meanings and concepts in the original text. A summary could achieve high lexical overlap with references, while missing the main ideas or containing factual inconsistencies. ROUGE would not identify these issues.

When to use ROUGE

ROUGE is simple and fast to calculate. Use it as a baseline or benchmark for summary quality related to content selection. ROUGE metrics are most effectively employed in scenarios involving abstractive summarization tasks, automatic summarization evaluation, assessments of LLMs, and comparative analyses of different summarization approaches. By using ROUGE metrics in these contexts, stakeholders can quantitatively evaluate the quality and effectiveness of summary generation processes.

Metric for Evaluation of Translation with Explicit Ordering (METEOR)

One of the major challenges in evaluating summarization systems is assessing how well the generated summary flows logically, rather than just selecting relevant words and phrases from the source text. Simply extracting relevant keywords and sentences doesn’t necessarily produce a coherent and cohesive summary. The summary should flow smoothly and connect ideas logically, even if they aren’t presented in the same order as the original document.

METEOR addresses this by matching words flexibly: it reduces words to their root or base form (for example, after stemming, words like “running” and “runs” both become “run”) and also matches synonyms, which means METEOR correlates better with human judgments of summary quality. It can identify if important content is preserved, even if the wording differs. This is a key advantage over n-gram based metrics like ROUGE, which only look for exact token matches. METEOR also gives higher scores to summaries that focus on the most salient content from the reference. Lower scores are given to repetitive or irrelevant information. This aligns well with the goal of summarization to keep the most important content only. METEOR is a semantically meaningful metric that can overcome some of the limitations of n-gram matching for evaluating text summarization. The incorporation of stemming and synonyms allows for better assessment of information overlap and content accuracy.

To illustrate this, consider the following examples:

Reference Summary: Leaves fall during autumn.

Generated Summary 1: Leaves drop in fall.

Generated Summary 2: Leaves green in summer.

Generated summary 1 shares the exact word “Leaves” with the reference. Even though “fall” and “autumn” are different tokens, METEOR recognizes them as synonyms through its synonym matching. “Drop” and the verb “fall” are likewise matched as synonyms rather than as exact tokens. For generated summary 2, there are no matches with the reference summary besides “Leaves,” so this summary would receive a much lower METEOR score. The more semantically meaningful matches, the higher the METEOR score. This allows METEOR to better evaluate the content and accuracy of summaries compared to simple n-gram matching.

Calculate a METEOR score

Complete the following steps to calculate a METEOR score:

  1. Tokenize the generated summary and the reference summary into individual words or tokens using basic tokenization methods like splitting by whitespace or NLP libraries.
  2. Align the unigrams of the generated summary with those of the reference summary, matching exact words first, then stemmed forms (words reduced to their base or root form), and then synonyms where applicable.
  3. Calculate the unigram precision, recall, and F-mean score over the matched unigrams, giving more weightage to recall than precision.
  4. Apply a fragmentation penalty that grows as the matched words fall into more separate, out-of-order chunks relative to the reference, and reduce the F-mean score by this penalty to obtain the final METEOR score. The METEOR score ranges from 0–1, where 0 indicates no similarity between the generated summary and reference summary, and 1 indicates perfect alignment. Typically, summarization scores fall between 0–0.6.
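A simplified sketch of a METEOR-style score follows. It is not the full metric: it assumes exact-match unigrams only (no stemming or synonym lookup), a greedy left-to-right alignment, and the parameter values commonly used for the original formulation (a recall-weighted F-mean with alpha=0.9, and a fragmentation penalty of 0.5 × (chunks / matches)³). The function name meteor_exact is chosen for illustration.

```python
def meteor_exact(generated, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score using exact unigram matches only (a simplification)."""
    # Assumption: naive tokenization (lowercase, split on whitespace).
    gen = generated.lower().split()
    ref = reference.lower().split()

    # Greedy alignment: pair each generated token with the first unused,
    # identical reference token. Full METEOR also tries stems and synonyms.
    used = set()
    alignment = []  # list of (generated_index, reference_index)
    for i, token in enumerate(gen):
        for j, ref_token in enumerate(ref):
            if j not in used and token == ref_token:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(gen)
    recall = matches / len(ref)
    # Recall-weighted harmonic mean; alpha=0.9 gives 10PR / (R + 9P).
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous and in the same order
    # in both summaries; more chunks means more fragmentation.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1

    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)
```

Under these assumptions, “the cat sat” scored against “sat cat the” matches every word but in three separate chunks, so it is penalized much more heavily than the same words in the reference order. For real evaluations, an established implementation with stemming and WordNet synonym support is the better choice.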

Limitations

When employing the METEOR metric for evaluating summarization tasks, several challenges may arise:

  • Semantic complexity – METEOR’s emphasis on semantic similarity can struggle to capture the nuanced meanings and context in complex summarization tasks, potentially leading to inaccuracies in evaluation.
  • Reference variability – Variability in human-generated reference summaries can impact METEOR scores, because differences in reference content may affect the evaluation of machine-generated summaries.
  • Linguistic diversity – The effectiveness of METEOR may vary across languages due to linguistic variations, syntax differences, and semantic nuances, posing challenges in multilingual summarization evaluations.
  • Length discrepancy – Evaluating summaries of varying lengths can be challenging for METEOR, because discrepancies in length compared to the reference summary may result in penalties or inaccuracies in assessment.
  • Parameter tuning – Optimizing METEOR’s parameters for different datasets and summarization tasks can be time-consuming and require careful tuning to make sure the metric provides accurate evaluations.
  • Evaluation bias – There is a risk of evaluation bias with METEOR if not properly adjusted or calibrated for specific summarization domains or tasks. This can potentially lead to skewed results and affect the reliability of the evaluation process.

By being aware of these challenges and considering them when using METEOR as a metric for summarization tasks, researchers and practitioners can navigate potential limitations and make more informed decisions in their evaluation processes.

When to use METEOR

METEOR is commonly used to automatically evaluate the quality of text summaries. It is preferable to use METEOR as an evaluation metric when the order of ideas, concepts, or entities in the summary matters. METEOR aligns unigrams between the generated summary and the reference summaries and takes their order into account, so it rewards summaries that preserve sequential information. Unlike metrics like ROUGE, which rely on exact overlap of n-grams with reference summaries, METEOR also matches stems, synonyms, and paraphrases, so it works better when there can be multiple correct ways of summarizing the original text. It incorporates WordNet synonyms and stemmed tokens when matching words. In short, summaries that are semantically similar but use different words or phrasing will still score well. METEOR has a built-in penalty for summaries whose matched words are fragmented, that is, scattered in a different order than in the reference. METEOR is a good choice when semantic similarity, order of ideas, and fluent phrasing are important for judging summary quality. It is less appropriate for tasks where only lexical overlap with reference summaries matters.

BERTScore

Surface-level lexical measures like ROUGE and METEOR evaluate summarization systems by comparing the word overlap between a candidate summary and a reference summary. However, they rely heavily on exact string matching between words and phrases. This means they may miss semantic similarities between words and phrases that have different surface forms but similar underlying meanings. By relying only on surface matching, these metrics may underestimate the quality of system summaries that use synonymous words or paraphrase concepts differently from reference summaries. Two summaries could convey nearly identical information but receive low surface-level scores due to vocabulary differences.

BERTScore is a way to automatically evaluate how good a summary is by comparing it to a reference summary written by a human. It uses BERT, a popular NLP technique, to understand the meaning and context of words in the candidate summary and reference summary. Specifically, it looks at each word or token in the candidate summary and finds the most similar word in the reference summary based on the BERT embeddings, which are vector representations of the meaning and context of each word. It measures the similarity using cosine similarity, which tells how close the vectors are to each other. For each word in the candidate summary, it finds the most related word in the reference summary using BERT’s understanding of language. It compares all these word similarities across the whole summary to get an overall score of how semantically similar the candidate summary is to the reference summary. The more similar the words and meanings captured by BERT, the higher the BERTScore. This allows it to automatically evaluate the quality of a generated summary by comparing it to a human reference without needing human evaluation each time.

To illustrate this, imagine you have a machine-generated summary: “The quick brown fox jumps over the lazy dog.” Now, let’s consider a human-crafted reference summary: “A fast brown fox leaps over a sleeping canine.”

Calculate a BERTScore

Complete the following steps to calculate a BERTScore:

  1. BERTScore uses contextual embeddings to represent each token in both the candidate (machine-generated) and reference (human-crafted) sentences. Contextual embeddings are a type of word representation in NLP that captures the meaning of a word based on its context within a sentence or text. Unlike traditional word embeddings that assign a fixed vector to each word regardless of its context, contextual embeddings consider the surrounding words to generate a unique representation for each word depending on how it is used in a specific sentence.
  2. The metric then computes the similarity between each token in the candidate sentence with each token in the reference sentence using cosine similarity. Cosine similarity helps us quantify how closely related two sets of data are by focusing on the direction they point in a multi-dimensional space, making it a valuable tool for tasks like search algorithms, NLP, and recommendation systems.
  3. By comparing the contextual embeddings and computing similarity scores for all tokens, BERTScore generates a comprehensive evaluation that captures the semantic relevance and context of the generated summary compared to the human-crafted reference.
  4. The final BERTScore output provides a similarity score that reflects how well the machine-generated summary aligns with the reference summary in terms of meaning and context.
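The matching in steps 2–4 can be sketched without a model by using made-up vectors in place of real contextual embeddings. In the following illustration, the three-dimensional vectors in TOY_EMBEDDINGS are hypothetical and exist only to make the example runnable; an actual BERTScore computation would obtain one contextual embedding per token from a pre-trained model, and may also apply token weighting and score rescaling, which are omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(cand_vecs, ref_vecs):
    """Return (precision, recall, F1) from greedy best-match cosine similarity."""
    if not cand_vecs or not ref_vecs:
        return 0.0, 0.0, 0.0
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy vectors, NOT real BERT embeddings.
TOY_EMBEDDINGS = {
    "quick": [0.9, 0.1, 0.0],
    "fast": [0.8, 0.2, 0.0],
    "fox": [0.0, 1.0, 0.1],
    "dog": [0.1, 0.2, 0.9],
    "canine": [0.0, 0.3, 0.9],
}

candidate = [TOY_EMBEDDINGS[t] for t in ["quick", "fox", "dog"]]
reference = [TOY_EMBEDDINGS[t] for t in ["fast", "fox", "canine"]]
precision, recall, f1 = greedy_match_score(candidate, reference)
```

Because the toy vectors for “quick” and “fast,” and for “dog” and “canine,” point in nearly the same direction, the candidate scores close to 1 even though only “fox” matches exactly, which is the behavior that distinguishes embedding-based scoring from lexical overlap.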

In essence, BERTScore goes beyond traditional metrics by considering the semantic nuances and context of sentences, offering a more sophisticated evaluation that closely mirrors human judgment. This advanced approach enhances the accuracy and reliability of evaluating summarization tasks, making BERTScore a valuable tool in assessing text generation systems.

Limitations

Although BERTScore offers significant advantages in evaluating summarization tasks, it also comes with certain limitations that need to be considered:

  • Computational intensity – BERTScore can be computationally intensive due to its reliance on pre-trained language models like BERT. This can lead to longer evaluation times, especially when processing large volumes of text data.
  • Dependency on pre-trained models – The effectiveness of BERTScore is highly dependent on the quality and relevance of the pre-trained language model used. In scenarios where the pre-trained model may not adequately capture the nuances of the text, the evaluation results may be affected.
  • Scalability – Scaling BERTScore for large datasets or real-time applications can be challenging due to its computational demands. Implementing BERTScore in production environments may require optimization strategies to provide efficient performance.
  • Domain specificity – BERTScore’s performance may vary across different domains or specialized text types. Adapting the metric to specific domains or tasks may require fine-tuning or adjustments to produce accurate evaluations.
  • Interpretability – Although BERTScore provides a comprehensive evaluation based on contextual embeddings, interpreting the specific reasons behind the similarity scores generated for each token can be complex and may require additional analysis.
  • Reference dependence – Although BERTScore tolerates differences in wording between the candidate and the reference, it still scores the candidate against a reference summary. It may therefore not fully capture all aspects of summarization quality, particularly in scenarios where the human-crafted references are themselves incomplete or inconsistent in content relevance and coherence.

Acknowledging these limitations can help you make informed decisions when using BERTScore as a metric for evaluating summarization tasks, providing a balanced understanding of its strengths and constraints.

When to use BERTScore

BERTScore can evaluate the quality of text summarization by comparing a generated summary to a reference summary. It uses neural networks like BERT to measure semantic similarity beyond just exact word or phrase matching. This makes BERTScore very useful when semantic fidelity, that is, preserving the full meaning and content, is critical for your summarization task. BERTScore will give higher scores to summaries that convey the same information as the reference summary, even if they use different words and sentence structures. The bottom line is that BERTScore is ideal for summarization tasks where retaining the full semantic meaning, not just keywords or topics, is vital. Its advanced neural scoring allows it to compare meaning beyond surface-level word matching. This makes it suitable for cases where subtle differences in wording can substantially alter overall meaning and implications. BERTScore, in particular, excels in capturing semantic similarity, which is crucial for assessing the quality of abstractive summaries like those produced by Retrieval Augmented Generation (RAG) models.

Model evaluation frameworks

Model evaluation frameworks are essential for accurately gauging the performance of various summarization models. These frameworks are instrumental in comparing models, providing coherence between generated summaries and source content, and pinpointing deficiencies in evaluation methods. By conducting thorough assessments and consistent benchmarking, these frameworks propel text summarization research by advocating standardized evaluation practices and enabling multifaceted model comparisons.

In AWS, the FMEval library within Amazon SageMaker Clarify streamlines the evaluation and selection of foundation models (FMs) for tasks like text summarization, question answering, and classification. It empowers you to evaluate FMs based on metrics such as accuracy, robustness, creativity, bias, and toxicity, supporting both automated and human-in-the-loop evaluations for LLMs. With UI-based or programmatic evaluations, FMEval generates detailed reports with visualizations to quantify model risks like inaccuracies, toxicity, or bias, helping organizations align with their responsible generative AI guidelines. In this section, we demonstrate how to use the FMEval library.

Evaluate Claude v2 on summarization accuracy using Amazon Bedrock

The following code snippet is an example of how to interact with the Anthropic Claude model using Python code:

import json

import boto3

# Create the Amazon Bedrock runtime client used to invoke the model.
# Assumes AWS credentials and a Region with Amazon Bedrock access are already configured.
bedrock_runtime = boto3.client("bedrock-runtime")

# We use Claude v2 in this example.
# See https://docs.anthropic.com/claude/reference/claude-on-amazon-bedrock#list-available-models
# for instructions on how to list the model IDs for all available Claude model variants.
model_id = "anthropic.claude-v2"
accept = "application/json"
contentType = "application/json"

# `prompt_data` is structured in the format that the Claude model expects, as documented here:
# https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html#model-parameters-claude-request-body
# Each turn is preceded by a blank line ("\n\nHuman:" and "\n\nAssistant:").
prompt_data = """

H: Who is Barack Obama?

A:
"""

# For more details on parameters that can be included in `body` (such as "max_tokens_to_sample"),
# see https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html#model-parameters-claude-request-body
body = json.dumps({"prompt": prompt_data, "max_tokens_to_sample": 500})

# Invoke the model
response = bedrock_runtime.invoke_model(
    body=body, modelId=model_id, accept=accept, contentType=contentType
)

# Parse the invocation response
response_body = json.loads(response.get("body").read())
print(response_body.get("completion"))

In simple terms, this code performs the following actions:

  1. Import the necessary libraries, including json, to work with JSON data.
  2. Define the model ID as anthropic.claude-v2 and set the content type for the request.
  3. Create a prompt_data variable that structures the input data for the Claude model. In this case, it asks the question “Who is Barack Obama?” and expects a response from the model.
  4. Construct a JSON object named body that includes the prompt data, and specify additional parameters like the maximum number of tokens to generate.
  5. Invoke the Claude model using bedrock_runtime.invoke_model with the defined parameters.
  6. Parse the response from the model, extract the completion (generated text), and print it out.

Make sure the AWS Identity and Access Management (IAM) role associated with the Amazon SageMaker Studio user profile has access to the Amazon Bedrock models being invoked. Refer to Identity-based policy examples for Amazon Bedrock for guidance on best practices and examples of identity-based policies for Amazon Bedrock.

Using the FMEval library to evaluate the summarized output from Claude

We use the following code to evaluate the summarized output:

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Describe the dataset: a JSON Lines file whose records hold the source text
# under "document" and the reference summary under "summary".
config = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri="gigaword_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary"
)

# Wrap the Amazon Bedrock model; `model_id` is defined in the earlier snippet.
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='completion',
    content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}'
)

# Run the summarization accuracy evaluation and save the per-record results.
eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(
    model=bedrock_model_runner,
    dataset_config=config,
    prompt_template="Human: Summarise the following text in one sentence: $feature\n\nAssistant:\n",
    save=True
)

In the preceding code snippet, to evaluate text summarization using the FMEval library, we complete the following steps:

  1. Create a ModelRunner to perform invocation on your LLM. The FMEval library provides built-in support for Amazon Bedrock models (as used in this example), Amazon SageMaker endpoints, and Amazon SageMaker JumpStart LLMs. You can also extend the ModelRunner interface for any LLMs hosted anywhere.
  2. Use supported eval_algorithms like toxicity, summarization accuracy, and semantic robustness, based on your evaluation needs.
  3. Customize the evaluation configuration parameters for your specific use case.
  4. Use the evaluation algorithm with either built-in or custom datasets to evaluate your LLM. The dataset used in this case is sourced from the following GitHub repo.
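If you want to evaluate a custom dataset, the configuration shown earlier implies a JSON Lines file in which each record carries the source text and the reference summary. The following sketch writes and reads such a file; the field names document and summary mirror the model_input_location and target_output_location values used above, whereas the helper names, the file name my_summaries.jsonl, and the sample records are made up for illustration.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines format)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records for illustration only; real data would come from your own corpus.
sample_records = [
    {
        "document": "Shares of the company rose sharply after it reported record quarterly profit.",
        "summary": "Company shares rise on record profit.",
    },
    {
        "document": "Heavy rain caused flooding across the region and closed several major roads.",
        "summary": "Flooding closes roads after heavy rain.",
    },
]
```

Calling write_jsonl("my_summaries.jsonl", sample_records) would produce a file that a DataConfig could point to through dataset_uri, with model_input_location set to document and target_output_location set to summary.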

Refer to the developer guide and examples for detailed usage of evaluation algorithms.

The following table summarizes the results of the evaluation.

| model_input | model_output | target_output | prompt | scores | meteor_score | rouge_score | bert_score |
|---|---|---|---|---|---|---|---|
| John Edward Bates, formerly of Spalding, Linco… | I cannot make any definitive judgments, as th… | A former Lincolnshire Police officer carried o… | Human: John Edward Bates, formerly of Spalding… | [{'name': 'meteor', 'value': 0.101010101010101… | 0.10101 | 0 | 0.557155 |
| 23 October 2015 Last updated at 17:44 BST\nIt'… | Here are some key points about hurricane/trop… | Hurricane Patricia has been rated as a categor… | Human: 23 October 2015 Last updated at 17:44 B… | [{'name': 'meteor', 'value': 0.102339181286549… | 0.102339 | 0.018265 | 0.441421 |
| Ferrari appeared in a position to challenge un… | Here are the key points from the article:\n\n… | Lewis Hamilton stormed to pole position at the… | Human: Ferrari appeared in a position to chall… | [{'name': 'meteor', 'value': 0.322543352601156… | 0.322543 | 0.078212 | 0.606487 |
| The Bath-born player, 28, has made 36 appearan… | Okay, let me summarize the key points:\n\n- E… | Newport Gwent Dragons number eight Ed Jackson | Human: The Bath-born player, 28, has made 36 a… | [{'name': 'meteor', 'value': 0.105740181268882… | 0.10574 | 0.012987 | 0.539488 |
| Weaknesses in the way mice swapped data with c… | Here are the key points I gathered from the a… | Hackers could gain access to home and | Human: Weaknesses in the way mice swapped data… | [{'name': 'meteor', 'value': 0.201048289433848… | 0.201048 | 0.021858 | 0.526947 |

Check out the sample notebook for more details about the summarization evaluation that we discussed in this post.

Conclusion

ROUGE, METEOR, and BERTScore all measure the quality of machine-generated summaries, but focus on different aspects like lexical overlap, fluency, or semantic similarity. Make sure to select the metric that aligns with what defines “good” for your specific summarization use case. You can also use a combination of metrics. This provides a more well-rounded evaluation and guards against potential weaknesses of any individual metric. With the right measurements, you can iteratively improve your summarizers to meet whichever notion of accuracy matters most.

Additionally, FM and LLM evaluation is necessary to be able to productionize these models at scale. With FMEval, you get a vast set of built-in algorithms across many NLP tasks, but also a scalable and flexible tool for large-scale evaluations of your own models, datasets, and algorithms. To scale up, you can use this package in your LLMOps pipelines to evaluate multiple models. To learn more about FMEval in AWS and how to use it effectively, refer to Use SageMaker Clarify to evaluate large language models. For further understanding and insights into the capabilities of SageMaker Clarify in evaluating FMs, see Amazon SageMaker Clarify Makes It Easier to Evaluate and Select Foundation Models.


About the Authors

Dinesh Kumar Subramani is a Senior Solutions Architect based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning, and is a member of the technical field community within Amazon. Dinesh works closely with UK Central Government customers to solve their problems using AWS services. Outside of work, Dinesh enjoys spending quality time with his family, playing chess, and exploring a diverse range of music.

Pranav Sharma is an AWS leader driving technology and business transformation initiatives across Europe, the Middle East, and Africa. He has experience in designing and running artificial intelligence platforms in production that support millions of customers and deliver business outcomes. He has played technology and people leadership roles for Global Financial Services organizations. Outside of work, he likes to read, play tennis with his son, and watch movies.

Blast From the Past: Stream ‘StarCraft’ and ‘Diablo’ on GeForce NOW

Support for Battle.net on GeForce NOW expands this GFN Thursday, as titles from the iconic StarCraft and Diablo series come to the cloud.

StarCraft Remastered, StarCraft II, Diablo II: Resurrected and Diablo III are part of 16 new games joining the GeForce NOW library of more than 1,900 titles.

Plus, a new update rolling out for members this week brings AV1 streaming to Mac M3 computers. This feature will improve game-streaming quality for members on M3, M3 Pro and M3 Max devices.

Plenty of Space in Hell

Dive into the original Blizzard games that set the stage for real-time strategy and action role-playing games (RPGs). StarCraft Remastered, StarCraft II, Diablo II: Resurrected and Diablo III bring galactic warfare, epic quests and legendary battles to the cloud.

StarCraft Remastered on GeForce NOW
Oh my Zerg.

In StarCraft Remastered, command one of three races — Terran, Zerg or Protoss — as they desperately struggle for survival. Build bases, gather resources and engage in intense battles using unique units and strategies.

Time to Pylon to the cloud.

Continue the saga with StarCraft II, with enhanced graphics and extended storytelling. Save the galaxy from emergent threats in full-length Terran, Zerg and Protoss campaigns. Take charge of all multiplayer units solo in Versus Mode, team up with a friend for Co-Op Missions or explore community-created game modes in the Arcade.

Diablo III on GeForce NOW
The fires of hell heat up the cloud once again.

In Diablo III, become a hero to battle the forces of darkness, uncover ancient secrets and face powerful foes in the action RPG set in the world of Sanctuary. With various character classes, intense combat and a rich loot system, members can enjoy a gripping single-player experience and cooperative multiplayer adventures.

Diablo II on GeForce NOW
Remastered goodness.

Pursue the mysterious Dark Wanderer and battle the denizens of hell in the remastered action RPG Diablo II: Resurrected. The title’s classic Diablo gameplay, enhanced with stunning 3D visuals for all the environments, characters and monsters, enables a nostalgic, high-quality return to hell.

Stream all the action at up to 4K resolution or up to 240 frames per second with an Ultimate membership. These top games join the Battle.net games first added to GeForce NOW, including Diablo IV, Overwatch 2, Call of Duty HQ and Hearthstone.

Remember the Cloud

Remnant II DLC on GeForce NOW
Unrelenting odds are no problem for the cloud.

The second downloadable content (DLC) for Arc Games’ Remnant II is available for members to stream. Experience a brand-new storyline, area, weapons, bosses and more in The Forgotten Kingdom.

Piece together the forgotten history of the lost tribe of Yaesha in an attempt to quell the vengeful wrath of Lydusa, an ancient stone spirit. Navigate the lingering traces of torment, treachery and death that haunt the land’s once-proud ziggurats. Traverse new dungeons, acquire powerful gear — including a new Archetype, “The Invoker” — meet unexpected allies and face new threats to return a semblance of peace to the forgotten kingdom.

GeForce NOW members will be able to stream the DLC without waiting around for downloads. Uncover the secrets of the lost tribe with an Ultimate membership for eight-hour gaming sessions and support for ultrawide resolutions.

New Adventures

Manor Lords on GeForce NOW
Grow from a humble hamlet to a hub for the kingdom in “Manor Lords.”

Guide a medieval village as it grows into a bustling city in Manor Lords, streaming this week on GeForce NOW. Manage resources and production chains in this historically accurate city builder while expanding the land through large-scale tactical battles.

Check out the full list of new games this week:

  • Dead Island 2 (New release on Steam, April 22)
  • Bellwright (New release on Steam, April 23)
  • Phantom Fury (New release on Steam, April 23)
  • Oddsparks: An Automation Adventure (New release on Steam, April 24)
  • Age of Water (New release on Steam, April 25)
  • Manor Lords (New release on Steam and Xbox, April 26, available on PC Game Pass)
  • 9-Bit Armies: A Bit Too Far (Steam)
  • Diablo II: Resurrected (Battle.net)
  • Diablo III (Battle.net)
  • Dragon’s Dogma 2 Character Creator & Storage (Steam)
  • Islands of Insight (Steam)
  • Metaball (Steam)
  • StarCraft Remastered (Battle.net)
  • StarCraft II (Battle.net)
  • Stargate: Timekeepers (Steam)
  • Tortuga – A Pirate’s Tale (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Into the Omniverse: Unlocking the Future of Manufacturing With OpenUSD on Siemens Teamcenter X

Editor’s note: This post is part of Into the Omniverse, a series focused on how artists, developers and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse.

Universal Scene Description, aka OpenUSD, is elevating the manufacturing game. Siemens, a leader in industrial technology, has embraced OpenUSD as a cornerstone of its digital transformation journey, using it to help bridge the gap between physical and virtual worlds.

Siemens is adding support for OpenUSD in its Siemens Xcelerator platform applications, starting with Teamcenter X software.

The integration empowers manufacturers to create photorealistic, robust digital twins that mirror real-world counterparts with unprecedented fidelity and efficiency. This allows for optimized resource utilization, minimized waste and enhanced product quality through comprehensive simulation and analysis — all of which align with sustainability and quality objectives.

For a company such as Siemens — one whose software touches all parts of the manufacturing cycle — digitalization can mean helping customers save time and costs, streamline workflows and reduce risk of manufacturing defects.

Ian Fisher, a member of Siemens Digital Industries Software team, is no stranger to the impact of embracing digital transformation — especially one powered by OpenUSD and generative AI.

“We are an industrial company where data is king,” he said. “OpenUSD comes in from the media side of the world, and we are looking to bring its openness and flexibility into the industrial world.”

Enterprises of all sizes depend on Siemens’ Teamcenter software, part of the Siemens Xcelerator platform, to develop and deliver products at scale. By connecting NVIDIA Omniverse — a platform of APIs and services based on OpenUSD — with Teamcenter X, Siemens’ cloud-based product lifecycle management software, engineering teams can make their physics-based digital twins more photorealistic and immersive, improving accuracy and minimizing waste and errors within workflows.

Siemens’ adoption of OpenUSD means that companies like HD Hyundai, a leader in sustainable ship manufacturing, can consolidate and visualize complex engineering projects directly within Teamcenter X.

OpenUSD is touching other parts of Siemens as well. Siemens produces inverters, drive controllers and motors for more than 30,000 customers worldwide. Its lead electronics plant, GWE, in Erlangen, Germany, has been developing use cases from AI-enabled computer vision for defect detection to training pick-and-place robots.

One of their main challenges has been acquiring data to train the AI models that fuel these use cases. By building custom synthetic data generation pipelines using Omniverse Replicator, powered by OpenUSD, the engineers were able to generate large sets of diverse training data by varying many parameters including color, texture, background, lighting and more — allowing them to not only bootstrap but also quickly iterate on their AI models.

Committed to a future of widespread OpenUSD integration, Siemens was one of eight new general members that joined the Alliance for OpenUSD (AOUSD) last month, an organization dedicated to interoperability of 3D content through standardization.

Fisher and other special guests discuss the impact of OpenUSD on industrial digitization workflows in a livestream replay.

Get Plugged Into the World of OpenUSD

Siemens and OpenUSD took center stage this week at Hannover Messe, the world’s leading industrial trade fair. Siemens CEO Roland Busch and Rev Lebaredian, vice president of Omniverse and simulation technology at NVIDIA, shared their vision on the potential of OpenUSD for customers in all industries.

For more on how Siemens is using OpenUSD to build and test complex AI-based automation systems completely virtually, watch the replay of the GTC session, “Virtual Commissioning of AI Vision Systems With OpenUSD.” All other sessions from GTC’s OpenUSD Day are available for viewing on demand.

Get started with NVIDIA Omniverse by downloading the standard license for free, accessing OpenUSD resources and learning how Omniverse Enterprise can connect teams. Follow Omniverse on Instagram, Medium and X. For more, join the Omniverse community on the forums, Discord server, Twitch and YouTube channels.

Featured image courtesy of Siemens, HD Hyundai.

Ideas: Exploring AI frontiers with Rafah Hosn

Microsoft Research Podcast: Ideas - Rafah Hosn

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Rafah Hosn, partner, group product manager for AI Frontiers at Microsoft Research. Hosn’s professional experience spans the gamut—from research to product to engineering to research again, the discipline’s uniquely high levels of creativity, curiosity, and intellect drawing her back in. Energized by past technical disruptions she’s experienced, Hosn is on what she describes as her “most exciting adventure” yet, helping to drive scientific advancement in AI and to answer a big question: how far can we push machine intelligence while still delivering technologies people can derive value from? 

Transcript 

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

RAFAH HOSN: What has changed is that in the old days, we had the luxury of creating something, going and piloting for three months until we know whether it works or not, and then taking one year to productize! That … that, that doesn’t work anymore! Because guess what? In three months, this innovation is, like, topped by four other innovations, be it at Microsoft or elsewhere. So that speed is really shifting the mindset and the spirit of people. 

[TEASER ENDS] 

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward. 


[MUSIC FADES] 

My guest today is Rafah Hosn. She’s a partner, group product manager for AI Frontiers at Microsoft Research. I’d call Rafah a sort of organizational conductor, working both with leaders to drive clarity around the mission as well as program managers to make sure they have solid operational strategies to execute on it. Rafah has mad skills in bringing research ideas from lab to life, and I’m thrilled to talk to her today. Rafah Hosn, welcome to Ideas!

RAFAH HOSN: Thank you, Gretchen. Oh, my goodness, I have to live up to this introduction now! [LAUGHTER] 

HUIZINGA: Well, before we talk about research ideas, let’s talk about you and your own sort of “reason for being” in the research world. How would you describe your motivation for working in research and—assuming there was one—what was the “big idea” or animating “what if?” behind what you’re doing today? 

HOSN: Yeah, you know, I don’t know. There are so many big ideas, to be honest! Every day, I wake up and I often tell my husband how lucky, like so totally lucky and humbled, I am to be where I am right now in this moment, like right now when society as we know it is being totally disrupted by this huge leap in AI. And why research? Well, I’ve tried it all, Gretchen! I’ve been in research, I went to product, I did engineering, and I did full circle and came back to research. Because, you know, for me personally, there’s no other environment that I know of, for me, that has this amount of creativity and just infinite curiosity and intellect. So working with people that are asking “what next?” and trying to imagine the next world beyond where AI is today is just … this is the big idea. This is why I’m here. This is why I’m excited to come to work every day. 

HUIZINGA: Yeah. Well … and I want to drill in a little bit just, sort of, personally because sometimes there’s a story, an origin story, if you will, of some pivotal aha moment that you say, oh, that’s fascinating, that’s cool, that’s what I want to do. Anything that piqued your interest way back when you were a kid or, sort of, a pivotal moment in your educational years? 

HOSN: Yeah, you know, so many different things that inspire you along the journey, right. It’s not just one thing, Gretchen. My dad was a doctor. He was my biggest inspiration growing up. And the reason is because he had a lot of depth of knowledge in his domain. And I wanted that. I wanted to have depth of knowledge in a domain. So I went engineering against his advice. He really wanted me to be a doctor. [LAUGHTER] So he was not too happy. But, you know, throughout my education, you know, I was there when smartphones came about, when the internet was a thing. And now, like with generative AI, I feel like I’ve lived through so many disruptions, and every one of those was, “Oh my gosh! Like, I am exactly where I want to be!” So multiple inspirations, and every day, I wake up and there’s new news and I’m saying to myself, “OK, that’s great.” I love it! 

HUIZINGA: What a time to be alive! 

HOSN: It is amazing!

HUIZINGA: Yeah. Well, you recently took on this new role in AI Frontiers at Microsoft Research. And that very word “frontiers” evokes images of unexplored, uncharted territories like the Wild West or for Trekkies, maybe “space: the final frontier.” So what does it mean to you to be working at the frontier of artificial intelligence, and what’s the big idea behind AI Frontiers? 

HOSN: You know, it’s my biggest and most exciting adventure so far! Working under Ece Kamar’s leadership in this AI Frontiers is really trying to push ourselves to think, what’s beyond what there is right now in artificial intelligence? Where can we push more, from a scientific perspective? How do we translate these scientific discoveries into capabilities that people can actually use and derive value from? It’s a big responsibility, as well, because we just don’t want to push the boundaries of AI for the sake of pushing. We want to push it in a safe and responsible way. So it is a big responsibility. 

HUIZINGA: Yeah … 

HOSN: And fundamentally, you know, the unifying big idea in this team is to explore, you know, how far can we push intelligence further into models and encapsulations of those models so that we can, you know, have not just sort of an assistant but really a personal assistant, an agent that can, kind of, do tasks for us, with us, seamlessly across multiple domains? So this is what we’re trying to push for. 

HUIZINGA: Mmm. Rafah, do you feel like you’re at the frontier of artificial intelligence? I mean, what are the emotions that crop up when you are dealing with these things—that you and your teams basically know about but the rest of us don’t?

HOSN: For most days, it’s excitement. Sometimes it’s [LAUGHTER] … it ranges, to be honest. I would say there’s a spectrum of emotions. The dominating one is really just excitement. There’s so much that has happened with GenAI, but I feel like it has opened up so many different paths, as well, for us to explore, and that’s the excitement. And then every time the world accomplishes something, you’re like in astonishment. You’re like, wow, wow. 

HUIZINGA: Yeah … 

HOSN: And then, and then, oh my gosh, what’s next? And so, it’s a range of emotions … 

HUIZINGA: Right … 

HOSN: … but I would say the dominating one is enthusiasm.

HUIZINGA: Yeah. Well, I’ve heard other people on your teams use words like surprise, sometimes even shock … 

HOSN: Yeah, yeah, there are a lot of “wow” factors. Every day, every day, I wake up, I read like my three favorite AI tweets or things like that, and I’m like, “Oh my gosh. I wouldn’t have imagined that this model could do this thing,” so [LAUGHS] … um, but it’s exciting. 

HUIZINGA: We may have to get those accounts in the show notes so that we can follow along with your surprise and amazement in the mornings! 

HOSN: [LAUGHS] Yes! 

HUIZINGA: Well, listen, when we talk about measuring the success of an AI system, we often use the common convention of what we call benchmarks. But I want to zoom out from AI systems for a minute and ask how you might measure the success of an AI lab, which is what you’re working in. What are your benchmarks or key performance indicators—we call them KPIs—for the work going on at AI Frontiers? 

HOSN: Yeah, so I’m going to start with something that may sound surprising maybe to some, but I think it’s the culture first. It’s the culture of endless curiosity, of enthusiasm coupled with a bit of skepticism, to be honest, to ask the questions, the right questions, and this drive to push further. So I would say one KPI of success for me, personally, is, you know, can we maintain this culture of enthusiasm coupled with skepticism so we can ask hard questions and an envelope of enthusiasm and drive for everyone? So that’s one. I would say the other three are … one is around how much can we push scientifically as a community, right? This is a team of people that are getting together with a mission to push the boundaries of our understanding of artificial intelligence. So are we pushing those scientific boundaries? Are we creating insights, not just for the scientific community, but also for Microsoft and the world, so that we know how to derive value from these discoveries, right? At the end of the day, it is awesome to push scientifically. It’s even more awesome if you take this and translate it into something a human being can use … 

HUIZINGA: Yeah … 

HOSN: … or an enterprise can use. And I think … that’s kind of my KPIs of success. Culture first, pushing on the scientific boundaries, creating insights for the scientific community as well as for Microsoft so we can derive value for us as a society, right. 

HUIZINGA: Yeah. Well, continuing on this idea of success, and you’ve alluded to this already in terms of characteristics of curiosity and so on, part of your job, as you put it, was “enabling brilliant minds to find success.” So talk a little bit about the personal qualities of these brilliant minds and how you help them find success.

HOSN: Yeah, you know, everybody I work with brings different aspects of brilliance to the table—every day. So in our community of engineers, PMs, researchers, everybody is present with their ideas and their strengths. And they’re pulling together to push harder and faster on our key priorities. And I find folks working in AI these days, you know, to have a renewed fire. It’s really amazing to see. And I talk a lot about curiosity, but, you know, I cannot emphasize how much this is driving a lot of our community to explore new paths that they hadn’t thought about prior to this GenAI coming along. And so everybody is showing up, present, asking these questions and trying to solve new scenarios, new problems that are emerging. And from my perspective, you know, as you mentioned, I just try to unblock, basically. My team and I are here to [LAUGHTER] … well, two things I would say. First is bring the outside-in perspective. That’s so important because science is amazing, but unless you can derive value from it, it remains an awesome paper and an awesome equation, right. So asking, who can use this? What are the scenarios it could, you know, light up? How can we derive value? So those are the questions that my team and I can contribute to, and we are trying to participate from ideation all the way to basically delivering on key milestones. And that last mile is so important. Like, once you know what you want to do, how do you structure? How do you have an operational strategy that is amenable to these times, which is fast, fast, fast, and faster? So that’s, kind of, what we’re trying to do here. 

HUIZINGA: Yeah, yeah. Well, two things came to my mind in terms of what kinds of people would end up working in this area. And one would be agility, or agile. And that would, to me, represent in a researcher that the person would be able to spin or pivot if something didn’t work out. And the other one is sort of a risk-reward mentality. It’s like, where are you willing to push to get that reward versus what might keep you from even trying? 

HOSN: Yeah, so definitely in this AI Frontiers community, I’m finding a lot of adaptability. So people willing to try, failing fast when they fail, and pivoting. And you have to, nowadays, in this atmosphere that we are living in. And because we have the privilege of working in research—and it’s really an honor and a privilege, and I’m not saying it just lightly—but it is the place where you can take risks, Gretchen. It is the place where failing is totally fine because you’re learning and you’re pivoting in a way that allows you to progress on the next thing you tackle. So I feel like most of the people I work with in this community, AI Frontiers, we are risk takers. We want to push, and it’s OK to fail, and it’s OK to adapt. So, I think, as an aggregate, that’s kind of the spirit I’m seeing. 

HUIZINGA: In the past, Rafah, you’ve stressed the importance of both teams and timing. And so we’ve been talking about the teams and the minds and the kinds of qualities in those people. But what about the “when” of research? How does timing impact what gets done in your world?

HOSN: Well, in this new era, Gretchen, everything is yesterday! [LAUGHS] I mean, it is true. AI research is moving at such speeds that I feel like we need to get accustomed to a timing of now. And if it’s not now, it’s yesterday. So the timing is important, but the leeway has shrunk so significantly that I feel like we have to really just be present in the moment and just move as fast as we can because everybody else is moving at the highest speed. So timing is “now,” is what I would say. 

HUIZINGA: On that note, with so many innovations in AI coming out every day, every minute, what you’ve just expressed is that research horizons are shorter than ever. But as one of your team members noted in a recent panel, it still takes a lot of time to translate a research artifact, maybe a noteworthy finding or a published paper or an equation, an algorithm, into a useful product for humans. So how are you then dealing with these newly compressed timelines of “it needs to be done yesterday to keep up,” and how has the traditional research-to-product pipeline changed? 

HOSN: Yeah, it’s an awesome question. It is so true that going from research to a production-quality algorithm or capability takes time. But what I’m seeing is that the research-to-capabilities is accelerating, meaning if you look at the world today in generative AI and its surrounding, folks even in research are creating assets as they are creating their research. And so they are thinking as well, how do I showcase this? And of course, these assets are not production ready. But here’s the kicker. I think that the product teams are also adapting to this generative AI era, and they are changing to meet this disruptive moment. They are changing the way they think, and they are accelerating the way they productize and look at hardening and securing the assets so that they can put them in the hands of even a limited set of users just to get a feel of what it means to have them in the hands of end users and quickly iterating so that they can further harden and further improve the design until it’s production ready. And I feel like our product partners are meeting the moments, meaning they also are really adapting their processes such that they can get these assets and put them in the hands of users and test them out before they actually release them. 

HUIZINGA: Right. Let’s drill in a little bit more on that and talk about the traditional research-to-product pipeline, where you would have a researcher working on something and then an RSDE. What does RSDE stand for? 

HOSN: A research software development engineer. It’s a mouthful. 

HUIZINGA: Right. And then to the PM, or program manager, and then to the engineer. And you’ve said this provocative statement: now everyone is a PM! 

HOSN: Everyone is a PM! [LAUGHTER] 

HUIZINGA: What do you mean by that?

HOSN: I just, I just feel like if we are to meet the moment, we need to be thinking outside-in, inside-out simultaneously. And I believe that the spirit of program management, which is looking at the design from a user-centric perspective, is embedded as we are ideating, as we are trying to explore new methodologies, new algorithms, new assets. And so what has changed is that in the old days, we had the luxury of creating something, going and piloting for three months until we know whether it works or not, and then taking one year to productize! That … that, that doesn’t work anymore. [LAUGHTER] 

HUIZINGA: Right. 

HOSN: Because guess what? In three months, this innovation is, like, topped by four other innovations, be it at Microsoft or elsewhere. So that speed is really shifting the mindset and the, and the spirit of people. I have colleagues and friends, researchers, that are asking me, oh, scenarios, users … I mean it’s amazing to see. So, yes, everybody has gotten a little PM in them now. [LAUGHTER] 

HUIZINGA: Yeah, I did a podcast with Shamsi Iqbal and Jina Suh. And Shamsi was talking about this concept, this old concept, of the researcher being in their lab and saying, well, I’ve done this work; now go see what you want to do with it. I don’t think you have that affordance anymore as a researcher. 

HOSN: No … 

HUIZINGA: You’ve got to work much more tightly with other team members and think like a PM. 

HOSN: Totally. 

HUIZINGA: So let’s talk about how the big general idea behind AI Frontiers is giving birth to smaller, more specific ideas. What are some of the research directions and projects that you could tell us about that illustrate this vision here? 

HOSN: Yeah, and I’m sure you’ve heard some of it come from Ece Kamar as she spoke on this community that we have. In AI Frontiers, we’re exploring, I would say, three major areas of research. And I want you to imagine a stack. At the bottom of the stack, we’re asking ourselves questions around, what are some new architectures we can be thinking about for these foundational models? How do we create them? What kind of data we need to train them, to pre-train them. And then on top of that stack, which starts with a foundation model, we’re asking ourselves, OK great, you have a pretrained model. In a lot of cases, when you’re creating especially small models, you need to fine-tune them. So what is this methodology and data generation pipeline that we’re going to use to fine-tune these models and specialize them for both across domains and across skill set? And on top of that—so now we’re on the third layer—we have a final layer that encapsulates these models and orchestrates among them to allow them the ability to do, you know, complex tasks. And we don’t want to stop there because for us it’s … we don’t want to have an agent that just does things and doesn’t learn. So that learnability, that learning on the job, like we do as humans, is something we’re asking ourselves, as well. Like, how do we encapsulate these models? We orchestrate among them. And we allow these encapsulated things, we call them agents, to learn on the job so that they can accomplish more complex tasks. So those are the three things. And then cutting across these three layers, imagine there’s a thing that cuts across them, is doing everything in a way that allows us to rigorously evaluate and to ensure that we’re doing things in a safe and responsible way. So those are the main things that we’re working on. Does that make sense? 

HUIZINGA: That’s … yes, it does. And I imagine, you know, if you go to the website and you see those, kind of, three main areas, I imagine that even under there, there are specific projects on, you know, how then do we iterate? How then do we explore? 

HOSN: That’s right. That’s a good plug for people to visit the AI Frontiers website! Thank you, Gretchen! [LAUGHS] 

HUIZINGA: Well, I’ve been intrigued for a while by this idea of what you’ve called bi-directional enrichment, which represents both how research informs product but also how product informs research, but you’ve recently talked about how this idea has expanded to embrace what you call multi-directional enrichment and co-innovation. So what do you mean by that, and what does it look like for you? 

HOSN: So we talked just moments ago about how the time has shrunk tremendously in artificial intelligence and the speed at which innovations are coming out. So what does that mean when you are sitting in research and you’re trying to derive value for Microsoft, for example? It means that now, rather than going on a journey to try out, you know, different things, what you want is for product to come on a co-innovation journey with you. And not every team has the capability or the time or the resources to do it. But sometimes product teams have applied scientists that are asking themselves very similar questions. And so now we have this huge synergistic effect by which, you know, researchers can come and explore their research but anchor them in a real-world scenario that the product team is, you know, asking themselves about. And that’s what I mean by co-innovation. And we look for co-innovation, so these are product teams or applied scientists in product teams that are not looking at something I can ship tomorrow. Because that’s not … that’s not frontiers. That’s feature-function that they can deliver right now to their customers. When we co-innovate, we have to co-innovate on a bit of a longer timespan. Now it’s no longer years, right? With generative AI, everything is months, but nonetheless, this is not next week. This is in a few months. And so … but this is really, really great because, again, I keep saying this and I have maybe a huge bias, but I do believe that research, without it being anchored in real-world scenario, just doesn’t have the same effect. So I have a bias for that. It’s my PM hat, what can I say? I love real-world scenarios! [LAUGHTER] 

HUIZINGA: What you just referred to is an interesting flow. I’ve noticed in my years doing this podcast that some people that started in research ended up over in product—and we’ll call them embedded researchers, if you will—and then some people that were in a product scenario come back over to research. And so, there’s this flow, multi-directional, bi-directional, and also where they’re placed within the company. How do you see that flow and the value of that flow between these organizations? 

HOSN: Yeah, you know, like, I think that the flow is important because that’s how cross-pollination happens. And you talked about brilliant minds. In product teams, there are brilliant minds, as well, right. And although their focus area is more around the product they live and breathe every day, this is enriching to researchers and continues to be enriching because when you deploy research capabilities in a real-world setting, there are surprising new research questions that come up, not just engineering. A lot of times people think of research, OK, yeah, you scale it, you harden it, you secure it, and it’s good to go. But that’s not always the case. In a lot of cases, because of the interactivity that happens with real-world scenarios, it opened up brand-new paths for research. And so I think that flow continues to happen even now. It’s just compressed. It’s just that researchers are no longer thinking six years. Researchers are thinking three months. Like, what am I going to do in three months? Because in three months, there will be a hundred other researchers that are coming up with innovation on the same question. So I think the flow still exists. I think that time has shrunk. And I think the mobility from researchers and research going to product and vice versa is enriching for the people that do it because you gain different perspectives. 

HUIZINGA: Well, and let’s push in even there a little bit. Researchers like everyone else can get comfortable looking at things through a particular lens. I would say that’s a human trait, not just a research trait … 

HOSN: Absolutely. 

HUIZINGA: … until a disruption challenges their status quo. So you’ve talked about LLMs, which we’ve called large language models, as being a good forcing function for researchers to think differently, even about the questions they’re asking. Can you elaborate on that a little bit? 

HOSN: Yeah, yeah, so, you know, the large language models and this disruption that we are living in at the moment is lighting fire underneath a lot of people’s intellect, I’m going to say. And so I think that people have to adapt quickly to change. And this is key. Adaptability, I believe, is just a key ingredient in doing research nowadays. Why? Because a lot of people are thinking directionally the same. And so, you know, if you’re not the first, you’re going to have to adapt to what came out. And then you have to think of, how do I differentiate? So the second point I would say is differentiation. And this mindset of, you know, how do I adapt to what just came out? How do I differentiate? And then—Rafah’s bias—how do I anchor in real-world scenario? This is the home run. And I would say you package all of this and focus, focus, focus … and you get a gold mine. 

HUIZINGA: I’m hearing “yes, and …” in this response in the sense of not everyone’s going to be first, but then, what else? This is back to the frontiers. It’s like, how do I differentiate? Yes, that’s awesome. And we’ve got this … 

HOSN: Exactly. And how do I build on what has just been discovered and give it a little bit of an edge or push it a little further or take it in a brand-new direction? I mean, so many different possibilities, but it does take adaptability, like a flexibility in the mindset, I would say. 

HUIZINGA: Yeah. Well, let’s go back to what you alluded to earlier, this idea of responsible AI. This is a big deal at Microsoft. And researchers are very thoughtful about the question of what could possibly go wrong if we got everything right. But how does that translate practically, and what concrete steps are you taking at what I’ll call the “frontier of responsibility?” 

HOSN: Yeah, and as I mentioned, you know, being at the frontiers is amazing. It also holds a big responsibility. We have so many different, I would say, checks and balances that we use, in model training and fine-tuning, to ensure that we are on top of all the regulatory, the policymaker suggestions, and we are abiding by Microsoft values first and foremost and responsibility in creating these innovations. So practically and tactically, what happens is that there are processes for how you actually even release any type of model. And this is just research. And when it goes to product, they have their own compliance, you know, a stricter even compliance, I would say, process that they go through. So we try, and I try particularly, to partner with our privacy champions, with our legal champions, with our people that are looking at this from a responsible AI perspective, so that we bring them in early on, and we say, hey, we’re thinking of doing this. And they tell us, well, you know, if you’re thinking about it this way, you might want to consider this. So we’re trying to bring them in as early as possible so that also we don’t go all the way and then we discover we did something wrong, so we have to backtrack. So I would say, you know, having these partners and colleagues come in early in the game just saves everybody a lot of time. And all this responsible AI for us, it’s ingrained with how we work, meaning we bring our champions early on and then we have them advise us as we move along the journey to create these innovations. So by the time we’re done, we know we’re good, right. And even by the time we’re done, we recheck everything, we run a lot of evaluation benchmarks, and, you know, we do the right thing per policies at Microsoft. So we take it very, very seriously. 

HUIZINGA: Well, let’s go back to this idea of research horizons for a second and anchor it in the way that we approach research. So many ideas are basically iterative steps on existing work, and they make a lot of sense … this is the next step … but then there are those out-of-the-box ideas that feel like maybe bigger swings—some might even call them outrageous—and in organizations like Microsoft Research, they might get the green light, too. Where do you find this idea of the outrageous or maybe longer-term idea finding a home or a place in an organization like Microsoft Research, and have you ever worked on something that felt outrageous to you? 

HOSN: Umm, you know, we like outrageous! That’s why we’re in research, right? So outrageous is good. I haven’t, to be honest, worked on an outrageous, but I am confident I will be. So … [LAUGHTER] I just have this belief that in AI Frontiers, we are going to have outrageous ideas, and we’re going to work on them, and we’re going to make bets that basically are hard to make in other parts of the company because we have the privilege of taking them and pursuing them. And, yes, they may fail, but if we have a breakthrough, it will be a significant breakthrough. So, so I think that outrageous is good. We need to think big. We need to take big leaps, big ideas. We also need to know how to fail gracefully and pivot fast! 

HUIZINGA: Hmmm. Mmm. You know, it strikes me, and I’m laughing to myself, it strikes me, even as we’re talking, that the idea that you work in AI Frontiers, that’s outrageous to most people and, and it’s normal to you. So maybe this idea of, “I haven’t worked on anything outrageous” is like, no, you live in outrageous, it just doesn’t seem like it! [LAUGHTER] 

HOSN: Maybe. It’s my day-to-day job, so, yes, I guess you’re right. 

HUIZINGA: Right. I mean, yeah, you say, we love outrageous, and that’s where it is right now. Every day that I follow, sort of, AI Twitter also and find myself going, seriously? That happened yesterday? What next? 

HOSN: Yeah, in two hours, there’ll be yet another thing. So, yeah, I guess I am living in outrageous, and I love it! It’s amazing! [LAUGHS] 

HUIZINGA: Yeah, maybe the idea of outrageous is just changed. 

HOSN: You know, you’re so right. I think that it’s become the norm. And it is, once we anchor in generative AI, and we push further on this idea, maybe we will go back in a cycle where outrageous is outrageous, but today it’s our life. It’s where we live. It’s what we breathe every day. So it’s become a norm. 

HUIZINGA: Yeah. Well, as we close, Rafah, I want to ask a question anchored on the big idea behind AI Frontiers. What do you believe might be true in say 10 to 15 years, and what should we be doing about it now? In other words, how does what we believe about the future influence how we conceptualize and execute on ideas today? 

HOSN: Yeah, you know, it’s … I can’t even predict what I’m going to be doing tomorrow! But … [LAUGHTER] here’s, here’s what I think. I think that we are truly approaching a moment in human history where a lot of unsurmountable problems, like very hard-to-tackle diseases that have been so hard, I think we are approaching a moment, you know, soon, I hope it’s even sooner than 10 years, where generative AI and innovations on top of it could lead to a lot of resolution for things that today … that cause unsurmountable pain and suffering. I’m very hopeful that with what we are creating that we can, you know, take inefficiencies out of so many different things that we see today that take time so that we liberate ourselves to think about the “what next” societally, right? I think what we need to be doing right now, to be honest, to influence the future is think about our curricula. What are we going to teach our kids? What are they going to work in? This is where I’m hoping that we pour some of our creativity, education system. How are we preparing the next generation? What are the paths that we are going to forge for them, knowing what we know today, knowing what this technology can bring forth? So my hope is that we put some brain power into that. 

HUIZINGA: Rafah Hosn, it’s always a pleasure to talk to you. A sincere pleasure, a delight. Thanks for joining us today on Ideas.

[MUSIC PLAYS] 

HOSN: Thank you so much for having me, Gretchen. 

[MUSIC FADES]

The post Ideas: Exploring AI frontiers with Rafah Hosn appeared first on Microsoft Research.

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7× Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7…Apple Machine Learning Research

Enhance conversational AI with advanced routing techniques with Amazon Bedrock

Conversational artificial intelligence (AI) assistants are engineered to provide precise, real-time responses through intelligent routing of queries to the most suitable AI functions. With AWS generative AI services like Amazon Bedrock, developers can create systems that expertly manage and respond to user requests. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon using a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

This post assesses two primary approaches for developing AI assistants: using managed services such as Agents for Amazon Bedrock, and employing open source technologies like LangChain. We explore the advantages and challenges of each, so you can choose the most suitable path for your needs.

What is an AI assistant?

An AI assistant is an intelligent system that understands natural language queries and interacts with various tools, data sources, and APIs to perform tasks or retrieve information on behalf of the user. Effective AI assistants possess the following key capabilities:

  • Natural language processing (NLP) and conversational flow
  • Knowledge base integration and semantic searches to understand and retrieve relevant information based on the nuances of conversation context
  • Running tasks, such as database queries and custom AWS Lambda functions
  • Handling specialized conversations and user requests

We demonstrate the benefits of AI assistants using Internet of Things (IoT) device management as an example. In this use case, AI can help technicians manage machinery efficiently with commands that fetch data or automate tasks, streamlining operations in manufacturing.

Agents for Amazon Bedrock approach

Agents for Amazon Bedrock allows you to build generative AI applications that can run multi-step tasks across a company’s systems and data sources. It offers the following key capabilities:

  • Automatic prompt creation from instructions, API details, and data source information, saving weeks of prompt engineering effort
  • Retrieval Augmented Generation (RAG) to securely connect agents to a company’s data sources and provide relevant responses
  • Orchestration and running of multi-step tasks by breaking down requests into logical sequences and calling necessary APIs
  • Visibility into the agent’s reasoning through a chain-of-thought (CoT) trace, allowing troubleshooting and steering of model behavior
  • Prompt engineering abilities to modify the automatically generated prompt template for enhanced control over agents

You can use Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock to build and deploy AI assistants for complex routing use cases. They provide a strategic advantage for developers and organizations by simplifying infrastructure management, enhancing scalability, improving security, and reducing undifferentiated heavy lifting. They also allow for simpler application layer code because the routing logic, vectorization, and memory are fully managed.

Solution overview

This solution introduces a conversational AI assistant tailored for IoT device management and operations, using Anthropic’s Claude v2.1 on Amazon Bedrock. The AI assistant’s core functionality is governed by a comprehensive set of instructions, known as a system prompt, which delineates its capabilities and areas of expertise. This guidance makes sure the AI assistant can handle a wide range of tasks, from managing device information to running operational commands.

"""The following is the system prompt that outlines the full scope of the AI assistant's capabilities:
You are an IoT Ops agent that handles the following activities:
- Looking up IoT device information
- Checking IoT operating metrics (historical data)
- Performing actions on a device-by-device ID
- Answering general questions
You can check device information (Device ID, Features, Technical Specifications, Installation Guide, Maintenance and Troubleshooting, Safety Guidelines, Warranty, and Support) from the "IotDeviceSpecs" knowledge base.
Additionally, you can access device historical data or device metrics. The device metrics are stored in an Athena DB named "iot_ops_glue_db" in a table named "iot_device_metrics". 
The table schema includes fields for oil level, temperature, pressure, received_at timestamp, and device_id.
The available actions you can perform on the devices include start, shutdown, and reboot."""

Equipped with these capabilities, as detailed in the system prompt, the AI assistant follows a structured workflow to address user questions. The following figure provides a visual representation of this workflow, illustrating each step from initial user interaction to the final response.

Workflow of the AI assistant, from initial user interaction to the final response.

The workflow is composed of the following steps:

  1. The process begins when a user requests the assistant to perform a task; for example, asking for the maximum data points for a specific IoT device device_xxx. This text input is captured and sent to the AI assistant.
  2. The AI assistant interprets the user’s text input. It uses the provided conversation history, action groups, and knowledge bases to understand the context and determine the necessary tasks.
  3. After the user’s intent is parsed and understood, the AI assistant defines tasks. This is based on the instructions that are interpreted by the assistant as per the system prompt and user’s input.
  4. The tasks are then run through a series of API calls. This is done using ReAct prompting, which breaks down the task into a series of steps that are processed sequentially:
    1. For device metrics checks, we use the check-device-metrics action group, which involves an API call to Lambda functions that then query Amazon Athena for the requested data.
    2. For direct device actions like start, shutdown, or reboot, we use the action-on-device action group, which invokes a Lambda function. This function initiates a process that sends commands to the IoT device. For this post, the Lambda function sends notifications using Amazon Simple Email Service (Amazon SES).
    3. We use Knowledge Bases for Amazon Bedrock to fetch from historical data stored as embeddings in the Amazon OpenSearch Service vector database.
  5. After the tasks are complete, the final response is generated by the Amazon Bedrock FM and conveyed back to the user.
  6. Agents for Amazon Bedrock automatically stores information using a stateful session to maintain the same conversation. The state is deleted after a configurable idle timeout elapses.
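
The device-action branch of step 4 can be sketched as a small Lambda handler. This is a minimal sketch, assuming the event and response shape that Agents for Amazon Bedrock uses for action groups defined with an API schema (a `parameters` list on the way in, a `messageVersion`/`response` envelope on the way out). The `device_id` and `action` parameter names and the `send_command` helper are hypothetical and are not taken from the solution code.

```python
import json

ALLOWED_ACTIONS = {"start", "shutdown", "reboot"}  # actions listed in the system prompt


def send_command(device_id: str, action: str) -> str:
    """Hypothetical stand-in for the real side effect (the post sends an email via Amazon SES)."""
    return f"Requested '{action}' on device {device_id}"


def lambda_handler(event, context):
    # Parameters arrive as a list of {"name", "type", "value"} entries.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    device_id = params.get("device_id")
    action = (params.get("action") or "").lower()

    if device_id and action in ALLOWED_ACTIONS:
        status, body = 200, {"message": send_command(device_id, action)}
    else:
        allowed = ", ".join(sorted(ALLOWED_ACTIONS))
        status, body = 400, {"error": f"Expected device_id and one of: {allowed}"}

    # Echo the routing fields back so the agent can match the response to its call.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": event.get("apiPath"),
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": status,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }
```

Restricting the action to an allow-list keeps the handler from acting on anything the model might produce beyond the three actions named in the system prompt.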

Technical overview

The following diagram illustrates the architecture to deploy an AI assistant with Agents for Amazon Bedrock.

Architecture diagram to deploy an AI assistant with Agents for Amazon Bedrock.

It consists of the following key components:

  • Conversational interface – The conversational interface uses Streamlit, an open source Python library that simplifies the creation of custom, visually appealing web apps for machine learning (ML) and data science. It is hosted on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate, and it is accessed using an Application Load Balancer. You can use Fargate with Amazon ECS to run containers without having to manage servers, clusters, or virtual machines.
  • Agents for Amazon Bedrock – Agents for Amazon Bedrock completes the user queries through a series of reasoning steps and corresponding actions based on ReAct prompting:
    • Knowledge Bases for Amazon Bedrock – Knowledge Bases for Amazon Bedrock provides fully managed RAG to supply the AI assistant with access to your data. In our use case, we uploaded device specifications into an Amazon Simple Storage Service (Amazon S3) bucket. It serves as the data source to the knowledge base.
    • Action groups – These are defined API schemas that invoke specific Lambda functions to interact with IoT devices and other AWS services.
    • Anthropic Claude v2.1 on Amazon Bedrock – This model interprets user queries and orchestrates the flow of tasks.
    • Amazon Titan Embeddings – This model serves as a text embeddings model, transforming natural language text—from single words to complex documents—into numerical vectors. This enables vector search capabilities, allowing the system to semantically match user queries with the most relevant knowledge base entries for effective search.
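
To illustrate what matching a query against knowledge base entries by vector means, here is a toy sketch using cosine similarity, one common similarity measure. The three-dimensional vectors and the entry texts are made up for illustration. Real embeddings have many more dimensions, and the managed vector search may use a different metric or an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec, entries):
    """Return the (text, vector) entry whose vector is most similar to the query."""
    return max(entries, key=lambda entry: cosine_similarity(query_vec, entry[1]))


# Made-up vectors standing in for embeddings of two knowledge base entries.
knowledge_base = [
    ("Installation guide for the pump", [0.9, 0.1, 0.0]),
    ("Warranty and support terms", [0.1, 0.9, 0.2]),
]
query = [0.8, 0.2, 0.1]  # pretend embedding of "How do I install the pump?"

best_text, _ = top_match(query, knowledge_base)
print(best_text)
```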

The solution is integrated with AWS services such as Lambda for running code in response to API calls, Athena for querying datasets, OpenSearch Service for searching through knowledge bases, and Amazon S3 for storage. These services work together to provide a seamless experience for IoT device operations management through natural language commands.
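
As an illustration of the Athena part, the following sketch builds the kind of query a metrics Lambda function might run for a request such as "max metrics for device X". The database and table names come from the system prompt. The column names (`oil_level`, `temperature`, `pressure`, `device_id`), the assumption that `device_id` is stored as a string, and the function name are guesses based on the schema description, not the solution's actual code.

```python
import re

DATABASE = "iot_ops_glue_db"   # named in the system prompt
TABLE = "iot_device_metrics"   # named in the system prompt
METRICS = ("oil_level", "temperature", "pressure")  # assumed column names


def build_max_metrics_query(device_id: str) -> str:
    """Build a SELECT that returns the maximum of each metric for one device."""
    # The device ID originates from model-parsed user text, so reject anything
    # that is not a plain identifier before interpolating it into SQL.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", device_id):
        raise ValueError(f"Unexpected device id: {device_id!r}")
    columns = ", ".join(f"MAX({m}) AS max_{m}" for m in METRICS)
    return (
        f'SELECT {columns} FROM "{DATABASE}"."{TABLE}" '
        f"WHERE device_id = '{device_id}'"
    )


print(build_max_metrics_query("1009"))
```

Validating the identifier up front is a deliberate choice here: text that passed through an LLM should be treated like any other untrusted input.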

Benefits

This solution offers the following benefits:

  • Implementation complexity:
    • Fewer lines of code are required, because Agents for Amazon Bedrock abstracts away much of the underlying complexity, reducing development effort
    • Managing vector databases like OpenSearch Service is simplified, because Knowledge Bases for Amazon Bedrock handles vectorization and storage
    • Integration with various AWS services is more streamlined through pre-defined action groups
  • Developer experience:
    • The Amazon Bedrock console provides a user-friendly interface for prompt development, testing, and root cause analysis (RCA), enhancing the overall developer experience
  • Agility and flexibility:
    • Agents for Amazon Bedrock allows for seamless upgrades to newer FMs (such as Claude 3.0) when they become available, so your solution stays up to date with the latest advancements
    • Service quotas and limitations are managed by AWS, reducing the overhead of monitoring and scaling infrastructure
  • Security:
    • Amazon Bedrock is a fully managed service, adhering to AWS’s stringent security and compliance standards, potentially simplifying organizational security reviews

Although Agents for Amazon Bedrock offers a streamlined and managed solution for building conversational AI applications, some organizations may prefer an open source approach. In such cases, you can use frameworks like LangChain, which we discuss in the next section.

LangChain dynamic routing approach

LangChain is an open source framework that simplifies building conversational AI by allowing the integration of large language models (LLMs) and dynamic routing capabilities. With LangChain Expression Language (LCEL), you can define the routing and create non-deterministic chains where the output of a previous step determines the next step. Routing helps provide structure and consistency in interactions with LLMs.

For this post, we use the same example as the AI assistant for IoT device management. However, the main difference is that we need to handle the system prompts separately and treat each chain as a separate entity. The routing chain decides the destination chain based on the user’s input. The decision is made with the support of an LLM by passing the system prompt, chat history, and user’s question.

Solution overview

The following diagram illustrates the dynamic routing solution workflow.

Dynamic routing solution workflow with LangChain

The workflow consists of the following steps:

  1. The user presents a question to the AI assistant. For example, “What are the max metrics for device 1009?”
  2. An LLM evaluates each question along with the chat history from the same session to determine its nature and which subject area it falls under (such as SQL, action, search, or SME). The LLM classifies the input and the LCEL routing chain takes that input.
  3. The router chain selects the destination chain based on the input, and the LLM is provided with the following system prompt:
"""Given the user question below, classify it as one of the candidate prompts. You may want to modify the input considering the chat history and the context of the question. 
Sometimes the user may just assume that you have the context of the conversation and may not provide a clear input. Hence, you are being provided with the chat history for more context. 
Respond with only a Markdown code snippet containing a JSON object formatted EXACTLY as specified below. 
Do not provide an explanation to your classification beside the Markdown, I just need to know your decision on which destination and next_inputs
<candidate prompt>
physics: Good for answering questions about physics
sql: Good for querying sql from AWS Athena. User input may look like: get me max or min for device x?
lambdachain: Good to execute actions with AWS Lambda like shutting down a device or turning off an engine. User input can be like, shutdown device x, or terminate process y, etc.
rag: Good to search knowledgebase and retrieve information about devices and other related information. User question can be like: what do you know about device x?
default: if the input is not well suited for any of the candidate prompts above. this could be used to carry on the conversation and respond to queries like provide a summary of the conversation
</candidate prompt>"""

The LLM evaluates the user’s question along with the chat history to determine the nature of the query and which subject area it falls under. The LLM then classifies the input and outputs a JSON response in the following format:

<Markdown>
```json
{{
"destination": string  name of the prompt to use
"next_inputs": string  a potentially modified version of the original input
}}
```
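
Before the router can act on this response, the application has to pull the JSON object out of the Markdown snippet. A minimal sketch of that parsing step follows, assuming the model returns single braces (the doubled braces above are most likely prompt-template escapes). The function name and the fallback to `default` for unrecognized destinations are illustrative choices, not the solution's code.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this example readable
KNOWN_DESTINATIONS = {"physics", "sql", "lambdachain", "rag", "default"}  # from the router prompt

# Matches a fenced block (optionally tagged json) that wraps one JSON object.
_PATTERN = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(\{.*?\})\s*" + re.escape(FENCE),
    re.DOTALL,
)


def parse_router_output(text: str) -> dict:
    """Extract destination and next_inputs from the router model's reply."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("No fenced JSON object found in router output")
    decision = json.loads(match.group(1))
    destination = str(decision.get("destination", "default")).lower()
    if destination not in KNOWN_DESTINATIONS:
        destination = "default"  # unknown labels fall back to the default chain
    return {"destination": destination, "next_inputs": decision.get("next_inputs", "")}


reply = (
    FENCE + "json\n"
    + '{"destination": "sql", "next_inputs": "max metrics for device 1009"}\n'
    + FENCE
)
print(parse_router_output(reply))
```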

  4. The router chain uses this JSON response to invoke the corresponding destination chain. There are four subject-specific destination chains, each with its own system prompt:
    1. SQL-related queries are sent to the SQL destination chain for database interactions. You can use LCEL to build the SQL chain.
    2. Action-oriented questions invoke the custom Lambda destination chain for running operations. With LCEL, you can define your own custom function; in our case, it’s a function to run a predefined Lambda function to send an email with a device ID parsed. Example user input might be “Shut down device 1009.”
    3. Search-focused inquiries proceed to the RAG destination chain for information retrieval.
    4. SME-related questions go to the SME/expert destination chain for specialized insights.
  5. Each destination chain takes the input and runs the necessary models or functions:
    1. The SQL chain uses Athena for running queries.
    2. The RAG chain uses OpenSearch Service for semantic search.
    3. The custom Lambda chain runs Lambda functions for actions.
    4. The SME/expert chain provides insights using the Amazon Bedrock model.
  6. Responses from each destination chain are formulated into coherent insights by the LLM. These insights are then delivered to the user, completing the query cycle.
  7. User input and responses are stored in Amazon DynamoDB to provide context to the LLM for the current session and from past interactions. The duration of persisted information in DynamoDB is controlled by the application.
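
The dispatch and history steps above can be sketched in plain Python, independent of LangChain. Every chain below is a hypothetical stand-in that only labels its input. The destination names follow the router prompt, the `physics` candidate is assumed to play the SME/expert role, and an in-memory dictionary stands in for the DynamoDB session store.

```python
from collections import defaultdict


# Hypothetical stand-ins for the destination chains: each takes text, returns text.
def sql_chain(question: str) -> str:
    return f"[sql] {question}"


def lambda_chain(question: str) -> str:
    return f"[lambda] {question}"


def rag_chain(question: str) -> str:
    return f"[rag] {question}"


def sme_chain(question: str) -> str:
    return f"[sme] {question}"


def default_chain(question: str) -> str:
    return f"[default] {question}"


DESTINATIONS = {
    "sql": sql_chain,
    "lambdachain": lambda_chain,
    "rag": rag_chain,
    "physics": sme_chain,  # assumption: this candidate is the SME/expert chain
}

chat_history = defaultdict(list)  # in-memory stand-in for the DynamoDB table


def route(session_id: str, decision: dict) -> str:
    """Run the chain named by the router decision and record the exchange."""
    handler = DESTINATIONS.get(decision["destination"], default_chain)
    answer = handler(decision["next_inputs"])
    chat_history[session_id].append((decision["next_inputs"], answer))
    return answer


print(route("session-1", {"destination": "sql", "next_inputs": "max metrics for device 1009"}))
```

A lookup table with a default handler keeps the routing logic in one place, so adding a destination means adding one entry rather than another branch.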

Technical overview

The following diagram illustrates the architecture of the LangChain dynamic routing solution.

Architecture diagram of the LangChain dynamic routing solution

The web application is built on Streamlit hosted on Amazon ECS with Fargate, and it is accessed using an Application Load Balancer. We use Anthropic’s Claude v2.1 on Amazon Bedrock as our LLM. The web application interacts with the model using LangChain libraries. It also interacts with a variety of other AWS services, such as OpenSearch Service, Athena, and DynamoDB to fulfill end-users’ needs.

Benefits

This solution offers the following benefits:

  • Implementation complexity:
    • Although it requires more code and custom development, LangChain provides greater flexibility and control over the routing logic and integration with various components.
    • Managing vector databases like OpenSearch Service requires additional setup and configuration efforts. The vectorization process is implemented in code.
    • Integrating with AWS services may involve more custom code and configuration.
  • Developer experience:
    • LangChain’s Python-based approach and extensive documentation can be appealing to developers already familiar with Python and open source tools.
    • Prompt development and debugging may require more manual effort compared to using the Amazon Bedrock console.
  • Agility and flexibility:
    • LangChain supports a wide range of LLMs, allowing you to switch between different models or providers, fostering flexibility.
    • The open source nature of LangChain enables community-driven improvements and customizations.
  • Security:
    • As an open source framework, LangChain may require more rigorous security reviews and vetting within organizations, potentially adding overhead.

Conclusion

Conversational AI assistants are transformative tools for streamlining operations and enhancing user experiences. This post explored two powerful approaches using AWS services: the managed Agents for Amazon Bedrock and the flexible, open source LangChain dynamic routing. The choice between these approaches hinges on your organization’s requirements, development preferences, and desired level of customization. Regardless of the path taken, AWS empowers you to create intelligent AI assistants that revolutionize business and customer interactions.

Find the solution code and deployment assets in our GitHub repository, where you can follow the detailed steps for each conversational AI approach.


About the Authors

Ameer Hakme is an AWS Solutions Architect based in Pennsylvania. He collaborates with Independent Software Vendors (ISVs) in the Northeast region, assisting them in designing and building scalable and modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.

Sharon Li is an AI/ML Solutions Architect at Amazon Web Services based in Boston, with a passion for designing and building Generative AI applications on AWS. She collaborates with customers to leverage AWS AI/ML services for innovative solutions.

Kawsar Kamal is a senior solutions architect at Amazon Web Services with over 15 years of experience in the infrastructure automation and security space. He helps clients design and build scalable DevSecOps and AI/ML solutions in the Cloud.

Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering

The Amazon EU Design and Construction (Amazon D&C) team is the engineering team designing and constructing Amazon warehouses. The team navigates a large volume of documents and locates the right information to make sure the warehouse design meets the highest standards. In the post A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction, we presented a question answering bot solution using a Retrieval Augmented Generation (RAG) pipeline with a fine-tuned large language model (LLM) for Amazon D&C to efficiently retrieve accurate information from a large volume of unorganized documents, and provide timely and high-quality services in their construction projects. The Amazon D&C team implemented the solution in a pilot for Amazon engineers and collected user feedback.

In this post, we share how we analyzed the feedback data, identified the accuracy limitations and hallucinations of the RAG pipeline, and used the human evaluation scores to train the model through reinforcement learning. To increase training samples for better learning, we also used another LLM to generate feedback scores. This method addressed the RAG limitation and further improved the bot response quality. We present the reinforcement learning process and the benchmarking results to demonstrate the LLM performance improvement. The solution uses Amazon SageMaker JumpStart as the core service for model deployment, fine-tuning, and reinforcement learning.

Collect feedback from Amazon engineers in a pilot project

After developing the solution described in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction, the Amazon D&C team deployed the solution and ran a pilot project with Amazon engineers. The engineers accessed the pilot system through a web application developed with Streamlit, connected with the RAG pipeline. In the pipeline, we used Amazon OpenSearch Service for the vector database, and deployed a fine-tuned Mistral-7B-Instruct model on Amazon SageMaker.

One of the key objectives of the pilot is to collect feedback from Amazon engineers and use the feedback to further reduce LLM hallucinations. To achieve this, we developed a feedback collection module in the UI, as shown in the following figure, and stored the web session information and user feedback in Amazon DynamoDB. Through the feedback collection UI, Amazon engineers can select from five satisfaction levels: strongly disagree, disagree, neutral, agree, and strongly agree, corresponding to the feedback scores from 1–5. They can also provide a better answer to the question or comment on why the LLM response is not satisfactory.

human feedback UI

During the pilot, users provided 118 feedback responses. The following chart shows the distribution from negative to positive. We observed the incorrect answers marked by the negative feedback (disagree and strongly disagree) are 45% among the total feedback (53 out of 118). Some of the errors (about 10% of negative feedback and 7.5% of overall responses) can be addressed by user education and prompt engineering. Additionally, we can address the issue with the solution of LLM fine-tuning and reinforcement learning, described in the next section.

User feedback from pilot

Some users simply left a note, such as “Great!” for the strongly agree answers, or “Doesn’t answer the question,” for the strongly disagree answers. Other users provided scores and explained how they justify the LLM answers in their notes. In the following example, the user specifically provided the adequate document and content to correct the LLM hallucination.

Question How many car parking spaces do we have?
Bot Answer

Based on the given context, we don’t have enough information to determine the number of car parking spaces. The context discusses the minimum size of parking spaces and the requirements for external lighting levels, but it does not provide a specific number of parking spaces.

**You can refer to further explanations in the following resources:**

* ARS GEN 10.0/05.01.02. Design Criteria & Appendices/Performance Package AR Sortable Design Criteria v20.1.1.pdf – page: 10

* ARS GEN 10.0/05.01.02. Design Criteria & Appendices/Performance Package AR Sortable Design Criteria v20.1.1.pdf – page: 79

User score Strongly Disagree
User notes This is specified on page 21 of design criteria section 01 13 10

Improve bot response with supervised fine-tuning and reinforcement learning

The solution consists of three steps of fine-tuning:

  1. Conduct supervised fine-tuning using labeled data. This method was described in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction.
  2. Collect user feedback to label the question-answer pairs for further LLM tuning.
  3. When the training data is ready, further tune the model using reinforcement learning from human feedback (RLHF).

RLHF is widely used throughout generative artificial intelligence (AI) and LLM applications. It incorporates human feedback in the rewards function and trains the model with a reinforcement learning algorithm to maximize rewards, which makes the model perform tasks more aligned with human goals. The following diagram shows the pipeline of the steps.

Fine tuning workflow

We tested the methodology using the Amazon D&C documents with a Mistral-7B model on SageMaker JumpStart.

Supervised fine-tuning

In the previous post, we demonstrated how the fine-tuned Falcon-7B model outperforms the RAG pipeline and improves the quality and accuracy of QA bot response. For this post, we performed supervised fine-tuning on the Mistral-7B model. The supervised fine-tuning used the PEFT/LoRA technique (LoRA_r = 512, LoRA_alpha = 1024) on 436,207,616 parameters (5.68% of the total 7,677,964,288 parameters). The training was conducted on a p3.8x node with 137 samples synthetically generated by LLM and validated by humans; the process is well converged after 20 epochs, as shown in the following figure.

SFT training process

The fine-tuned model was validated by 274 samples, and the inference results were compared with the reference answers by the semantic similarity score. The score is 0.8100, which is higher than the score of 0.6419 from the traditional RAG.

Collect human and AI feedback for reinforcement learning

For RLHF, a sufficient amount of high-quality training samples labeled by subject matter experts (SMEs) are essential. However, poor-quality human labels will likely cause worse model performance than the original model after RLHF training. SMEs’ time is a scarce resource in any organization; reviewing hundreds or thousands of LLM responses and providing feedback requires a significant time investment from SMEs that may not have a clear return on investment.

To address this challenge, we adopted the reinforcement learning from AI feedback (RLAIF) method. RLAIF employs an AI assistant (another LLM) to provide evaluation scores, rather than from humans. In this hybrid learning approach, the learning agent refines the actions not only based on the interaction with a human but also from feedback provided by another AI model. It is much more scalable to provide sufficient training data for reinforcement learning, and much less subjective because the training is not dependent on particular preferences from a small group of SMEs. The SMEs may still be involved, but only to supervise the process and examine the AI feedback quality. This tremendously reduces the SMEs’ workload, because the AI-generated scores and labels are much less complicated for SMEs to analyze, filter, and sort.

In this use case, we provide the following prompt to an Anthropic Claude 2 model to generate the evaluation scores as the training samples for the reinforcement learning step:

"""

Human:

I am going to provide you with a question, a reference answer, and an machine generated answer to the question.

question: {question}

reference answer: {reference_answer}

machine_answer: {machine_answer}

Please analyze the question, the reference answer, and the machine generated answer to determine if the machine generated answer semantically matches the reference answer.

First think through your answer inside of <thinking> tags, then assign a score between 0.0 and 5.0 based on how well the machine answer matches the reference answer. Score 5.0 means a completely match, score 0.0 means a completely mismatch. Answer the score inside of <scoring> tags.

Also provide the reason to give this score inside of <reason> tags.

Assistant: <thinking>

"""

The Claude 2 model generates an evaluation score and a comment on why the score is given, as shown in the following example.

Question What is meant by ‘Fit Out Works’ in this context?
Reference Answer Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position. These Works include Amazon’s required Fit-Outs, big-box installations, and custom
QA Bot response Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position.
AI score 4.5
AI comment The machine answer provides the key information from the standard answer – that Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position. The additional details about including Amazon’s required Fit-Outs, big-box installations, and customizations are relevant and help provide further context, so I’ve scored the machine answer 4.5 out of 5. It captures the essence of the standard answer and provides some useful extra details.

Out of the 274 validation questions, the supervised fine-tuned model generated 159 responses that have AI scores greater than 4. We observed 60 answers with scores lower than 3; there is space to improve the overall response quality.

Feedback score before RLHF

The Amazon Engineering SMEs validated this AI feedback and acknowledged the benefits of using AI scores. Without AI feedback, the SMEs would need some time to review and analyze each LLM response to identify the cut-off answers and hallucinations, and to judge whether the LLM is returning correct contents and key concepts. AI feedback provides AI scores automatically and enables the SMEs to use filtering, sorting, and grouping to validate the scores and identify trends in the responses. This reduces the average SME’s review time by 80%.

Reinforcement learning from human and AI feedback

When the training samples are ready, we use the proximal policy optimization (PPO) algorithm to perform reinforcement learning. PPO uses a policy gradient method, which takes small steps to update the policy in the learning process, so that the learning agents can reliably reach the optimal policy network. This makes the training process more stable and reduces the possibility of divergence.

During the training, first we use the human- and AI-labeled data to build a reward model, which will be used guide the weights update in the learning process. For this use case, we select a distilroberta-base reward model and train it by samples in the following format:

[Instruction, Chosen_response, Rejected_response]

The following is an example of a training record.

Instruction According to the context, what is specified for inclusive and accessible design?
Chosen_response BREEAM Credit HEA06 – inclusive and accessible design – The building is designed to be fit for purpose, appropriate and accessible by all potential users. An access strategy is developed in line with the BREEAM Check list A3
Rejected_response The context states that

The reward model is trained with the learning rate 1e-5. As shown in the following chart, the training well converges after 10 epochs.

RLHF training process

Then the reward model is used for reinforcement learning. For this use case, we use PEFT/LoRA to accelerate the learning. The LoRA_r is 32 and LoRA_alpha is 64. The PPO-based learning is on a rate of 1.41e-5 in three epochs. Reinforcement training is highly volatile and very sensitive to the label data and feedback data. We use the reward_mean metric, the average reward obtained from the environment by specifically monitoring the reward model, to observe the learning process.

After RLAIF, we can see the number of bot responses with positive scores (above 3) increase from 78.1% to 85.5%, and those with negative scores (below 3) decrease from 21.9% to 14.5%. In total, the average AI feedback score increases up to 8%, from 3.9 to 4.2.

Improvement by RLAIF

You can see the improvement in the following example.

Question For floor areas being tiled, what screed is needed?
Reference Concrete Screed flooring is applicable only in floor-tiled areas. Refer to drawing series A-32 to see where floor-tiles are applicable.
Bot response before RLAIF Concrete Screed flooring is applicable only in
AI score before RLAIF 3
Bot response after RLAIF Concrete Screed flooring is applicable only in areas being tiled. Refer to drawing series A-32 to see where tile flooring is applicable.
AI score after RLAIF 4.5

Conclusion

This post is a continuation of our work with the Amazon EU Design and Construction team as discussed in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction. In this post, we showed how we generated human and AI feedback data to fine-tune the Mistral-7B model with reinforcement learning. The model after RLAIF provided better performance for Amazon Engineering’s question answering bot, improved the AI feedback score by 8%. In the Amazon D&C team’s pilot project, using RLAIF reduced the validation workload for SMEs by an estimated 80%. As the next step, we will scale up this solution by connecting with Amazon Engineering’s data infrastructure, and design a framework to automate the continuous learning process with a human in the loop. We will also further improve the AI feedback quality by tuning the prompt template.

Through this process, we learned how to further improve the quality and performance of question answering tasks through RLHF and RLAIF.

  • Human validation and augmentation are essential to provide accurate and responsible outputs from LLM. The human feedback can be used in RLHF to further improve the model response.
  • RLAIF automates the evaluation and learning cycle. The AI-generated feedback is less subjective because it doesn’t depend on a particular preference from a small pool of SMEs.
  • RLAIF is more scalable to improve the bot quality through continued reinforcement learning while minimizing the efforts required from SMEs. It is especially useful for developing domain-specific generative AI solutions within large organizations.
  • This process should be done on a regular basis, especially when new domain data is available to be covered by the solution.

In this use case, we used SageMaker JumpStart to test multiple LLMs and experiment with multiple LLM training approaches. It significantly accelerates the AI feedback and learning cycle with maximized efficiency and quality. For your own project, you can introduce the human-in-the-loop approach to collect your users’ feedback, or generate AI feedback using another LLM. Then you can follow the three-step process defined in this post to fine-tune your models using RLHF and RLAIF. We recommend experimenting with the methods using SageMaker JumpStart to speed up the process.


About the Author

YunfeiYunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Elad_photoElad Dwek is a Construction Technology Manager at Amazon. With a background in construction and project management, Elad helps teams adopt new technologies and data-based processes to deliver construction projects. He identifies needs and solutions, and facilitates the development of the bespoke attributes. Elad has an MBA and a BSc in Structural Engineering. Outside of work, Elad enjoys yoga, woodworking, and traveling with his family.

Luca_photoLuca Cerabone is a Business Intelligence Engineer at Amazon. Drawing from his background in data science and analytics, Luca crafts tailored technical solutions to meet the unique needs of his customers, driving them towards more sustainable and scalable processes. Armed with an MSc in Data Science, Luca enjoys engaging in DIY projects, gardening and experimenting with culinary delights in his leisure moments.

Read More

Improve accuracy of Amazon Rekognition Face Search with user vectors

Improve accuracy of Amazon Rekognition Face Search with user vectors

In various industries, such as financial services, telecommunications, and healthcare, customers use a digital identity process, which usually involves several steps to verify end-users during online onboarding or step-up authentication. An example of one step that can be used is face search, which can help determine whether a new end-user’s face matches those associated with an existing account.

Building an accurate face search system involves several steps. The system must be able to detect human faces in images, extract the faces into vector representations, store face vectors in a database, and compare new faces against existing entries. Amazon Rekognition makes this effortless by giving you pre-trained models that are invoked via simple API calls.

Amazon Rekognition enables you to achieve very high face search accuracy with a single face image. In some cases, you can use multiple images of the same person’s face to create user vectors and improve accuracy even further. This is especially helpful when images have variations in lighting, poses, and appearances.

In this post, we demonstrate how to use the Amazon Rekognition Face Search APIs with user vectors to increase the similarity score for true matches and decrease the similarity score for true non-matches.

We compare the results of performing face matching with and without user vectors.

Amazon Rekognition face matching

Amazon Rekognition face matching enables measuring the similarity of a face vector extracted from one image to a face vector extracted from another image. A pair of face images is said to be a true match if both images contain the face of the same person, and a true non-match otherwise. Amazon Rekognition returns a score for the similarity of the source and target faces. The minimum similarity score is 0, implying very little similarity, and the maximum is 100.

For comparing a source face with a collection of target faces (1:N matching), Amazon Rekognition allows you to create a Collection object and populate it with faces from images using API calls.

When adding a face to a collection, Amazon Rekognition doesn’t store the actual image of the face but rather the face vector, a mathematical representation of the face. With the SearchFaces API, you can compare a source face with one or several collections of target faces.

In June 2023, AWS launched user vectors, a new capability that significantly improves face search accuracy by using multiple face images of a user. Now, you can create user vectors, which aggregate multiple face vectors of the same user. User vectors offer higher face search accuracy with more robust depictions, because they contain varying degrees of lighting, sharpness, pose, appearance, and more. This improves the accuracy compared to searching against individual face vectors.

In the following sections, we outline the process of using Amazon Rekognition user vectors. We guide you through creating a collection, storing face vectors in that collection, aggregating those face vectors into user vectors, and then comparing the results of searching against those individual face vectors and user vectors.

Solution overview

For this solution, we use an Amazon Rekognition collection of users, each with its associated indexed face vectors from a number of different images of faces for each user.

Let’s look at the workflow to build a collection with users and faces:

  1. Create an Amazon Rekognition collection.
  2. For each user, create a user in the collection.
  3. For each image of the user, add the face to the collection (IndexFaces, which returns face ID corresponding to each face vector).
  4. Associate all indexed face IDs with the user (this is necessary for user vectors).

Then, we will compare the following workflows:

Searching with a new given input image against individual face vectors in our collection:

  1. Get all faces from an image (DetectFaces).
  2. For each face, compare against individual faces in our collection (SearchFacesByImage).

Searching with a new given input image against user vectors in our collection:

  1. Get all faces from an image (DetectFaces).
  2. For each face, compare to the user vector (SearchUsersByImage).

Now let’s describe the solution in details.

Prerequisites

Add the following policy to your AWS Identity and Access Management (IAM) user or role. The policy grants you permission to the relevant Amazon Rekognition APIs and allows access to an Amazon Simple Storage Service (Amazon S3) bucket to store the images:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RekognitionPermissions",
            "Effect": "Allow",
            "Action": [
                "rekognition:CreateCollection",
                "rekognition:DeleteCollection",
                "rekognition:CreateUser",
                "rekognition:IndexFaces",
                "rekognition:DetectFaces",
                "rekognition:AssociateFaces",
                "rekognition:SearchUsersByImage",
                "rekognition:SearchFacesByImage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3BucketPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<replace_with_your_bucket>/*",
                "arn:aws:s3:::<replace_with_your_bucket>"
            ]
        }
    ]
}

Create an Amazon Rekognition collection and add users and faces

First, we create an S3 bucket to store users’ images. We organize the bucket by creating a folder for each user that contains their personal images. Our images folder looks like the following structure:

── images
│   ├── photo.jpeg
│   ├── Swami
│   │   ├── Swami1.jpeg
│   │   └── Swami2.jpeg
│   └── Werner
│       ├── Werner1.jpeg
│       ├── Werner2.jpeg
│       └── Werner3.jpeg

Our S3 bucket has a directory for each user that stores their images. There are currently two folders, and each contains several images. You can add more folders for your users, each containing one or more images to be indexed.

Next, we create our Amazon Rekognition collection. We have supplied helpers.py, which contains different methods that we use:

  • create_collection – Create a new collection
  • delete_collection – Delete a collection
  • create_user – Create a new user in a collection
  • add_faces_to_collection – Add faces to collection
  • associate_faces – Associate face_ids to a user in a collection
  • get_subdirs – Get all subdirectories under the S3 prefix
  • get_files – Get all files under the S3 prefix

The following is an example method for creating an Amazon Rekognition collection:

import boto3
session = boto3.Session()
client = session.client('rekognition')

def create_collection(collection_id):
    try:
        # Create a collection
        print('Creating collection:' + collection_id)
        response = client.create_collection(CollectionId=collection_id)
        print('Collection ARN: ' + response['CollectionArn'])
        print('Status code: ' + str(response['StatusCode']))
        print('Done...')
    except client.exceptions.ResourceAlreadyExistsException:
        print('Resource already exits...')

Create the collection with the following code:

import helpers
collection_id = "faces-collection"
helpers.create_collection(collection_id)

Next, let’s add the face vectors into our collection and aggregate them into user vectors.

For each user in the S3 directory, we create a user vector in the collection. Then we index the face images for each user into the collection as individual face vectors, which generates face IDs. Lastly, we associate the face IDs to the appropriate user vector.

This creates two types of vectors in our collection:

  • Individual face vectors
  • User vectors, which are built based on the face vector IDs supplied using the method associate_faces

See the following code:

bucket = '<replace_with_your_bucket>'
prefix = 'images/'

# Get all the users directories from s3 containing the images
folder_list = helpers.get_subdirs(bucket, prefix)
print(f"Found users folders: {folder_list}")
print()

for user_id in folder_list:
    face_ids = []
    helpers.create_user(collection_id, user_id)
    # Get all files per user under the s3 user directory
    images = helpers.get_files(bucket, prefix + user_id + "/")
    print (f"Found images={images} for {user_id}")
    for image in images:
        face_id = helpers.add_faces_to_collection(bucket, image, collection_id)
        face_ids.append(face_id)
    helpers.associate_faces(collection_id, user_id, face_ids)
    print()

We use the following methods:

  • get_subdirs – Returns a list of all the users’ directories. In our example, the value is [Swami,Werner].
  • get_files – Returns all the images files under the S3 prefix for the user.
  • face_ids – This is a list containing all the face IDs belonging to a user. We use this list when calling the AssociateFaces API.

As explained earlier, you can add more users by adding folders for them (the folder dictates the user ID) and add your images in that folder (no ordering is required for the files).

Now that our environment is set up and we have both individual face vectors and user vectors, let’s compare our search quality against each of them. To do that, we use a new photo with multiple people and attempt to match their faces against our collection, first against the individual face vectors and then against the user vectors.

Face search of image against a collection of individual face vectors

To search against our individual face vectors, we use the Amazon Rekognition SearchFacesByImage API. This function uses a source face image to search against individual face vectors in our collection and returns faces that match our defined similarity score threshold.

An important consideration is that the SearchFacesByImage API will only operate on the largest face detected in the image. If multiple faces are present, you need to crop each individual face and pass it separately to the method for identification.

For extracting faces details from an image (such as their location on the image), we use the Amazon Rekognition DetectFaces API.

The following detect_faces_in_image method detects faces in an image. For each face, it performs the following actions:

  • Print its bounding box location
  • Crop the face from the image and check if such face exists in the collection and print the user or ‘Unknown’
  • Print the similarity score

The example Python code uses the Pillow library for doing the image manipulations (such as printing, drawing, and cropping).

We use a similarity score threshold of 99%, which is a common setting for identity verification use cases.

Run the following code:

import detect_users
from PIL import Image

# The image we would like to match faces against our collection.
file_key= "images/photo.jpeg"

img = detect_users.detect_faces_in_image(
    bucket, 
    file_key, 
    collection_id, 
    threshold=99
)
img.show() # or in Jupyter use display(img)

file_key is the S3 object key we want to match against our collection. We have supplied an example image (photo.jpeg) under the images folder.

The following image shows our results.

Using a threshold of 99%, only one person was identified. Dr. Werner Vogels was flagged as Unknown. If we run the same code using a lower threshold of 90 (set threshold=90), we get the following results.

Now we see Dr. Werner Vogel’s face has a similarity score of 96.86%. Next, let’s check if we can get the similarity score above our defined threshold by using user vectors.

Face search of image against a collection of user vectors

To search against our user vectors, we use the Amazon Rekognition SearchUsersByImage API. This function uses a source face image to search against user vectors in our collection and returns users that match our defined similarity score threshold.

The same consideration is relevant here – the SearchUsersByImage API will only operate on the largest face detected in the image. If there are multiple faces present, you need to crop each individual face and pass it separately to the method for identification.

For extracting faces details from an image (such as their location on the image), we use the Amazon Rekognition DetectFaces API.

The following detect_users_in_image method detects faces in an image. For each face, it performs the following actions:

  • Print its bounding box location
  • Crop the face from the image and check if such user face exists in our collection and print the user or ‘Unknown’
  • Print the similarity score

See the following code:

import boto3
import io
import math
from PIL import Image, ImageDraw, ImageFont

def detect_users_in_image(bucket, key, collection_id, threshold=80):

    session = boto3.Session()
    client = session.client('rekognition')

    # Load image from S3 bucket
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, key)
    s3_response = s3_object.get()

    stream = io.BytesIO(s3_response['Body'].read())
    image = Image.open(stream)

    # Call DetectFaces to find faces in image
    response = client.detect_faces(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        Attributes=['ALL']
    )

    imgWidth, imgHeight = image.size
    draw = ImageDraw.Draw(image)

    # Calculate and display bounding boxes for each detected face
    for faceDetail in response['FaceDetails']:
        print('The detected face is between ' + str(faceDetail['AgeRange']['Low'])
              + ' and ' + str(faceDetail['AgeRange']['High']) + ' years old')

        box = faceDetail['BoundingBox']
        left = imgWidth * box['Left']
        top = imgHeight * box['Top']
        width = imgWidth * box['Width']
        height = imgHeight * box['Height']

        print('Left: ' + '{0:.0f}'.format(left))
        print('Top: ' + '{0:.0f}'.format(top))
        print('Face Width: ' + "{0:.0f}".format(width))
        print('Face Height: ' + "{0:.0f}".format(height))

        points = (
            (left, top),
            (left + width, top),
            (left + width, top + height),
            (left, top + height),
            (left, top)
        )

        # Crop the face box and convert it to byte array
        face = image.crop((left, top, left + width, top + height))
        imgByteArr = image_to_byte_array(face, image.format)

        # Search for a user in our collection using the cropped image
        user_response = client.search_users_by_image(
            CollectionId=collection_id,
            Image={'Bytes': imgByteArr},
            UserMatchThreshold=threshold
        )
        # print (user_response)

        # Extract user id and the similarity from the response
        if (user_response['UserMatches']):
            similarity = user_response['UserMatches'][0]['Similarity']
            similarity = (math.trunc(similarity * 100) / 100) if isinstance(similarity, float) else similarity
            user_id = user_response['UserMatches'][0]['User']['UserId']
            print(f"User {user_id} was found, similarity of {similarity}%")
            print("")
        else:
            user_id = "Unknown"
            similarity = 0

        draw.line(points, fill='#00d400', width=4)
        font = ImageFont.load_default(size=25)
        draw.text((left, top - 30), user_id, fill='#00d400', font=font)
        if similarity > 0:
            draw.text((left, top + 1), str(similarity), fill='#00d400', font=font)

    return image

The function returns a modified image with the results that can be saved to Amazon S3 or printed. The function also outputs statistics about the estimated ages of the faces to the terminal.

Run the following code:

import detect_users
from PIL import Image

# The image we would like to match faces against our collection.
file_key= "images/photo.jpeg"

img = detect_users.detect_users_in_image(
    bucket, 
    file_key, 
    collection_id, 
    threshold=99
)
img.show() # or in Jupyter use display(img)

The following image shows our results.

The users that exist in our collection were identified correctly with high similarity (over 99%).

We were able to increase the similarity score by using three face vectors per user vector. As we increase the number of face vectors used, we expect the similarity score for true matches to also increase. You can use up to 100 face vectors per user vector.

An end-to-end example code can be found in the GitHub repository. It includes a detailed Jupyter notebook that you can run on Amazon SageMaker Studio (or other alternatives).

Clean up

To delete the collection, use the following code:

helpers.delete_collection(collection_id)

Conclusion

In this post, we presented how to use Amazon Rekognition user vectors to implement face search against a collection of users’ faces. We demonstrated how to improve face search accuracy by using multiple face images per user and compared it against individual face vectors. Additionally, we described how you can use the different Amazon Rekognition APIs to detect faces. The provided example code serves as a solid foundation for constructing a functional face search system.

For more information about Amazon Rekognition user vectors, refer to Searching faces in a collection. If you’re new to Amazon Rekognition, you can use our Free Tier, which lasts 12 months and includes processing 5,000 images per month and storing 1,000 user vector objects per month.


About the Authors

Arik Porat is a Senior Startups Solutions Architect at Amazon Web Services. He works with startups to help them build and design their solutions in the cloud, and is passionate about machine learning and container-based solutions. In his spare time, Arik likes to play chess and video games.

Eliran Efron is a Startups Solutions Architect at Amazon Web Services. Eliran is a data and compute enthusiast, assisting startups designing their system architectures. In his spare time, Eliran likes to build and race cars in Touring races and build IoT devices.

How Virtual Factories Are Making Industrial Digitalization a Reality

To address the shift to electric vehicles, increased semiconductor demand, manufacturing onshoring, and ambitions for greater sustainability, manufacturers are investing in new factory developments and re-engineering their existing facilities.

These projects often run over budget and schedule, due to complex and manual planning processes, legacy technology infrastructure, and disconnected tools, data and teams.

To address these challenges, manufacturers are embracing digitalization and virtual factories, powered by technologies like digital twins, the Universal Scene Description (OpenUSD) ecosystem and generative AI, that enable new possibilities from planning to operations.

What Is a Virtual Factory?

A virtual factory is a physically accurate representation of a real factory. These digital twins of factories allow manufacturers to model, simulate, analyze and optimize their production processes, resources and operations without the need for a physical prototype or pilot plant.

Benefits of Virtual Factories

Virtual factories unlock many benefits and possibilities for manufacturers, including:

  • Streamlined Communication: Instead of teams relying on in-person meetings and static planning documents for project alignment, virtual factories streamline communication and ensure that critical design and operations decisions are informed by the most current data.
  • Contextualized Planning: During facility design, construction and commissioning, virtual factories allow project stakeholders to visualize designs in the context of the entire facility and production process. Planning and operations teams can compare and verify built structures with the virtual designs in real time and decrease costs by identifying errors and incorporating feedback early in the review process.
  • Optimized Facility Designs: Connecting virtual factories to simulations of processes and discrete events enables teams to optimize facility designs for production and material flow, ergonomic work design, safety and overall utilization.
  • Intelligent and Optimized Operations: Operations teams can integrate their virtual factories with valuable production data from Internet of Things technology at the edge, and tap AI to drive further optimizations.

Virtual Factories: A Testing Ground for AI and Robotics

Robotics developers are increasingly using virtual factories to train and test AI and autonomous systems that run in physical factories. For example, virtual factories can enable developers and manufacturing teams to simulate digital workers and autonomous mobile robots (AMRs), vision AI agents and sensors to create a centralized map of worker activity throughout a facility. By fusing data from simulated camera streams with multi-camera tracking, developers can generate occupancy maps that inform optimal AMR routes.
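As a rough illustration of how an occupancy map could inform an AMR route, the sketch below marks grid cells where tracked workers were observed and then searches for a shortest path through the free cells. It is a toy example under stated assumptions, not how NVIDIA's multi-camera tracking or any Omniverse tooling works: the grid size, the tracked positions, and the function names `build_occupancy` and `shortest_route` are all hypothetical.

```python
from collections import deque

def build_occupancy(rows, cols, tracked_positions):
    """Return a grid with 1 in every cell where a worker was observed."""
    grid = [[0] * cols for _ in range(rows)]
    for r, c in tracked_positions:
        grid[r][c] = 1
    return grid

def shortest_route(grid, start, goal):
    """Breadth-first search over free cells; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through the parent links to rebuild the route.
            route = []
            while cell is not None:
                route.append(cell)
                cell = parents[cell]
            return route[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# Hypothetical 3x4 floor with workers observed in two cells of column 1.
grid = build_occupancy(3, 4, [(0, 1), (1, 1)])
route = shortest_route(grid, (0, 0), (0, 3))
print(route)
```

Breadth-first search is used here only because it is the simplest way to get a shortest path on an unweighted grid. A production planner would more likely weight cells by how often they are occupied rather than treating them as strictly blocked.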

Developers can also use these physically accurate virtual factories to train and test AI agents capable of managing their robot fleets, to ensure AI-enabled robots can adapt to real-world unpredictability and to identify streamlined configurations for human-robot collaboration.

What Are the Foundations of a Virtual Factory?

Building large-scale, physically accurate virtual factories that unlock these transformational possibilities requires bringing together many tools, data formats and technologies to harmonize the representation of real-world aspects in the digital world.

Originally invented by Pixar Animation Studios, OpenUSD encompasses a collection of tools and capabilities that enable the data interoperability developers and manufacturers require to achieve their digitalization goals.

OpenUSD’s core superpower is flexible data modeling. 3D input can be accepted from source applications and combined with a variety of data, including from computer-aided design software, live sensors, documentation and maintenance records, through a unified data pipeline. OpenUSD enables developers to share these data types across different simulation tools and AI models, providing insights for all stakeholders. Data can be synced from the factory floor to the digital twin, surfacing real-time insights for factory managers and teams.

By developing virtual factory solutions on OpenUSD, developers can enhance collaboration for factory teams, allowing them to review plans, discuss optimization opportunities and make decisions in real time.

To support and accelerate the development of the OpenUSD ecosystem, Pixar, Adobe, Apple, Autodesk and NVIDIA formed the Alliance for OpenUSD, which is building open standards for USD in core specification, materials, geometry and more.

Industrial Use Cases for Virtual Factories

To unlock the potential of virtual factories, industry leaders including Autodesk, Continental, Pegatron, Rockwell Automation, Siemens and Wistron are developing virtual-factory solutions on OpenUSD and NVIDIA Omniverse, a platform of application programming interfaces (APIs) and software development kits that enable developers to build applications for complex 3D and industrial digitalization workflows based on OpenUSD.

FlexSim, an Autodesk company, uses OpenUSD to enable factory teams to analyze, visualize and optimize real-world processes with its simulation modeling for complex systems and operations. The discrete-event simulation software provides an intuitive drag-and-drop interface to create 3D simulation models, account for real-world variability, run “what-if” scenarios and perform in-depth analyses.

Developers at Continental, a leading German automotive technology company, developed ContiVerse, a factory planning and manufacturing operations application on OpenUSD and NVIDIA Omniverse. The application helps Continental optimize factory layouts and plan production processes collaboratively, leading to an expected 13% reduction in time to market. 

Partnering with software company SoftServe, Continental also developed Industrial Co-Pilot, which combines AI-driven insights with immersive visualization to deliver real-time guidance and predictive analytics to engineers. This is expected to reduce maintenance effort and downtime by 10%.

Pegatron, one of the world’s largest manufacturers of smartphones and consumer electronics, is developing virtual-factory solutions on OpenUSD to accelerate the development of new factories — as well as to minimize change orders, optimize operations and maximize production-line throughput in existing facilities.

Rockwell Automation is integrating NVIDIA Omniverse Cloud APIs and OpenUSD with its Emulate3D digital twin software to bring manufacturing teams data interoperability, live collaboration and physically based visualization for designing, building and operating industrial-scale digital twins of production systems.

Siemens, a leading technology company for automation, digitalization and sustainability and a member of the Alliance for OpenUSD, is adopting Omniverse Cloud APIs within its Siemens Xcelerator Platform, starting with Teamcenter X, the industry-leading cloud-based product lifecycle management software. This will help teams design, build and test next-generation products, manufacturing processes and factories virtually, before they’re built in the physical world.

Wistron, a leading global technology service provider and electronics manufacturer, is digitalizing new and existing factories with OpenUSD. By developing virtual-factory solutions on NVIDIA Omniverse, Wistron enables its factory teams to collaborate remotely to refine layout configurations, optimize surface mount technology and in-circuit testing lines, and transform product-on-dock testing. 

With these solutions, Wistron has achieved a 51% boost in worker efficiency and a 50% reduction in production process times. Layout optimization and real-time monitoring have decreased defect rates by 40%. And construction time on Wistron’s new NVIDIA DGX factory was cut in half, from about five months to just two and a half months.

Learn more at the Virtual Factory Use Case page, where a reference architecture provides an overview of components and capabilities developers should consider when developing virtual-factory solutions.

Get started with NVIDIA Omniverse by downloading the standard license for free, accessing OpenUSD resources and learning how Omniverse Enterprise can connect your team. Stay up to date on Instagram, Medium and X. For more, join the Omniverse community on the forums, Discord server, Twitch and YouTube channels.
