Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge to query databases directly. This task, called text-to-SQL, uses natural language processing (NLP) to convert text into semantically correct SQL queries. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language.

With the emergence of large language models (LLMs), NLP-based SQL generation has undergone a significant transformation. Demonstrating exceptional performance, LLMs are now capable of generating accurate SQL queries from natural language descriptions. However, challenges still remain. First, human language is inherently ambiguous and context-dependent, whereas SQL is precise, mathematical, and structured. This gap may result in inaccurate conversion of the user’s needs into the SQL that’s generated. Second, you might need to build text-to-SQL features for every database because data is often not stored in a single target. You may have to recreate the capability for every database to enable users with NLP-based SQL generation. Third, despite the broader adoption of centralized analytics solutions like data lakes and warehouses, complexity rises because of the different table names and other metadata required to create the SQL for the desired sources. Therefore, collecting comprehensive and high-quality metadata also remains a challenge. To learn more about text-to-SQL best practices and design patterns, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.

Our solution aims to address those challenges using Amazon Bedrock and AWS Analytics Services. We use Anthropic Claude v2.1 on Amazon Bedrock as our LLM. To address the challenges, our solution first incorporates the metadata of the data sources within the AWS Glue Data Catalog to increase the accuracy of the generated SQL query. The workflow also includes a final evaluation and correction loop, in case any SQL issues are identified by Amazon Athena, which is used downstream as the SQL engine. Athena also allows us to use a multitude of supported endpoints and connectors to cover a large set of data sources.

After we walk through the steps to build the solution, we present the results of some test scenarios with varying SQL complexity levels. Finally, we discuss how straightforward it is to incorporate different data sources into your SQL queries.

Solution overview

There are three critical components in our architecture: Retrieval Augmented Generation (RAG) with database metadata, a multi-step self-correction loop, and Athena as our SQL engine.

We use the RAG method to retrieve the table descriptions and schema descriptions (columns) from the AWS Glue metastore to ensure that the request is related to the right table and datasets. In our solution, we built the individual steps to run a RAG framework with the AWS Glue Data Catalog for demonstration purposes. However, you can also use knowledge bases in Amazon Bedrock to build RAG solutions quickly.
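
As a minimal sketch of this metadata retrieval step (the database name is illustrative), the table and column descriptions can be pulled from the AWS Glue Data Catalog with Boto3 before they are embedded:

import boto3

glue = boto3.client("glue")

def get_catalog_metadata(database_name):
    """Collect table and column metadata from the AWS Glue Data Catalog."""
    metadata = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            metadata.append({
                "table_name": table["Name"],
                "table_description": table.get("Description", ""),
                "columns": [
                    {"name": col["Name"], "type": col["Type"], "comment": col.get("Comment", "")}
                    for col in table["StorageDescriptor"]["Columns"]
                ],
            })
    return metadata

# Example: metadata for the IMDb staging database used later in this post
# tables_metadata = get_catalog_metadata("imdb_stg")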

The multi-step self-correction component allows the LLM to correct the generated SQL query for accuracy. Here, the generated SQL is sent to Athena to check for syntax errors. We use the Athena error messages to enrich the prompt, so the LLM can make more accurate and effective corrections to the generated SQL.

You can consider the error messages that occasionally come from Athena as feedback. The cost implications of an error correction step are negligible compared to the value delivered. You can even collect these corrective steps as training examples for fine-tuning your LLMs. However, we did not cover this flow in our post for simplicity purposes.

Note that there is always an inherent risk of inaccuracies, which naturally comes with generative AI solutions. Although Athena error messages are highly effective at mitigating this risk, you can add more controls and reviews, such as human feedback or example queries for fine-tuning, to minimize such risks further.

Athena not only allows us to correct the SQL queries, but it also simplifies the overall problem for us because it serves as the hub, where the spokes are multiple data sources. Access management, SQL syntax, and more are all handled via Athena.

The following diagram illustrates the solution architecture.

Figure 1. The solution architecture and process flow.

The process flow includes the following steps:

  1. Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
  2. Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store them in an Amazon OpenSearch Serverless vector store, which serves as the knowledge base in our RAG framework.

At this stage, the process is ready to receive the query in natural language. Steps 7–9 represent a correction loop, if applicable.

  3. The user enters their query in natural language. You can use any web application to provide the chat UI. Therefore, we did not cover the UI details in our post.
  4. The solution applies a RAG framework via similarity search, which adds extra context from the metadata in the vector database. This context is used to find the correct table, database, and attributes.
  5. The query is merged with the context and sent to Anthropic Claude v2.1 on Amazon Bedrock.
  6. The solution receives the generated SQL query from the model and connects to Athena to validate the syntax.
  7. If Athena returns an error message indicating that the syntax is incorrect, the solution takes the error text from Athena’s response.
  8. The new prompt adds Athena’s response.
  9. The model creates the corrected SQL and continues the process. This iteration can be performed multiple times.
  10. Finally, we run the SQL using Athena and generate the output, which is presented to the user. For the sake of architectural simplicity, we did not show this step in the diagram.

Prerequisites

For this post, you should complete the following prerequisites:

  1. Have an AWS account.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Set up the SDK for Python (Boto3).
  4. Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
  5. Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store them in an OpenSearch Serverless vector store.

Implement the solution

You can use the following Jupyter notebook, which includes all the code snippets provided in this section, to build the solution. We recommend using Amazon SageMaker Studio to open this notebook with an ml.t3.medium instance with the Python 3 (Data Science) kernel. For instructions, refer to Train a Machine Learning Model. Complete the following steps to set up the solution:

  1. Create the knowledge base in OpenSearch Service for the RAG framework:
    def add_documents(self, index_name: str, file_name: str):
        # Load the metadata documents and index them in the OpenSearch vector store
        documents = JSONLoader(file_path=file_name, jq_schema='.', text_content=False, json_lines=False).load()
        docs = OpenSearchVectorSearch.from_documents(embedding=self.embeddings, opensearch_url=self.opensearch_domain_endpoint, http_auth=self.http_auth, documents=documents, index_name=index_name, engine="faiss")
        index_exists = self.check_if_index_exists(index_name, aws_region, opensearch_domain_endpoint, http_auth)
        if not index_exists:
            logger.info(f'index :{index_name} is not existing')
            sys.exit(-1)
        else:
            logger.info(f'index :{index_name} Got created')

  2. Build the prompt (final_question) by combining the user input in natural language (user_query), the relevant metadata from the vector store (vector_search_match), and our instructions (details):
    def userinput(user_query):
        logger.info(f'Searching metadata from vector store')

        # Retrieve the relevant metadata for the user query from the vector store
        vector_search_match = rqst.getOpenSearchEmbedding(index_name, user_query)

        details = ("It is important that the SQL query complies with Athena syntax. "
                   "During join if column name are same please use alias ex llm.customer_id in select statement. "
                   "It is also important to respect the type of columns: "
                   "if a column is string, the value should be enclosed in quotes. "
                   "If you are writing CTEs then include all the required columns. "
                   "While concatenating a non string column, make sure cast the column to string. "
                   "For date columns comparing to string, please cast the string input.")
        final_question = "\n\nHuman:" + details + vector_search_match + user_query + "\n\nAssistant:"
        answer = rqst.generate_sql(final_question)
        return answer

  3. Invoke Amazon Bedrock for the LLM (Anthropic Claude v2.1) and prompt it to generate the SQL query. In the following code, it makes multiple attempts in order to illustrate the self-correction step (a consolidated sketch of the full retry loop follows this list):
    try:
        logger.info(f'we are in Try block to generate the sql and count is :{attempt + 1}')
        generated_sql = self.llm.predict(prompt)
        # The model returns the SQL wrapped in ``` fences; extract and normalize it
        query_str = generated_sql.split("```")[1]
        query_str = " ".join(query_str.split("\n")).strip()
        sql_query = query_str[3:] if query_str.startswith("sql") else query_str

        # Validate the generated SQL against Athena
        syntaxcheckmsg = rqstath.syntax_checker(sql_query)
        if syntaxcheckmsg == 'Passed':
            logger.info(f'syntax checked for query passed in attempt number :{attempt + 1}')
            return sql_query

  4. If Athena reports any issues with the generated SQL query ({sqlgenerated}) in its response ({syntaxcheckmsg}), a new prompt (prompt) is built based on that response and the model tries again to generate the new SQL:
    else:
        prompt = f"""{prompt}
        This is syntax error: {syntaxcheckmsg}.
        To correct this, please generate an alternative SQL query which will correct the syntax error. The updated query should take care of all the syntax issues encountered. Follow the instructions mentioned above to remediate the error.
        Update the below SQL query to resolve the issue:
        {sqlgenerated}
        Make sure the updated SQL query aligns with the requirements provided in the initial question."""
        prompts.append(prompt)

  5. After the SQL is generated, the Athena client is invoked to run and generate the output:
    query_execution = self.athena_client.start_query_execution(
        QueryString=query_string,
        ResultConfiguration=result_config,
        QueryExecutionContext=query_execution_context,
    )
    execution_id = query_execution["QueryExecutionId"]

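The snippets in steps 3–5 are excerpts from a larger class. The following is a consolidated, self-contained sketch of how the retry loop fits together; the generate-and-check helpers and the attempt limit are assumptions based on the snippets above rather than the exact implementation:

MAX_ATTEMPTS = 4  # assumed retry budget; tune for your workload

def generate_validated_sql(llm, syntax_checker, prompt: str) -> str:
    """Generate SQL with the LLM and self-correct it using Athena feedback."""
    for attempt in range(MAX_ATTEMPTS):
        generated = llm.predict(prompt)
        # The model is instructed to wrap the query in ``` fences
        query_str = generated.split("```")[1]
        query_str = " ".join(query_str.split("\n")).strip()
        sql_query = query_str[3:] if query_str.startswith("sql") else query_str

        feedback = syntax_checker(sql_query)  # runs an EXPLAIN via Athena
        if feedback == "Passed":
            return sql_query

        # Enrich the prompt with Athena's error message and try again
        prompt = (
            f"{prompt}\nThis is syntax error: {feedback}.\n"
            f"Update the below SQL query to resolve the issue:\n{sql_query}"
        )
    raise RuntimeError("Could not generate valid SQL within the attempt limit")

In the self-corrected SQL example later in this post, the corrected query passes the syntax check on the second attempt.
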
Test the solution

In this section, we run our solution with different example scenarios to test different complexity levels of SQL queries.

To test our text-to-SQL solution, we use two datasets available from IMDb. Subsets of IMDb data are available for personal and non-commercial use. You can download the datasets and store them in Amazon Simple Storage Service (Amazon S3). You can use the following PySpark snippet to create tables in AWS Glue. For this example, we use title_ratings and title:

source_title_ratings_s3_path = 's3://llm-athena-output/input_data/title.ratings.tsv'
target_title_s3_path = 's3://llm-athena-output/output_data/imdb_stg/title_ratings'
source_titleratingdf = spark.read.csv(source_title_ratings_s3_path, sep="\t", header=True)
source_titleratingdf.write.mode('overwrite').format('parquet').option('path', target_title_s3_path).saveAsTable('imdb_stg.title_ratings')
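
The post uses both the title_ratings and title tables. A mirrored snippet for the title table looks like the following; the source file name and S3 paths are illustrative and follow the same pattern as above:

source_title_s3_path = 's3://llm-athena-output/input_data/title.akas.tsv'
target_title_s3_path = 's3://llm-athena-output/output_data/imdb_stg/title'
source_titledf = spark.read.csv(source_title_s3_path, sep="\t", header=True)
source_titledf.write.mode('overwrite').format('parquet').option('path', target_title_s3_path).saveAsTable('imdb_stg.title')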

Store data in Amazon S3 and metadata in AWS Glue

In this scenario, our dataset is stored in an S3 bucket. Athena has an S3 connector that allows you to use Amazon S3 as a data source that can be queried.

For our first query, we provide the input “I am new to this. Can you help me see all the tables and columns in imdb schema?”

The following is the generated query:

WITH tables AS (
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'imdb_stg'),
columns AS (
SELECT
c.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM information_schema.columns c
WHERE c.table_schema = 'imdb_stg')
SELECT
t.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM tables t
INNER JOIN columns c
ON t.table_name = c.table_name
ORDER BY
t.table_name,
c.ordinal_position
LIMIT 10;

The following screenshot and table show our output.

Table showing the attributes of the example IMDb dataset.

table_name column_name data_type
0 title titleid varchar
1 title ordering integer
2 title title varchar
3 title region varchar
4 title language varchar

For our second query, we ask “Show me all the title and details in US region whose rating is more than 9.5.”

The following is our generated query:

WITH us_titles AS (
SELECT t.title, t.region, tr.averageRating, tr.numVotes
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.region = 'US' AND cast(tr.averageRating as varchar) > '9.5'
)
SELECT title, region, averageRating, numVotes
FROM us_titles
LIMIT 100;

The response is as follows.

A table showing the results of the SQL query.

title region averageRating numVotes
0 The Way You Saw Me US 9.7 8
1 The Brother Side of the Wake US 9.6 20
2 Ignis Fatuus US 9.6 11
3 Love and Hip Hop Atlanta US 9.9 11
4 ronny/lily US 9.7 14781

For our third query, we enter “Great Response! Now show me all the original type titles having ratings more than 7.5 and not in the US region.”

The following query is generated:

WITH titles AS (
SELECT t.titleId,
t.title,
t.types,
t.isOriginalTitle,
cast(tr.averageRating as decimal(3,1)) as averageRating,
tr.numVotes,
t.region
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.isOriginalTitle = '1'
AND cast(tr.averageRating as decimal(3,1)) > 7.5
AND t.region != 'US')
SELECT *
FROM titles
LIMIT 100;

We get the following results.

A single row showing the result of the SQL query.

titleId title types isOriginalTitle averageRating numVotes region
0 tt0986264 Taare Zameen Par original 1 8.3 203760 XWW

Generate self-corrected SQL

This scenario simulates a SQL query that has syntax issues. Here, the generated SQL will be self-corrected based on the response from Athena. In the following response, Athena gave a COLUMN_NOT_FOUND error and mentioned that table_description can’t be resolved:

Status : {'State': 'FAILED', 'StateChangeReason': "COLUMN_NOT_FOUND: line 1:50: Column 'table_description' 
cannot be resolved or requester is not authorized to access requested resources",
'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 501000, tzinfo=tzlocal()),
'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 778000, tzinfo=tzlocal()),
'AthenaError': {'ErrorCategory': 2, 'ErrorType': 1006, 'Retryable': False, 'ErrorMessage': "COLUMN_NOT_FOUND: 
line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to 
access requested resources"}}
COLUMN_NOT_FOUND: line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to access requested resources
Try Count: 2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,Try Count: 2
we are in Try block to generate the sql and count is :2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,we are in Try block to generate the sql and count is :2
Executing: Explain WITH tables AS ( SELECT table_name FROM information_schema.tables WHERE table_schema = 'imdb_stg' ), columns AS ( SELECT c.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM information_schema.columns c WHERE c.table_schema = 'imdb_stg' ) SELECT t.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM tables t INNER JOIN columns c ON t.table_name = c.table_name ORDER BY t.table_name, c.ordinal_position LIMIT 10;
I am checking the syntax here
execution_id: 904857c3-b7ac-47d0-8e7e-6b9d0456099b
Status : {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 29, 537000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 30, 183000, tzinfo=tzlocal())}
syntax checked for query passed in tries number :2

Using the solution with other data sources

When you use the solution with other data sources, Athena handles the job for you. To do this, Athena uses data source connectors that can be used with federated queries. You can consider a connector as an extension of the Athena query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), as well as JDBC-compliant relational data sources such as MySQL and PostgreSQL, under the Apache 2.0 license. After you set up a connection to any data source, you can use the preceding code base to extend the solution, as shown in the example that follows. For more information, refer to Query any data source with Amazon Athena’s new federated query.
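
As a minimal sketch (the catalog and database names are illustrative, not part of the original solution), a federated source registered through a connector can be queried with the same Athena client call used earlier by setting the catalog in the query execution context:

import boto3

athena_client = boto3.client("athena")

response = athena_client.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={
        "Catalog": "my_dynamodb_catalog",   # name of the registered data source connector
        "Database": "default",
    },
    ResultConfiguration={"OutputLocation": "s3://llm-athena-output/athena_results/"},
)
execution_id = response["QueryExecutionId"]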

Clean up

To clean up the resources, start by cleaning up the S3 bucket where the data resides. The solution does not incur any cost unless your application invokes Amazon Bedrock. Nevertheless, as an infrastructure management best practice, we recommend deleting the resources created in this demonstration.

Conclusion

In this post, we presented a solution that allows you to use NLP to generate complex SQL queries against a variety of data sources enabled by Athena. We also increased the accuracy of the generated SQL queries via a multi-step evaluation loop based on error messages from downstream processes. Additionally, we used the metadata in the AWS Glue Data Catalog to resolve the table names referenced in the query through the RAG framework. We then tested the solution in various realistic scenarios with different query complexity levels. Finally, we discussed how to apply this solution to different data sources supported by Athena.

Amazon Bedrock is at the center of this solution. Amazon Bedrock can help you build many generative AI applications. To get started with Amazon Bedrock, we recommend following the quick start in the following GitHub repo and familiarizing yourself with building generative AI applications. You can also try knowledge bases in Amazon Bedrock to build such RAG solutions quickly.


About the Authors

Sanjeeb Panda is a Data and ML engineer at Amazon. With a background in AI/ML, data science, and big data, Sanjeeb designs and develops innovative data and ML solutions that solve complex technical challenges and achieve strategic goals for global 3P sellers managing their businesses on Amazon. Outside of his work as a Data and ML engineer at Amazon, Sanjeeb is an avid foodie and music enthusiast.

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a postdoc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate at MIT. Burak is passionate about yoga and meditation.

Read More

And … Action! Cuebric CEO Provides Insights Into Filmmaking Using AI

These days, just about everyone is a content creator. But can generative AI help people create high-quality films and other content affordably? Find out from Pinar Seyhan Demirdag, cofounder and CEO of Cuebric, in her conversation with NVIDIA AI Podcast host Noah Kravitz.

Cuebric is on a mission to offer new solutions in filmmaking and content creation through immersive, two-and-a-half-dimensional cinematic environments. Its AI-powered application aims to help creators quickly bring their ideas to life, making high-quality production more accessible.

Demirdag discusses how Cuebric uses generative AI to enable the creation of engaging environments affordably. Listen in to find out about the current landscape of content creation, the role of AI in simplifying the creative process, and Cuebric’s participation in NVIDIA’s GTC technology conference.

Time Stamps:

1:15: Getting to know Pinar Seyhan Demirdag and Cuebric
2:30: The beginnings and goals of Cuebric
4:45: How Cuebric’s AI application works for filmmakers
9:00: Advantages of AI in content creation
13:20: Making high-quality production budget-friendly
17:35: The future of AI in creative endeavors
22:00: Cuebric at NVIDIA GTC

You Might Also Like…

MIT’s Anant Agarwal on AI in Education – Ep. 197

AI could help students work smarter, not harder. Anant Agarwal, founder of edX and chief platform officer at 2U, shares his vision for the future of online education and the impact of AI in revolutionizing the learning experience.

UF Provost Joe Glover on Building a Leading AI University – Ep. 186

Joe Glover, provost and senior vice president of academic affairs at the University of Florida, discusses the university’s efforts to implement AI across all aspects of higher education, including a public-private partnership with NVIDIA that has helped transform UF into one of the leading AI universities in the country.

NVIDIA’s Marc Hamilton on Building the Cambridge-1 Supercomputer During a Pandemic – Ep. 137

Cambridge-1, the U.K.’s most powerful supercomputer, ranks among the world’s top 3 most energy-efficient supercomputers and was built to help healthcare researchers make new discoveries. Marc Hamilton, vice president of solutions architecture and engineering at NVIDIA, speaks on how he remotely oversaw its construction.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

Time to Skill Up: Game Reviewer Ralph Panebianco Wields NVIDIA RTX for the Win

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

YouTube content creator Ralph Panebianco really, really loves video games.

Since getting an original Nintendo Entertainment System at the age of four, Panebianco, this week’s featured In the NVIDIA Studio creator, has spent much of his free time playing video games. He pursued a career in gaming in his native country of Australia before pivoting to content creation, opening a YouTube channel called Skill Up, where he reviews the latest video games.

“When I wasn’t playing video games, I was reading about them, and now I get to talk about them for a living,” he said.

And calling all art fans: the latest Studio Standouts video features film noir-themed artwork brought to life with dramatic, monochromatic flair.

Video Editing Skillz

Panebianco works with his partner to create in-depth, insightful reviews of the latest video games on his Skill Up YouTube channel, which has garnered nearly 1 million subscribers. Below is a recent video reviewing Pacific Drive, a title available on the NVIDIA GeForce NOW cloud gaming service, powered by GeForce RTX GPUs.

“Creatively, we don’t view game reviews as functional buying guides with a list of pros and cons,” said Panebianco. “We view reviews as a chance to crack a game open and really show the audience what makes it tick. They’re sort of mini-essays on game design, delving deep into why specific game mechanics do or don’t work.”

The content creation process begins with booting up the game on his PC, powered by the recently launched GeForce RTX 4080 SUPER graphics card. This allows the Skill Up team to tap RTX ray tracing and NVIDIA DLSS — breakthrough technologies that use AI to create additional frames and improve image quality.

He records video footage primarily using GeForce Experience, a companion to NVIDIA GeForce GPUs that enables users to capture assets, optimize game settings and keep drivers up to date, among other features.

When footage requires high-dynamic range, the team uses the OBS Studio open-source software with AV1 hardware encoding to achieve 40% more efficient encoding on average than H.264 and deliver higher quality than competing GPUs.

“The AV1 encoder is ridiculously efficient in terms of file size,” he said.

NVIDIA GPUs and OBS Studio software work in synergy.

Once the footage is ready, Panebianco writes a video script in Microsoft Word and then records himself, using Audacity. He uses the AI-powered NVIDIA Broadcast app, free for RTX GPU owners, to eliminate background noise and achieve professional studio quality.

Panebianco then hands off the files to his editor for production in Adobe Premiere Pro, where a number of GPU-accelerated, AI-powered features such as Enhance Speech, Auto Reframe and Unsharp Mask help speed the video editing process.

NVIDIA’s GPU-accelerated video decoder (NVDEC) enables smooth playback and scrubbing of high-resolution videos.

Next, Panebianco exports the final files twice as fast thanks to the dual AV1 encoders in his RTX GPU. Lastly, his editor creates a YouTube thumbnail in Adobe Photoshop, and then the video is ready for publishing.

Adobe Photoshop has over 30 GPU-accelerated features that help modify and adjust images smoothly and quickly.

“Almost my entire workflow was enhanced by NVIDIA’s hardware,” Panebianco shared. “It’s not just about the hardware making for efficient encoding or lightning-fast, hardware-enabled rendering — it’s about the end-to-end toolset.”

Panebianco has words of wisdom for aspiring content creators.

“Worry less about the numbers and more about the quality,” he said. “The metrics grind pays little in the way of dividends, but putting out truly excellent content is an almost failure-proof path to growth.”

Video game content creator Ralph Panebianco.

Catch Panebianco’s video game reviews on the Skill Up YouTube channel.

Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

How Axfood enables accelerated machine learning throughout the organization using Amazon SageMaker

This is a guest post written by Axfood AB. 

In this post, we share how Axfood, a large Swedish food retailer, improved operations and scalability of their existing artificial intelligence (AI) and machine learning (ML) operations by prototyping in close collaboration with AWS experts and using Amazon SageMaker.

Axfood is Sweden’s second largest food retailer, with over 13,000 employees and more than 300 stores. Axfood has a structure with multiple decentralized data science teams with different areas of responsibility. Together with a central data platform team, the data science teams bring innovation and digital transformation through AI and ML solutions to the organization. Axfood has been using Amazon SageMaker to cultivate their data using ML and has had models in production for many years. Lately, the level of sophistication and the sheer number of models in production is increasing exponentially. However, even though the pace of innovation is high, the different teams had developed their own ways of working and were in search of a new MLOps best practice.

Our challenge

To stay competitive in terms of cloud services and AI/ML, Axfood chose to partner with AWS and has been collaborating with them for many years.

During one of our recurring brainstorming sessions with AWS, we were discussing how to best collaborate across teams to increase the pace of innovation and efficiency of data science and ML practitioners. We decided to put in a joint effort to build a prototype on a best practice for MLOps. The aim of the prototype was to build a model template for all data science teams to build scalable and efficient ML models—the foundation to a new generation of AI and ML platforms for Axfood. The template should bridge and combine best practices from AWS ML experts and company-specific best practice models—the best of both worlds.

We decided to build a prototype from one of the currently most developed ML models within Axfood: forecasting sales in stores. More specifically, the forecast for fruits and vegetables of upcoming campaigns for food retail stores. Accurate daily forecasting supports the ordering process for the stores, increasing sustainability by minimizing food waste as a result of optimizing sales by accurately predicting the needed in-store stock levels. This was the perfect place to start for our prototype—not only would Axfood gain a new AI/ML platform, but we would also get a chance to benchmark our ML capabilities and learn from leading AWS experts.

Our solution: A new ML template on Amazon SageMaker Studio

Building a full ML pipeline that is designed for an actual business case can be challenging. In this case, we are developing a forecasting model, so there are two main steps to complete:

  1. Train the model to make predictions using historical data.
  2. Apply the trained model to make predictions of future events.

In Axfood’s case, a well-functioning pipeline for this purpose was already set up using SageMaker notebooks and orchestrated by the third-party workflow management platform Airflow. However, there are many clear benefits of modernizing our ML platform and moving to Amazon SageMaker Studio and Amazon SageMaker Pipelines. Moving to SageMaker Studio provides many predefined out-of-the-box features:

  • Monitoring model and data quality as well as model explainability
  • Built-in integrated development environment (IDE) tools such as debugging
  • Cost/performance monitoring
  • Model acceptance framework
  • Model registry

However, the most important incentive for Axfood is the ability to create custom project templates using Amazon SageMaker Projects to be used as a blueprint for all data science teams and ML practitioners. The Axfood team already had a robust and mature level of ML modeling, so the main focus was on building the new architecture.

Solution overview

Axfood’s proposed new ML framework is structured around two main pipelines: the model build pipeline and the batch inference pipeline:

  • These pipelines are versioned within two separate Git repositories: one build repository and one deploy (inference) repository. Together, they form a robust pipeline for forecasting fruits and vegetables.
  • The pipelines are packaged into a custom project template using SageMaker Projects in integration with a third-party Git repository (Bitbucket) and Bitbucket pipelines for continuous integration and continuous deployment (CI/CD) components.
  • The SageMaker project template includes seed code corresponding to each step of the build and deploy pipelines (we discuss these steps in more detail later in this post) as well as the pipeline definition—the recipe for how the steps should be run.
  • Automation of building new projects based on the template is streamlined through AWS Service Catalog, where a portfolio is created, serving as an abstraction for multiple products.
  • Each product translates into an AWS CloudFormation template, which is deployed when a data scientist creates a new SageMaker project with our MLOps blueprint as the foundation. This activates an AWS Lambda function that creates a Bitbucket project with two repositories—model build and model deploy—containing the seed code.

The following diagram illustrates the solution architecture. Workflow A depicts the intricate flow between the two model pipelines—build and inference. Workflow B shows the flow to create a new ML project.

Model build pipeline

The model build pipeline orchestrates the model’s lifecycle, beginning from preprocessing, moving through training, and culminating in being registered in the model registry:

  • Preprocessing – Here, the SageMaker ScriptProcessor class is employed for feature engineering, resulting in the dataset the model will be trained on.
  • Training and batch transform – Custom training and inference containers from SageMaker are harnessed to train the model on historical data and create predictions on the evaluation data using a SageMaker Estimator and Transformer for the respective tasks.
  • Evaluation – The trained model undergoes evaluation by comparing the generated predictions on the evaluation data to the ground truth using ScriptProcessor.
  • Baseline jobs – The pipeline creates baselines based on statistics in the input data. These are essential for monitoring data and model quality, as well as feature attributions.
  • Model registry – The trained model is registered for future use. The model will be approved by designated data scientists to deploy the model for use in production.
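
The following is a minimal, generic sketch of how the steps above can be expressed with the SageMaker Pipelines SDK. It is not Axfood’s actual template; the container images, script names, role, bucket, and model package group are placeholders, and the evaluation and baseline steps are omitted for brevity:

from sagemaker.processing import ScriptProcessor, ProcessingOutput
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.pipeline import Pipeline

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Preprocessing: feature engineering with a ScriptProcessor
processor = ScriptProcessor(image_uri="<preprocessing-image-uri>", command=["python3"],
                            role=role, instance_count=1, instance_type="ml.m5.xlarge")
step_preprocess = ProcessingStep(
    name="Preprocess",
    processor=processor,
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
    code="preprocess.py",
)

# Training with a custom container
estimator = Estimator(image_uri="<training-image-uri>", role=role,
                      instance_count=1, instance_type="ml.m5.xlarge",
                      output_path="s3://<bucket>/model-artifacts")
step_train = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={"train": TrainingInput(
        s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)},
)

# Register the trained model for manual approval by a data scientist
step_register = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"], response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"], transform_instances=["ml.m5.xlarge"],
    model_package_group_name="forecasting-models",
    approval_status="PendingManualApproval",
)

pipeline = Pipeline(name="model-build-pipeline",
                    steps=[step_preprocess, step_train, step_register])
pipeline.upsert(role_arn=role)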

For production environments, data ingestion and trigger mechanisms are managed via a primary Airflow orchestration. Meanwhile, during development, the pipeline is activated each time a new commit is introduced to the model build Bitbucket repository. The following figure visualizes the model build pipeline.

Batch inference pipeline

The batch inference pipeline handles the inference phase, which consists of the following steps:

  • Preprocessing – Data is preprocessed using ScriptProcessor.
  • Batch transform – The model uses the custom inference container with a SageMaker Transformer and generates predictions given the input preprocessed data. The model used is the latest approved trained model in the model registry.
  • Postprocessing – The predictions undergo a series of postprocessing steps using ScriptProcessor.
  • Monitoring – Continuous surveillance completes checks for drifts related to data quality, model quality, and feature attribution.

If discrepancies arise, a business logic within the postprocessing script assesses whether retraining the model is necessary. The pipeline is scheduled to run at regular intervals.
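
As a minimal sketch of the batch transform step described above (the model name, S3 paths, and instance type are placeholders rather than Axfood’s configuration), the latest approved model can generate predictions with a SageMaker Transformer:

from sagemaker.transformer import Transformer

# The model name would come from the latest approved package in the model registry
transformer = Transformer(
    model_name="forecasting-model",                       # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/batch-predictions/",
    accept="text/csv",
)

transformer.transform(
    data="s3://<bucket>/preprocessed-inference-data/",    # output of the preprocessing step
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()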

The following diagram illustrates the batch inference pipeline. Workflow A corresponds to preprocessing, data quality and feature attribution drift checks, inference, and postprocessing. Workflow B corresponds to model quality drift checks. These pipelines are divided because the model quality drift check will only run if new ground truth data is available.

SageMaker Model Monitor

With Amazon SageMaker Model Monitor integrated, the pipelines benefit from real-time monitoring on the following:

  • Data quality – Monitors any drift or inconsistencies in data
  • Model quality – Watches for any fluctuations in model performance
  • Feature attribution – Checks for drift in feature attributions

Monitoring model quality requires access to ground truth data. Although obtaining ground truth can be challenging at times, using data or feature attribution drift monitoring serves as a competent proxy to model quality.

Specifically, in the case of data quality drift, the system watches out for the following:

  • Concept drift – This pertains to changes in the correlation between input and output, requiring ground truth
  • Covariate shift – Here, the emphasis is on alterations in the distribution of independent input variables

SageMaker Model Monitor’s data drift functionality meticulously captures and scrutinizes the input data, deploying rules and statistical checks. Alerts are raised whenever anomalies are detected.
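
As a minimal sketch of setting up such a data quality baseline with the SageMaker Python SDK (the role and S3 paths are placeholders), a DefaultModelMonitor can derive the statistics and constraints that later drift checks are evaluated against:

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Create baseline statistics and constraints from the training dataset;
# subsequent monitoring jobs compare new batch inputs against these artifacts
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/monitoring/baseline",
    wait=True,
)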

In parallel to using data quality drift checks as a proxy for monitoring model degradation, the system also monitors feature attribution drift using the normalized discounted cumulative gain (NDCG) score. This score is sensitive to both changes in feature attribution ranking order as well as to the raw attribution scores of features. By monitoring drift in attribution for individual features and their relative importance, it’s straightforward to spot degradation in model quality.
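
To make the NDCG-based check concrete, the following toy computation (an illustration, not Axfood’s implementation) scores how much a live feature-attribution ranking has drifted from a baseline ranking; values close to 1.0 indicate little drift:

import math

def ndcg_attribution_drift(baseline_scores: dict, live_scores: dict) -> float:
    """NDCG of the live feature-attribution ranking, using baseline scores as relevance."""
    live_ranking = sorted(live_scores, key=live_scores.get, reverse=True)
    ideal_ranking = sorted(baseline_scores, key=baseline_scores.get, reverse=True)

    dcg = sum(baseline_scores[f] / math.log2(i + 2) for i, f in enumerate(live_ranking))
    idcg = sum(baseline_scores[f] / math.log2(i + 2) for i, f in enumerate(ideal_ranking))
    return dcg / idcg

baseline = {"price": 0.6, "campaign": 0.3, "weekday": 0.1}   # illustrative attributions
live = {"campaign": 0.5, "price": 0.4, "weekday": 0.1}
print(round(ndcg_attribution_drift(baseline, live), 3))       # below 1.0 because the ranking changed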

Model explainability

Model explainability is a pivotal part of ML deployments, because it ensures transparency in predictions. For a detailed understanding, we use Amazon SageMaker Clarify.

It offers both global and local model explanations through a model-agnostic feature attribution technique based on the Shapley value concept. This is used to decode why a particular prediction was made during inference. Such explanations, which are inherently contrastive, can vary based on different baselines. SageMaker Clarify aids in determining this baseline using K-means or K-prototypes in the input dataset, which is then added to the model build pipeline. This functionality enables us to build generative AI applications in the future for increased understanding of how the model works.

Industrialization: From prototype to production

The MLOps project includes a high degree of automation and can serve as a blueprint for similar use cases:

  • The infrastructure can be reused entirely, whereas the seed code can be adapted for each task, with most changes limited to the pipeline definition and the business logic for preprocessing, training, inference, and postprocessing.
  • The training and inference scripts are hosted using SageMaker custom containers, so a variety of models can be accommodated without changes to the data and model monitoring or model explainability steps, as long as the data is in tabular format.

After finishing the work on the prototype, we turned to how we should use it in production. To do so, we felt the need to make some additional adjustments to the MLOps template:

  • The original seed code used in the prototype for the template included preprocessing and postprocessing steps run before and after the core ML steps (training and inference). However, when scaling up to use the template for multiple use cases in production, the built-in preprocessing and postprocessing steps may lead to decreased generality and reproduction of code.
  • To improve generality and minimize repetitive code, we chose to slim down the pipelines even further. Instead of running the preprocessing and postprocessing steps as part of the ML pipeline, we run these as part of the primary Airflow orchestration before and after triggering the ML pipeline.
  • This way, use case-specific processing tasks are abstracted from the template, and what is left is a core ML pipeline performing tasks that are general across multiple use cases with minimal repetition of code. Parameters that differ between use cases are supplied as input to the ML pipeline from the primary Airflow orchestration.

The result: A rapid & efficient approach to model build & deployment

The prototype in collaboration with AWS has resulted in an MLOps template following current best practices that is now available for use to all of Axfood’s data science teams. By creating a new SageMaker project within SageMaker Studio, data scientists can get started on new ML projects quickly and seamlessly transition to production, allowing for more efficient time management. This is made possible by automating tedious, repetitive MLOps tasks as part of the template.

Furthermore, several new functionalities have been added in an automated fashion to our ML setup. These gains include:

  • Model monitoring – We can perform drift checks for model and data quality as well as model explainability
  • Model and data lineage – It’s now possible to trace exactly which data has been used for which model
  • Model registry – This helps us catalog models for production and manage model versions

Conclusion

In this post, we discussed how Axfood improved operations and scalability of our existing AI and ML operations in collaboration with AWS experts and by using SageMaker and its related products.

These improvements will help Axfood’s data science teams build ML workflows in a more standardized way and will greatly simplify analysis and monitoring of models in production—ensuring the quality of ML models built and maintained by our teams.

Please leave any feedback or questions in the comments section.


About the Authors

Dr. Björn Blomqvist is the Head of AI Strategy at Axfood AB. Before joining Axfood AB he led a team of Data Scientists at Dagab, a part of Axfood, building innovative machine learning solutions with the mission to provide good and sustainable food to people all over Sweden. Born and raised in the north of Sweden, in his spare time Björn ventures to snowy mountains and open seas.

Oskar Klang is a Senior Data Scientist at the analytics department at Dagab, where he enjoys working with everything analytics and machine learning, e.g. optimizing supply chain operations, building forecasting models and, more recently, GenAI applications. He is committed to building more streamlined machine learning pipelines, enhancing efficiency and scalability.

Pavel Maslov is a Senior DevOps and ML engineer in the Analytic Platforms team. Pavel has extensive experience in the development of frameworks, infrastructure, and tools in the domains of DevOps and ML/AI on the AWS platform. Pavel has been one of the key players in building the foundational capability within ML at Axfood.

Joakim Berg is the Team Lead and Product Owner for Analytic Platforms, based in Stockholm, Sweden. He leads a team of data platform and DevOps/MLOps engineers providing data and ML platforms for the data science teams. Joakim has many years of experience leading senior development and architecture teams from different industries.

Read More

Structured knowledge from LLMs improves prompt learning for visual language models

This research paper was presented at the 38th Annual AAAI Conference on Artificial Intelligence (opens in new tab) (AAAI-24), the premier forum for advancing understanding of intelligence and its implementation in machines.

We’re seeing remarkable abilities from visual language models in transforming text descriptions into images. However, creating high-quality visuals requires crafting precise prompts that capture the relationships among the different image elements, a capability that standard prompts lack. In our paper, “Learning Hierarchical Prompt with Structured Linguistic Knowledge for Language Models,” presented at AAAI-24, we introduce a novel approach using large language models (LLMs) to enhance the images created by visual language models. By creating detailed graphs of image descriptions, we leverage LLMs’ linguistic knowledge to produce richer images, expanding their utility in practical applications. 

Figure 1. A structured graph provides descriptions for each class name. Three types of prompts are compared for recognizing a bird: a templated prompt (a photo of a bird), a natural language description of the bird category, and a tree-structured prompt that highlights the key entities of birds and their corresponding attributes, such as beak and wings.

Figure 1 illustrates our method for constructing a structured graph containing key details for each category, or class. These graphs contain structured information, with entities (objects, people, and concepts), attributes (characteristics), and the relationships between them. For example, when defining “water lily,” we include entities like “leaves” or “blooms”, their attributes, “round” and “white”, and then apply LLMs’ reasoning capabilities to identify how these terms relate to each other. This is shown in Figure 2.

Figure 2. With instructions fed into the LLM, we can receive category-related descriptions along with corresponding structured graphs: the LLM is first instructed to give a category description and is then asked to parse the key entities, attributes, and their relationships from the unstructured description.

How to model structural knowledge

After identifying and structuring the relationships within the generated prompt descriptions, we implement Hierarchical Prompt Tuning (HPT), a new prompt-tuning framework that organizes content hierarchically. This approach allows the visual language model to discern the different levels of information in a prompt, ranging from specific details to broader categories and overarching themes across multiple knowledge domains, as shown in Figure 3. This facilitates the model’s understanding of the connections among these elements, improving its ability to process complex queries across various topics.

Figure 3. HPT is based on a dual-path asymmetric network, which receives images and various types of text inputs: descriptions and relationship-guided graphs with class names are fed to the frozen text encoder and the hierarchical prompted text encoder, respectively.

Central to this method is a state-of-the-art relationship-guided attention module, designed to help the model identify and analyze the complex interconnections among elements within a graph. This module also understands the interactions between different entities and attributes through a cross-level self-attention mechanism. Self-attention enables the model to assess and prioritize various parts of the input data—here, the graph—according to their relevance. “Cross-level” self-attention extends this capability across various semantic layers within the graph, allowing the model to examine relationships at multiple levels of abstraction. This feature helps the model to discern the interrelations of prompts (or input commands/questions) across these various levels, helping it gain a deeper understanding of the categories or concepts.
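
To illustrate the general mechanism (this is a toy sketch, not the relationship-guided attention module from the paper), masked self-attention restricted to graph relationships can be written as follows:

import numpy as np

def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Toy scaled dot-product self-attention with a relationship mask.

    x    : (n_tokens, d) token embeddings (for example, entity and attribute prompts)
    mask : (n_tokens, n_tokens) with 1 where a graph edge (relationship) exists, else 0
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask == 1, scores, -1e9)      # attend only along graph relationships
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x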

Our findings offer valuable insights into a more effective approach to navigating and understanding complex linguistic data, improving the model’s knowledge discovery and decision-making processes. Building on these advances, we refined the traditional approach to text encoding by introducing a hierarchical, prompted text encoder, shown in Figure 4. Our aim is to improve how textual information is aligned or correlated with visual data, a necessity for vision-language models that must interpret both text and visual inputs.

Figure 4. A hierarchical prompted text encoder learns from multi-level prompts (low-level, high-level, and global-level), with a relationship-guided attention module for modeling structural knowledge.

Looking ahead

By incorporating structured knowledge into our model training frameworks, our research lays the groundwork for more sophisticated applications. One example is enhanced image captioning, where visual language models gain the ability to describe the contents of photographs, illustrations, or any visual media with greater accuracy and depth. This improvement could significantly benefit various applications, such as assisting visually impaired users. Additionally, we envision advances in text-to-image generation, enabling visual language models to produce visual representations that are more precise, detailed, and contextually relevant based on textual descriptions.

Looking forward, we hope our research ignites a broader interest in exploring the role of structured knowledge in improving prompt tuning for both visual and language comprehension. This exploration is expected to extend the use of these models beyond basic classification tasks—where models categorize or label data—towards enabling more nuanced and accurate interactions between people and AI systems. By doing so, we pave the way for AI systems to more effectively interpret the complexities of human language.

Acknowledgements

Thank you to Yubin Wang for his contributions in implementing the algorithm and executing the experiments.

The post Structured knowledge from LLMs improves prompt learning for visual language models appeared first on Microsoft Research.

Read More

Rack ‘n’ Roll: NVIDIA Grace Hopper Systems Gather at GTC

The spirit of Grace Hopper will live on at NVIDIA GTC.

Accelerated systems using powerful processors — named in honor of the pioneer of software programming — will be on display at the global AI conference running March 18-21, ready to take computing to the next level.

System makers will show more than 500 servers in multiple configurations across 18 racks, all packing NVIDIA GH200 Grace Hopper Superchips. They’ll form the largest display at NVIDIA’s booth in the San Jose Convention Center, filling the MGX Pavilion.

MGX Speeds Time to Market

NVIDIA MGX is a blueprint for building accelerated servers with any combination of GPUs, CPUs and data processing units (DPUs) for a wide range of AI, high performance computing and NVIDIA Omniverse applications. It’s a modular reference architecture for use across multiple product generations and workloads.

GTC attendees can get an up-close look at MGX models tailored for enterprise, cloud and telco-edge uses, such as generative AI inference, recommenders and data analytics.

The pavilion will showcase accelerated systems packing single and dual GH200 Superchips in 1U and 2U chassis, linked via NVIDIA BlueField-3 DPUs and NVIDIA Quantum-2 400Gb/s InfiniBand networks over LinkX cables and transceivers.

The systems support industry standards for 19- and 21-inch rack enclosures, and many provide E1.S bays for nonvolatile storage.

Grace Hopper in the Spotlight

Here’s a sampler of MGX systems now available:

  • ASRock RACK’s MECAI, measuring 450 x 445 x 87mm, accelerates AI and 5G services in constrained spaces at the edge of telco networks.
  • ASUS’s MGX server, the ESC NM2N-E1, slides into a rack that holds up to 32 GH200 processors and supports air- and water-cooled nodes.
  • Foxconn provides a suite of MGX systems, including a 4U model that accommodates up to eight NVIDIA H100 NVL PCIe Tensor Core GPUs.
  • GIGABYTE’s XH23-VG0-MGX can accommodate plenty of storage in its six 2.5-inch Gen5 NVMe hot-swappable bays and two M.2 slots.
  • Inventec’s systems can slot into 19- and 21-inch racks and use three different implementations of liquid cooling.
  • Lenovo supplies a range of 1U, 2U and 4U MGX servers, including models that support direct liquid cooling.
  • Pegatron’s air-cooled AS201-1N0 server packs a BlueField-3 DPU for software-defined, hardware-accelerated networking.
  • QCT can stack 16 of its QuantaGrid D74S-IU systems, each with two GH200 Superchips, into a single QCT QoolRack.
  • Supermicro’s ARS-111GL-NHR with nine hot-swappable fans is part of a portfolio of air- and liquid-cooled GH200 and NVIDIA Grace CPU systems.
  • Wiwynn’s SV7200H, a 1U dual GH200 system, supports a BlueField-3 DPU and a liquid-cooling subsystem that can be remotely managed.
  • Wistron’s MGX servers are 4U GPU systems for AI inference and mixed workloads, supporting up to eight accelerators in one system.

The new servers are in addition to three accelerated systems using MGX announced at COMPUTEX last May — Supermicro’s ARS-221GL-NR using the Grace CPU and QCT’s QuantaGrid S74G-2U and S74GM-2U powered by the GH200.

Grace Hopper Packs Two in One

System builders are adopting the hybrid processor because it packs a punch.

GH200 Superchips combine a high-performance, power-efficient Grace CPU with a muscular NVIDIA H100 GPU. They share hundreds of gigabytes of memory over a fast NVIDIA NVLink-C2C interconnect.

The result is a processor and memory complex well-suited to take on today’s most demanding jobs, such as running large language models. They have the memory and speed needed to link generative AI models to data sources that can improve their accuracy using retrieval-augmented generation, aka RAG.

Recommenders Run 4x Faster

In addition, the GH200 Superchip delivers greater efficiency and up to 4x more performance than using the H100 GPU with traditional CPUs for tasks like making recommendations for online shopping or media streaming.

In its debut on the MLPerf industry benchmarks last November, GH200 systems ran all data center inference tests, extending the already leading performance of H100 GPUs.

In all these ways, GH200 systems are taking to new heights a computing revolution their namesake helped start on the first mainframe computers more than seven decades ago.

Register for NVIDIA GTC, the conference for the era of AI, running March 18-21 at the San Jose Convention Center and virtually.

And get the 30,000-foot view from NVIDIA CEO and founder Jensen Huang in his GTC keynote.

Read More

Meet the Omnivore: Mode Maison Harnesses OpenUSD to Drive Innovations in Retail With High-Fidelity Digital Twins

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use OpenUSD to build tools, applications and services for 3D workflows and physically accurate virtual worlds.

A failed furniture-shopping trip turned into a business idea for Steven Gay, cofounder and CEO of Mode Maison.

Gay grew up in Houston and studied at the University of Texas before working in New York as one of the youngest concept designers at Ralph Lauren. He was inspired to start his own company after a long day of trying — and failing — to pick out a sofa.

The experience illuminated how the luxury home-goods industry has traditionally lagged in adopting digital technologies, especially those for creating immersive, interactive experiences for consumers.

Gay founded Mode Maison in 2018 with the goal of solving this challenge and paving the way for scalability, creativity and a generative future in retail. Using the Universal Scene Description framework, aka OpenUSD, and the NVIDIA Omniverse platform, Gay, along with Mode Maison Chief Technology Officer Jakub Cech and the Mode Maison team, is helping enhance and digitalize entire product lifecycle processes — from design and manufacturing to consumer experiences.


Register for NVIDIA GTC, which takes place March 17-21, to hear how leading companies are using the latest innovations in AI and graphics. And join us for OpenUSD Day to learn how to build generative AI-enabled 3D pipelines and tools using Universal Scene Description.


They developed a photometric scanning system, called Total Material Appearance Capture, which offers an unbiased, physically based approach to digitizing the material world that’s enabled by real-world embedded sensors.

TMAC captures proprietary data and the composition of any material, then turns it into input that serves as a single source of truth, which can be used for creating a fully digitized retail model. Using the system, along with OpenUSD and NVIDIA Omniverse, Mode Maison customers can create highly accurate digital twins of any material or product.

“By enabling this, we’re effectively collapsing and fostering a complete integration across the entire product lifecycle process — from design and production to manufacturing to consumer experiences and beyond,” said Gay.

Mode Maison developed a photometric scanning system called Total Material Appearance Capture.

Streamlining Workflows and Enhancing Productivity With Digital Twins

Previously, Mode Maison faced significant challenges in creating physically based, highly flexible and scalable digital materials. The limitations were particularly noticeable when rendering complex materials and textures, or integrating digital models into cohesive, multilayered environments.

Using Omniverse helped Gay and his team overcome these challenges by offering advanced rendering capabilities, physics simulations and extensibility for AI training that unlock new possibilities in digital retail.

Before using Omniverse and OpenUSD, Mode Maison used disjointed processes for digital material capture, modeling and rendering, often leading to inconsistencies, the inability to scale and minimal interoperability. After integrating Omniverse, the company experienced a streamlined, coherent workflow where high-fidelity digital twins can be created with greater efficiency and interoperability.

The team primarily uses Autodesk 3ds Max for design and imports the 3D data using Omniverse Connectors. Gay says OpenUSD is playing an increasingly critical role in the company's workflows, especially when developing composable, flexible, interoperable capabilities across asset creation.

This enhanced pipeline starts with capturing high-fidelity material data using TMAC. The data is then processed and formatted into OpenUSD for the creation of physically based, scientifically accurate, high-fidelity digital twins.

“OpenUSD allows for an unprecedented level of collaboration and interoperability in creating complex, multi-layered capabilities and advanced digital materials,” Gay said. “Its ability to seamlessly integrate diverse digital assets and maintain their fidelity across various applications is instrumental in creating realistic, interactive digital twins for retail.”

OpenUSD and Omniverse have accelerated Mode Maison's and its clients' ability to bring products to market, reduced the costs associated with building and modifying digital twins, and enhanced productivity through streamlined creation.

“Our work represents a major step toward a future where digital and physical realities will be seamlessly integrated,” said Gay. “This shift enhances consumer engagement and paves the way for more sustainable business practices by reducing the need for physical prototyping while enabling more precise manufacturing.”

As for emerging technological advancements in digital retail, Gay says AI will play a central role in creating hyper-personalized design, production, sourcing and front-end consumer experiences — all while reducing carbon footprints and paving the way for a more sustainable future in retail.

Join In on the Creation

Anyone can build their own Omniverse extension or Connector to enhance 3D workflows and tools.

Learn more about how OpenUSD and NVIDIA Omniverse are transforming industries at NVIDIA GTC, a global AI conference running March 18-21, online and at the San Jose Convention Center.

Join OpenUSD Day at GTC on Tuesday, March 19, to learn more about building generative AI-enabled 3D pipelines and tools using USD.

Get started with NVIDIA Omniverse by downloading the standard license free, access OpenUSD resources, and learn how Omniverse Enterprise can connect your team. Stay up to date on Instagram, Medium and X. For more, join the Omniverse community on the forums, Discord server, Twitch and YouTube channels.

Read More

Techniques and approaches for monitoring large language models on AWS

Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), improving tasks such as language translation, text summarization, and sentiment analysis. However, as these models continue to grow in size and complexity, monitoring their performance and behavior has become increasingly challenging.

Monitoring the performance and behavior of LLMs is a critical task for ensuring their safety and effectiveness. Our proposed architecture provides a scalable and customizable solution for online LLM monitoring, enabling teams to tailor the monitoring to their specific use cases and requirements. By using AWS services, our architecture provides real-time visibility into LLM behavior and enables teams to quickly identify and address any issues or anomalies.

In this post, we demonstrate a few metrics for online LLM monitoring and their respective architectures for scale using AWS services such as Amazon CloudWatch and AWS Lambda. This offers a customizable solution beyond what is possible with Amazon Bedrock model evaluation jobs.

Overview of solution

The first thing to consider is that different metrics require different computation considerations. A modular architecture, where each module can intake model inference data and produce its own metrics, is necessary.

We suggest that each module intercept incoming inference requests to the LLM and pass prompt and completion (response) pairs to the metric compute modules. Each module is responsible for computing its own metrics with respect to the input prompt and completion (response). These metrics are passed to CloudWatch, which can aggregate them and use CloudWatch alarms to send notifications when specific conditions are met. The following diagram illustrates this architecture.

Fig 1: Metric compute module – solution overview

The workflow includes the following steps:

  1. A user makes a request to Amazon Bedrock as part of an application or user interface.
  2. Amazon Bedrock saves the request and completion (response) in Amazon Simple Storage Service (Amazon S3) according to the invocation logging configuration.
  3. The object saved in Amazon S3 generates an event that triggers a Lambda function, which invokes the metric compute modules (a minimal sketch of this dispatcher follows the list).
  4. The modules post their respective metrics to CloudWatch metrics.
  5. Alarms can notify the development team of unexpected metric values.
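To make steps 3 and 4 concrete, the following is a minimal sketch of the dispatcher Lambda function, assuming Amazon Bedrock invocation logs arrive in Amazon S3 as uncompressed JSON Lines. The field names used to extract the prompt and completion, the metric name, and the CloudWatch namespace are illustrative placeholders; map them to your actual invocation log schema and handle compression if your logs use it.

```python
import json

import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")


def compute_example_metric(prompt: str, completion: str) -> float:
    # Placeholder for a metric compute module (here: completion/prompt length ratio).
    return len(completion) / max(len(prompt), 1)


def handler(event, context):
    # Step 3: the S3 event identifies the newly written invocation log object.
    s3_info = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=s3_info["bucket"]["name"], Key=s3_info["object"]["key"])

    for line in obj["Body"].read().decode("utf-8").splitlines():
        entry = json.loads(line)
        # Field names below are assumptions; adjust them to your invocation log format.
        prompt = entry.get("input", {}).get("inputBodyJson", {}).get("prompt", "")
        completion = entry.get("output", {}).get("outputBodyJson", {}).get("completion", "")

        # Step 4: each module posts its own metric to CloudWatch.
        cloudwatch.put_metric_data(
            Namespace="LLM/Monitoring",  # illustrative namespace
            MetricData=[{
                "MetricName": "CompletionToPromptLengthRatio",
                "Value": compute_example_metric(prompt, completion),
                "Unit": "None",
            }],
        )
```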

The second thing to consider when implementing LLM monitoring is choosing the right metrics to track. Although there are many potential metrics that you can use to monitor LLM performance, we explain some of the broadest ones in this post.

In the following sections, we highlight a few of the relevant module metrics and their respective metric compute module architecture.

Semantic similarity between prompt and completion (response)

When running LLMs, you can intercept the prompt and completion (response) for each request and transform them into embeddings using an embedding model. Embeddings are high-dimensional vectors that represent the semantic meaning of the text. Amazon Titan provides such models through Titan Embeddings. By taking a distance such as cosine between these two vectors, you can quantify how semantically similar the prompt and completion (response) are. You can use SciPy or scikit-learn to compute the cosine distance between vectors. The following diagram illustrates the architecture of this metric compute module.

Fig 2: Metric compute module – semantic similarity

This workflow includes the following key steps:

  1. A Lambda function receives a streamed message via Amazon Kinesis containing a prompt and completion (response) pair.
  2. The function gets an embedding for both the prompt and completion (response), and computes the cosine distance between the two vectors.
  3. The function sends that information to CloudWatch metrics.
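To illustrate steps 2 and 3, here is a minimal sketch of this metric compute module. It assumes the Amazon Titan text embeddings model ID shown below and Kinesis records whose payloads are JSON objects with prompt and completion fields; the model ID, payload parsing, metric name, and CloudWatch namespace are assumptions to adapt to your own pipeline.

```python
import base64
import json

import boto3
from scipy.spatial.distance import cosine

bedrock = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")


def embed(text: str) -> list:
    # Get an embedding vector from Amazon Titan Embeddings (model ID is an assumption).
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


def handler(event, context):
    # Step 1: each Kinesis record is assumed to carry a JSON payload
    # with "prompt" and "completion" fields.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Step 2: embed both texts and compute the cosine distance.
        distance = cosine(embed(payload["prompt"]), embed(payload["completion"]))

        # Step 3: publish the distance to CloudWatch metrics.
        cloudwatch.put_metric_data(
            Namespace="LLM/Monitoring",  # illustrative namespace
            MetricData=[{
                "MetricName": "PromptCompletionCosineDistance",
                "Value": distance,
                "Unit": "None",
            }],
        )
```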

Sentiment and toxicity

Monitoring sentiment allows you to gauge the overall tone and emotional impact of the responses, whereas toxicity analysis provides an important measure of the presence of offensive, disrespectful, or harmful language in LLM outputs. Any shifts in sentiment or toxicity should be closely monitored to ensure the model is behaving as expected. The following diagram illustrates the metric compute module.

Fig 3: Metric compute module – sentiment and toxicity

The workflow includes the following steps:

  1. A Lambda function receives a prompt and completion (response) pair through Amazon Kinesis.
  2. Through AWS Step Functions orchestration, the function calls Amazon Comprehend to detect the sentiment and toxicity.
  3. The function saves the information to CloudWatch metrics.

For more information about detecting sentiment and toxicity with Amazon Comprehend, refer to Build a robust text-based toxicity predictor and Flag harmful content using Amazon Comprehend toxicity detection.
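As a rough sketch of step 2, the following shows how a module might call Amazon Comprehend for sentiment and toxicity and publish the scores. The metric names and CloudWatch namespace are illustrative, and the Step Functions orchestration from the diagram is omitted for brevity.

```python
import boto3

comprehend = boto3.client("comprehend")
cloudwatch = boto3.client("cloudwatch")


def publish_sentiment_and_toxicity(completion: str) -> None:
    # Detect sentiment and toxicity of the completion (response) with Amazon Comprehend.
    sentiment = comprehend.detect_sentiment(Text=completion, LanguageCode="en")
    toxicity = comprehend.detect_toxic_content(
        TextSegments=[{"Text": completion}], LanguageCode="en"
    )

    # Publish the scores to CloudWatch metrics (names and namespace are illustrative).
    cloudwatch.put_metric_data(
        Namespace="LLM/Monitoring",
        MetricData=[
            {
                "MetricName": "NegativeSentimentScore",
                "Value": sentiment["SentimentScore"]["Negative"],
                "Unit": "None",
            },
            {
                "MetricName": "ToxicityScore",
                "Value": toxicity["ResultList"][0]["Toxicity"],
                "Unit": "None",
            },
        ],
    )
```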

Ratio of refusals

An increase in refusals, such as when an LLM denies completion due to lack of information, could mean that either malicious users are trying to use the LLM in ways that are intended to jailbreak it, or that users’ expectations are not being met and they are getting low-value responses. One way to gauge how often this is happening is by comparing standard refusals from the LLM model being used with the actual responses from the LLM. For example, the following are some of Anthropic’s Claude v2 LLM common refusal phrases:

“Unfortunately, I do not have enough context to provide a substantive response. However, I am an AI assistant created by Anthropic to be helpful, harmless, and honest.”

“I apologize, but I cannot recommend ways to…”

“I'm an AI assistant created by Anthropic to be helpful, harmless, and honest.”

On a fixed set of prompts, an increase in these refusals can signal that the model has become overly cautious or sensitive. The inverse case should also be evaluated: a decrease in refusals could signal that the model is now more prone to engage in toxic or harmful conversations.

To help track model integrity and the refusal ratio, we can compare each response with a set of known refusal phrases from the LLM. This could be an actual classifier that can explain why the model refused the request, or a simpler check that takes the cosine distance between the response and known refusal responses from the model being monitored, as sketched after the workflow steps below. The following diagram illustrates this metric compute module.

Fig 4: Metric compute module – ratio of refusals

The workflow consists of the following steps:
  1. A Lambda function receives a prompt and completion (response) and gets an embedding from the response using Amazon Titan.
  2. The function computes the cosine or Euclidean distance between the response and existing refusal prompts cached in memory.
  3. The function sends the resulting distance to CloudWatch metrics.
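The following is a minimal sketch of this module, assuming Amazon Titan embeddings (the same embed() helper as in the semantic-similarity sketch) and a small, hand-curated list of known refusal phrases; the phrases, metric name, and namespace are illustrative rather than an exhaustive or official list.

```python
import json

import boto3
from scipy.spatial.distance import cosine

bedrock = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")


def embed(text: str) -> list:
    # Same Titan Embeddings helper as in the semantic-similarity sketch (model ID assumed).
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


# Known refusal phrases for the monitored model (illustrative list).
KNOWN_REFUSALS = [
    "Unfortunately, I do not have enough context to provide a substantive response.",
    "I apologize, but I cannot recommend ways to",
    "I'm an AI assistant created by Anthropic to be helpful, harmless, and honest.",
]
# Embed the refusal phrases once so they stay cached in memory across invocations.
REFUSAL_EMBEDDINGS = [embed(phrase) for phrase in KNOWN_REFUSALS]


def publish_refusal_distance(completion: str) -> None:
    # Step 2: distance to the closest known refusal; small values suggest a refusal.
    distance = min(cosine(embed(completion), ref) for ref in REFUSAL_EMBEDDINGS)

    # Step 3: publish the distance to CloudWatch metrics (name and namespace are illustrative).
    cloudwatch.put_metric_data(
        Namespace="LLM/Monitoring",
        MetricData=[{"MetricName": "RefusalDistance", "Value": distance, "Unit": "None"}],
    )
```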

Another option is fuzzy matching, a straightforward but less powerful way to compare known refusals against LLM output. Refer to the Python documentation for an example.
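For instance, such a check can be sketched with difflib from the Python standard library; the similarity threshold and refusal phrases below are illustrative and would need tuning against the model you monitor.

```python
from difflib import SequenceMatcher

# Known refusal phrases for the monitored model (illustrative list).
KNOWN_REFUSALS = [
    "I apologize, but I cannot recommend ways to",
    "I'm an AI assistant created by Anthropic to be helpful, harmless, and honest.",
]


def looks_like_refusal(completion: str, threshold: float = 0.8) -> bool:
    # Treat the completion as a likely refusal if it closely matches any known phrase.
    return any(
        SequenceMatcher(None, completion.lower(), phrase.lower()).ratio() >= threshold
        for phrase in KNOWN_REFUSALS
    )


print(looks_like_refusal("I'm an AI assistant created by Anthropic to be helpful, harmless, and honest."))
```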

Summary

LLM observability is a critical practice for ensuring the reliable and trustworthy use of LLMs. Monitoring, understanding, and ensuring the accuracy and reliability of LLMs can help you mitigate the risks associated with these AI models. By monitoring hallucinations, bad completions (responses), and prompts, you can make sure your LLM stays on track and delivers the value you and your users are looking for. In this post, we discussed a few metrics to showcase examples.

For more information about evaluating foundation models, refer to Use SageMaker Clarify to evaluate foundation models, and browse additional example notebooks available in our GitHub repository. You can also explore ways to operationalize LLM evaluations at scale in Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services. Finally, we recommend referring to Evaluate large language models for quality and responsibility to learn more about evaluating LLMs.


About the Authors

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Read More

NVIDIA RTX 500 and 1000 Professional Ada Generation Laptop GPUs Drive AI-Enhanced Workflows From Anywhere

With generative AI and hybrid work environments becoming the new standard, nearly every professional, whether a content creator, researcher or engineer, needs a powerful, AI-accelerated laptop to help them tackle their industry's toughest challenges, even on the go.

The new NVIDIA RTX 500 and 1000 Ada Generation Laptop GPUs will be available in new, highly portable mobile workstations, expanding the NVIDIA Ada Lovelace architecture-based lineup, which includes the RTX 2000, 3000, 3500, 4000 and 5000 Ada Generation Laptop GPUs.

AI is rapidly being adopted to drive efficiencies across professional design and content creation workflows and everyday productivity applications, underscoring the importance of having powerful local AI acceleration and sufficient processing power in systems.

The next generation of mobile workstations with Ada Generation GPUs, including the RTX 500 and 1000 GPUs, will include both a neural processing unit (NPU), a component of the CPU, and an NVIDIA RTX GPU, which includes Tensor Cores for AI processing. The NPU helps offload light AI tasks, while the GPU provides up to an additional 682 TOPS of AI performance for more demanding day-to-day AI workflows.

The higher level of AI acceleration delivered by the GPU is useful for tackling a wide range of AI-based tasks, such as video conferencing with high-quality AI effects, streaming videos with AI upscaling, or working faster with generative AI and content creation applications.

The new RTX 500 GPU delivers up to 14x the generative AI performance for models like Stable Diffusion, up to 3x faster photo editing with AI and up to 10x the graphics performance for 3D rendering compared with a CPU-only configuration — bringing massive leaps in productivity for traditional and emerging workflows.

Enhancing Professional Workflows Across Industries

The RTX 500 and 1000 GPUs elevate workflows with AI for laptop users everywhere in compact designs. Video editors can streamline tasks such as removing background noise with AI. Graphic designers can bring blurry images to life with AI upscaling. Professionals can work on the go while using AI for higher-quality video conferencing and streaming experiences.

For users looking to tap AI for advanced rendering, data science and deep learning workflows, NVIDIA also offers the RTX 2000, 3000, 3500, 4000 and 5000 Ada Generation Laptop GPUs. 3D creators can use AI denoising and deep learning super sampling (DLSS) to visualize photorealistic renders in real time. Businesses can query their internal knowledge base with chatbot-like interfaces using local large language models. And researchers and scientists can experiment with data science, AI model training and tuning, and development projects.

Performance and Portability With NVIDIA RTX

The RTX 500 and 1000 GPUs, based on the NVIDIA Ada Lovelace architecture, bring the latest advancements to thin and light laptops, including:

  • Third-generation RT Cores: Up to 2x the ray tracing performance of the previous generation for high-fidelity, photorealistic rendering.
  • Fourth-generation Tensor Cores: Up to 2x the throughput of the previous generation, accelerating deep learning training, inferencing and AI-based creative workloads.
  • Ada Generation CUDA cores: Up to 30% higher single-precision floating point (FP32) throughput than the previous generation, for significant performance improvements in graphics and compute workloads.
  • Dedicated GPU memory: 4GB GPU memory with the RTX 500 GPU and 6GB with the RTX 1000 GPU allows users to run demanding 3D and AI-based applications, as well as tackle larger projects, datasets and multi-app workflows.
  • DLSS 3: Delivers a breakthrough in AI-powered graphics, significantly boosting performance by generating additional high-quality frames.
  • AV1 encoder: Eighth-generation NVIDIA encoder, aka NVENC, with AV1 support is up to 40% more efficient than H.264, enabling new possibilities for broadcasting, streaming and video calling.

Availability

The new NVIDIA RTX 500 and 1000 Ada Generation Laptop GPUs will be available this spring in mobile workstations from global manufacturing partners including Dell Technologies, HP, Lenovo and MSI.

Learn more about the latest NVIDIA RTX Laptop GPUs.

Read More