How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

This post is co-written with Aishwarya Gupta, Apurva Gawad, and Oliver Cody from Twilio.

Today’s leading companies trust Twilio’s Customer Engagement Platform (CEP) to build direct, personalized relationships with their customers everywhere in the world. Twilio enables companies to use communications and data to add intelligence and security to every step of the customer journey, from sales and marketing to growth, customer service, and many more engagement use cases in a flexible, programmatic way. Across 180 countries, millions of developers and hundreds of thousands of businesses use Twilio to create personalized experiences for their customers. As one of the largest AWS customers, Twilio engages with data, artificial intelligence (AI), and machine learning (ML) services to run their daily workloads.

Data is the foundational layer for all generative AI and ML applications. Managing and retrieving the right information can be complex, especially for data analysts working with large data lakes and complex SQL queries. To address this, Twilio partnered with AWS to develop a virtual assistant that helps their data analysts find and retrieve relevant data from Twilio’s data lake by converting user questions asked in natural language to SQL queries. This virtual assistant tool uses Amazon Bedrock, a fully managed generative AI service that provides access to high-performing foundation models (FMs) and capabilities like Retrieval Augmented Generation (RAG). RAG optimizes language model outputs by extending the models’ capabilities to specific domains or an organization’s internal data for tailored responses.

This post highlights how Twilio enabled natural language-driven data exploration of business intelligence (BI) data with RAG and Amazon Bedrock.

Twilio’s use case

Twilio wanted to provide an AI assistant to help their data analysts find data in their data lake. They used the metadata layer (schema information) over their data lake consisting of views (tables) and models (relationships) from their data reporting tool, Looker, as the source of truth. Looker is an enterprise platform for BI and data applications that helps data analysts explore and share insights in real time.

Twilio implemented RAG using Anthropic Claude 3 on Amazon Bedrock to develop a virtual assistant tool called AskData for their data analysts. This tool converts questions from data analysts asked in natural language (such as “Which table contains customer address information?”) into a SQL query using the schema information available in Looker Modeling Language (LookML) models and views. The analysts can run this generated SQL directly, saving them the time to first identify the tables containing relevant information and then write a SQL query to retrieve the information.

The AskData tool provides ease of use and efficiency to its users:

  • Users need accurate information about the data in a quick and accessible manner to make business decisions. Providing a tool to minimize their time spent finding tables and writing SQL queries allows them to focus more on business outcomes and less on logistical tasks.
  • Users typically reach out to the engineering support channel when they have questions about data that is deeply embedded in the data lake or if they can’t access it using various queries. Having an AI assistant can reduce the engineering time spent in responding to these queries and provide answers more quickly.

Solution overview

In this post, we show you a step-by-step implementation and design of the AskData tool designed to serve as an AI assistant for Twilio’s data analysts. We discuss the following:

  • How to use a RAG approach to retrieve the relevant LookML metadata corresponding to users’ questions with the help of efficient data chunking and indexing, and how to generate SQL queries from natural language
  • How to select the optimal large language model (LLM) for your use case from Amazon Bedrock
  • How analysts can query the data using natural language questions
  • The benefits of using RAG for data analysis, including increased productivity and reduced engineering overhead of finding the data (tables) and writing SQL queries

This solution uses Amazon Bedrock, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon Simple Storage Service (Amazon S3). The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. An end-user (data analyst) asks a question in natural language about the data that resides within a data lake.
  2. The question is combined with metadata (schema information) stored in Amazon RDS and conversation history stored in DynamoDB to personalize retrieval for the user:
    • The RDS database (PostgreSQL with pgvector) stores the LookML tables and views as embeddings that are retrieved through a vector similarity search.
    • The DynamoDB table stores the previous conversation history with this user.
  3. The context and the natural language question are passed to an FM (in this case, Anthropic Claude 3 Haiku) on Amazon Bedrock, which responds with a personalized SQL query that the user can run to retrieve accurate information from the data lake. The following is the prompt template used to generate the SQL query:
Human: The context information below represents the LookML data for Looker views and models. 
Using this context data, please generate a presto SQL query that will return the correct result for the user's question. 
Please provide a SQL query with the correct syntax, table names, and column names based on the provided LookML data.

<instructions>

1. Use the correct underlying SQL table names (table name in sql_table_name) 
and column names (use column names from the dimensions of the view as they are the correct column names). 
Use the following as an example:

{{example redacted}}

2. Join tables as necessary to get the correct result. 
- Avoid unnecessary joins if not explicitly requested by the user.

3. Avoid unnecessary filters if not explicitly requested by the user.

4. If the view has a derived table, use the derived query to answer question 
using table names and column names from derived query. Use the following as an example:

{{example redacted}}

5. The schema name is represented as <schema>.<table_name> within the LookML views. 
Use the existing schema name or "public" as the schema name if no schema is specified.

</instructions>

This is the chat history from previous messages:

<chat_history>

{chat_history}

</chat_history>

<context>

{context}

</context>

This is the user question:

<question>

{question}

</question>

Assistant: Here is a SQL query for the user question:

The solution comprises four main steps:

  1. Use semantic search on LookML metadata to retrieve the relevant tables and views corresponding to the user questions.
  2. Use FMs on Amazon Bedrock to generate accurate SQL queries based on the retrieved table and view information.
  3. Create a simple web application using LangChain and Streamlit.
  4. Refine the application using techniques such as prompt engineering, tuning inference parameters, and enriching the LookML content.

Prerequisites

To implement the solution, you should have an AWS account, model access to your choice of FM on Amazon Bedrock, and familiarity with DynamoDB, Amazon RDS, and Amazon S3.

Access to Amazon Bedrock FMs isn’t granted by default. To gain access to an FM, an AWS Identity and Access Management (IAM) user with sufficient permissions needs to request access to it through the Amazon Bedrock console. After access is provided to a model, it is available for the users in the account.

To manage model access, choose Model access in the navigation pane on the Amazon Bedrock console. The model access page lets you view a list of available models, the output modality of the model, whether you have been granted access to it, and the End User License Agreement (EULA). You should review the EULA for terms and conditions of using a model before requesting access to it. For information about model pricing, refer to Amazon Bedrock pricing.
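
The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 SDK and an IAM principal with Bedrock permissions, that lists the Anthropic foundation models visible to your account; requesting model access itself still happens through the Amazon Bedrock console as described above:

# Minimal sketch (not from the original post): list Anthropic foundation models
# visible to the account. Access requests are made in the Bedrock console.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # Bedrock control-plane client

response = bedrock.list_foundation_models(byProvider="Anthropic")
for model in response["modelSummaries"]:
    print(model["modelId"], "-", model.get("modelName", ""))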

Model access

Structure and index the data

In this solution, we use the RAG approach to retrieve the relevant schema information from LookML metadata corresponding to users’ questions and then generate a SQL query using this information.

This solution uses two separate collections that are created in our vector store: one for Looker views and another for Looker models. We used the sentence-transformers/all-mpnet-base-v2 model for creating vector embeddings and PostgreSQL with pgvector as our vector database. As long as the LookML file doesn’t exceed the context window of the LLM used to generate the final response, we don’t split the file into chunks and instead pass the file in its entirety to the embeddings model. The vector similarity search is able to find the correct files that contain the LookML tables and views relevant to the user’s question. We can pass the entire LookML file contents to the LLM, taking advantage of its large context window, and the LLM is able to pick the schemas for the relevant tables and views to generate the SQL query.
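
As an illustration of this ingestion step, the following is a minimal sketch, assuming LangChain’s community integrations, the sentence-transformers/all-mpnet-base-v2 model, and a PostgreSQL instance with the pgvector extension; the file paths, collection names, and connection string are placeholders, not values from the original implementation:

# Minimal sketch: embed whole LookML view files (no chunking) into a pgvector collection.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import PGVector
from langchain_community.document_loaders import DirectoryLoader, TextLoader

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Each LookML file stays whole and becomes a single document and a single embedding.
view_docs = DirectoryLoader("lookml/views", glob="**/*.view.lkml", loader_cls=TextLoader).load()

CONNECTION_STRING = "postgresql+psycopg2://user:password@rds-endpoint:5432/vectordb"  # placeholder

PGVector.from_documents(
    documents=view_docs,
    embedding=embeddings,
    collection_name="lookml_views",  # a second collection, lookml_models, is built the same way
    connection_string=CONNECTION_STRING,
)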

The two subsets of LookML metadata provide distinct types of information about the data lake. Views represent individual tables, and models define the relationships between those tables. By separating these components, we can first retrieve the relevant views based on the user’s question, and then use those results to identify the associated models that capture the relationships between the retrieved views.

This two-step procedure provides a more comprehensive understanding of the relevant tables and their relationships to the user’s question. The following diagram shows how both subsets of metadata are ingested and stored as embeddings in separate collections for enhanced retrieval. The LookML view and model information is brought into Amazon S3 through a separate data pipeline (not shown).

Content ingestion into vector db
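
The following sketch illustrates this two-step retrieval, continuing from the ingestion sketch above; the collection names, metadata key, and values of k are assumptions for illustration:

# Minimal sketch of two-step retrieval: views first, then the models that relate them.
view_store = PGVector(
    collection_name="lookml_views",
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)
model_store = PGVector(
    collection_name="lookml_models",
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)

question = "Which table contains customer address information?"

# Step 1: retrieve the views (tables) most similar to the question.
relevant_views = view_store.similarity_search(question, k=4)

# Step 2: use the retrieved views to find the models describing their relationships.
view_names = " ".join(doc.metadata.get("view_name", "") for doc in relevant_views)
relevant_models = model_store.similarity_search(f"{question} {view_names}", k=2)

# Combined context passed to the LLM alongside the question and chat history.
context = "\n\n".join(doc.page_content for doc in relevant_views + relevant_models)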

Select the optimal LLM for your use case

Selecting the right LLM for any use case is essential. Every use case has different requirements for context length, token size, and the ability to handle various tasks like summarization, task completion, chatbot applications, and so on. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

This solution is implemented using Anthropic Claude 3, available through Amazon Bedrock. Anthropic Claude 3 is chosen for two main reasons:

  • Increased context window – Anthropic Claude 3 can handle up to 200,000 tokens in its context, allowing for processing larger LookML queries and tables. This expanded capacity is crucial when dealing with complex or extensive data, so the LLM has access to the necessary information for accurate and informed responses to the user.
  • Enhanced reasoning abilities – Anthropic Claude 3 demonstrates enhanced performance when working with larger contexts, enabling it to better understand and respond to user queries that require a deeper comprehension of the views, models, and their relationships. You can gain granular control over the reasoning capabilities using several prompt engineering techniques.

Build a web application

This solution uses LangChain and Streamlit to build a web application and integrate Amazon Bedrock into it. LangChain is a framework specifically designed to simplify the creation of applications using LLMs, and it’s straightforward to call Amazon Bedrock through LangChain’s built-in Bedrock integration. We use Streamlit to develop the frontend for this web application.

So that data analysts can effortlessly interact with the assistant and obtain queries to extract relevant data from their data lake, this solution implements a chat engine using the ConversationalRetrievalChain mechanism, which lets you pass a custom vector store retriever, prompt, and conversation history to the LLM to generate personalized answers to user questions. To store the chat history, we use DynamoDB with the user session ID as the primary key. DynamoDB is a highly scalable and durable NoSQL database service, enabling you to efficiently store and retrieve chat histories for multiple user sessions concurrently. The following screenshot shows an example of the chat interface developed using Streamlit.
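
A minimal sketch of such a chat engine follows, assuming LangChain’s Bedrock and DynamoDB integrations and reusing the retriever from the earlier sketches; the table name, session ID, and model settings are illustrative rather than Twilio’s production values:

# Minimal sketch: ConversationalRetrievalChain with Claude 3 Haiku on Bedrock and
# DynamoDB-backed chat history keyed by the user's session ID.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from langchain_aws import ChatBedrock  # BedrockChat in older langchain_community versions

llm = ChatBedrock(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_kwargs={"temperature": 0.1, "max_tokens": 2048},
)

session_id = "analyst-session-001"  # placeholder per-user session identifier
history = DynamoDBChatMessageHistory(table_name="askdata-chat-history", session_id=session_id)
memory = ConversationBufferMemory(
    memory_key="chat_history", chat_memory=history, return_messages=True
)

# combine_docs_chain_kwargs can be used to pass the SQL-generation prompt shown earlier.
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=view_store.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
)

result = chain.invoke({"question": "How do I find a customer address?"})
print(result["answer"])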

AskData user interface
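
For reference, a chat frontend like the one in the screenshot could be sketched in Streamlit roughly as follows; this is an illustrative skeleton rather than the actual AskData code, and it assumes the chain object from the previous sketch:

# Minimal Streamlit chat skeleton wrapping the chain from the previous sketch.
import streamlit as st

st.title("AskData")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if question := st.chat_input("Ask a question about the data lake"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    result = chain.invoke({"question": question})
    answer = result["answer"]

    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)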

The following is an example scenario to get an idea of the user workflow when interacting with AskData:

User Question - how do I find a customer address?

AskData Response - Based on the provided LookML data, 
you can find the customer address by querying the raw_mailing_addresses table, 
which contains address-related information. 
Here is the SQL query to retrieve the customer address:

SELECT
    account,
    address1,
    address2,
    city,
    province,
    postal_code,
    iso_country
FROM
    raw_mailing_addresses

This query will return the account SID (which can be used to identify the customer), the first line of the address, the second line of the address (if any), the city, the province or state, the postal code, and the ISO country code for all entries in the raw_mailing_addresses table. If you have a specific customer’s account and you want to retrieve the address for that customer, you can add a WHERE clause to the query:

SELECT
    account,
    address1,
    address2,
    city,
    province,
    postal_code,
    iso_country
FROM
    raw_mailing_addresses
WHERE
    account = 'YourCustomerAccount' -- Replace with the actual account

Replace 'YourCustomerAccount' with the actual account of the customer whose address you want to find.

Optimize the application

Although using an LLM to answer user questions about data is efficient, it comes with recognized limitations, such as the tendency of LLMs to generate inaccurate responses due to hallucination. To enhance the accuracy of our application and reduce hallucinations, we did the following:

  • Set the temperature for the LLM to 0.1 to reduce the LLM’s propensity for overly creative responses (see the sketch after this list).
  • Added instructions in the prompt to only generate the SQL query based on the context (schema, chat history) being provided in the prompt.
  • Meticulously removed duplicate and redundant entries from the LookML data before it was ingested into the vector database.
  • Added a user experience feedback (a rating from 1–5 with an optional text input for comments) as part of the UI of AskData. We used the feedback to improve the quality of our data, prompts, and inference parameter settings.
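
As an illustration of the first two points, the following sketch shows a direct Bedrock Runtime invocation of Claude 3 Haiku with a low temperature; the request shape follows the Anthropic Messages API on Bedrock, and the prompt variable stands in for the full template shown earlier:

# Minimal sketch: invoke Claude 3 Haiku through the Bedrock Runtime API with temperature 0.1.
import json
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt = "Which table contains customer address information?"  # placeholder for the full prompt template with context

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "temperature": 0.1,  # low temperature to keep SQL generation deterministic
    "messages": [{"role": "user", "content": prompt}],
}

response = runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
generated_sql = json.loads(response["body"].read())["content"][0]["text"]
print(generated_sql)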

Based on user feedback, the application achieved a net promoter score (NPS) of 40, surpassing the initial target of 35. We set this target due to the following key factors: the lack of relevant information for specific user questions within the LookML data, specific rules related to the structure of SQL queries that might need to be added, and the expectation that the LLM would occasionally make a mistake in spite of all the measures we put in place.

Conclusion

In this post, we illustrated how to use generative AI to significantly enhance the efficiency of data analysts. By using LookML as metadata for our data lake, we constructed vector stores for views (tables) and models (relationships). With the RAG framework, we efficiently retrieved pertinent information from these stores and provided it as context to the LLM alongside user queries and any previous chat history. The LLM then seamlessly generated SQL queries in response.

Our development process was streamlined thanks to various AWS services, particularly Amazon Bedrock, which facilitated the integration of the LLM for query responses, and Amazon RDS, which served as our vector store.

Get started with Amazon Bedrock today, and leave your feedback and questions in the comments section.


About the Authors

Apurva Gawad is a Senior Data Engineer at Twilio specializing in building scalable systems for data ingestion and empowering business teams to derive valuable insights from data. She has a keen interest in AI exploration, blending technical expertise with a passion for innovation. Outside of work, she enjoys traveling to new places, always seeking fresh experiences and perspectives.

Aishwarya Gupta is a Senior Data Engineer at Twilio focused on building data systems to empower business teams to derive insights. She enjoys traveling and exploring new places, foods, and cultures.

Oliver Cody is a Senior Data Engineering Manager at Twilio with over 28 years of professional experience, leading multidisciplinary teams across EMEA, NAMER, and India. His experience spans all things data across various domains and sectors. He has focused on developing innovative data solutions, significantly optimizing performance and reducing costs.

Amit Arora is an AI and ML specialist architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Johnny Chivers is a Senior Solutions Architect working within the Strategic Accounts team at AWS. With over 10 years of experience helping customers adopt new technologies, he guides them through architecting end-to-end solutions spanning infrastructure, big data, and AI.

Read More

Figure Unveils Next-Gen Conversational Humanoid Robot With 3x AI Computing for Fully Autonomous Tasks

Silicon Valley’s Figure has taken the wraps off its next-generation Figure 02 conversational humanoid robot, which taps into NVIDIA Omniverse and NVIDIA GPUs for fully autonomous tasks.

Figure said it recently tested Figure 02 for data collection and use-case training at BMW Group’s Spartanburg, South Carolina, production line.

Figure 02 comes just 10 months after Figure launched the first version of its general-purpose humanoid robot. The company has accelerated its development timeline using NVIDIA Isaac Sim — a reference application built on the NVIDIA Omniverse platform — to design, train and test AI-based robots using synthetic data, as well as NVIDIA GPUs to train generative AI models.

“Our rapid progress, marked by advances in speech, vision, dexterity and computational power, brings us closer to delivering humanoid robots to address labor shortages for many industries,” said Brett Adcock, CEO of Figure.

The company added a second NVIDIA RTX GPU-based module on board Figure 02, which supplies 3x inference gains for handling fully autonomous real-world AI tasks compared with the robot’s first iteration.

Figure aims to commercialize industrial humanoid robots to address labor shortages, and it plans to produce consumer versions.

Founded in 2022, the startup is partnered with OpenAI to develop custom AI models, trained on NVIDIA H100 GPUs, that drive the robots’ conversational AI capabilities. Figure recently raised $675 million in funding from leading technology companies including NVIDIA.

“Developing autonomous humanoid robots requires the fusion of three computers: NVIDIA DGX for AI training, NVIDIA Omniverse for simulation and NVIDIA Jetson in the robot,” said Deepu Talla, vice president of robotics and edge computing at NVIDIA. “Leading companies, including Figure, are tapping into the NVIDIA robotics stack, from edge to cloud, to drive innovation in humanoid robotics.”

Robotic Hands Capable of Handling Real-World Tasks

New human-scale hands, six RGB cameras, and perception AI models trained with synthetic data generated in Isaac Sim enable Figure 02 to perform high-precision pick-and-place tasks required for smart manufacturing applications.

Figure is among the initial members to join the new NVIDIA Humanoid Robot Developer Program, which provides early access to the latest tools and computing technologies for humanoid robot development. This includes the latest releases of NVIDIA Isaac Sim, Isaac Lab, NIM microservices (RoboCasa and MimicGen), OSMO, Jetson Thor and Project GR00T general-purpose humanoid foundation models.

Read More

GeForce NOW Celebrates 2,000 Games in the Cloud

This GFN Thursday marks 2,000 games in the GeForce NOW library, with five new games joining this week, alongside a demo for Square Enix’s Visions of Mana and a new reward for members playing Elder Scrolls Online.

From epic role-playing games (RPGs) to heart-pounding shooters, the GeForce NOW library offers a variety of adventures for members to dive into anytime, anywhere.

There’s more to come — the highly anticipated action RPG Black Myth: Wukong from Game Science will soon be available for members to stream when it comes to the cloud at launch on Tuesday, Aug. 20.

Plus, gamers looking to try GeForce NOW can lock in a one- or six-month Priority or Ultimate membership at half price with the limited-time summer sale.

More Choices Than a Buffet

GeForce NOW has achieved the remarkable milestone of over 2,000 games supported in the cloud.

They’re all playable across devices, including PCs, Macs, NVIDIA SHIELD TVs, select Samsung and LG Smart TVs, mobile devices and handheld consoles like the ROG Ally and Steam Deck. Visually stunning games like Cyberpunk 2077 and Alan Wake 2 can be played at max settings — all from the comfort of the couch or while on the go.

2,000 games library on GeForce NOW.
2,000 reasons to game on.

Thanks to collaborations with renowned publishers like Blizzard, Capcom, Epic Games and Square Enix, as well as indie studios including Coffee Stain Studios, Re-Logic and Team 17, GeForce NOW is opening the door for more gamers to enjoy incredible gaming experiences in new ways, streaming across their favorite devices.

Since GeForce NOW supports games that members already own, there’s no need to repurchase titles to enjoy them in the cloud. And features like game library-syncing with Steam and Ubisoft Connect make it seamless to jump into gameplay without the hassle of installation or updates.

2,000 games on GeForce NOW.

The GeForce NOW library continues to expand each week with titles from popular digital stores like Battle.net, Epic Games Store, GOG.com, Steam, Ubisoft Connect and Xbox, and supports over 120 PC Game Pass titles — so gear up and game on.

Khajiit Has Wares … in the Cloud

ESO member reward on GeForce NOW
It’s not just a snack — it’s a statement.

It’s time for a new reward. GeForce NOW members playing Elder Scrolls Online can add a touch of sophistication to their journeys with a free Noble Snack emote, letting their character savor a regal treat in the heart of Tamriel.

Members who’ve opted in to GeForce NOW’s Rewards program can check their email for instructions on how to redeem the reward. Ultimate and Priority members can redeem the reward today, while free members can claim it starting on Friday, Aug. 9. It’s available through Sunday, Sept. 8 — first come, first served.

Here Come the Games

GeForce NOW members can today play a demo of Square Enix’s upcoming action RPG Visions of Mana before it launches in the cloud on Thursday, Aug. 29. Experience snippets of the story, the battles and the game’s focus on elemental powers — hallmarks of the Mana series. Though the title’s bonus items can’t be obtained through the demo in the cloud, members can get the items later by playing the full game when it launches on GeForce NOW.

Members can also look for the following this week:

  • Warhammer 40,000: Speed Freeks (New release on Steam, Aug. 6)
  • Prince of Persia: The Lost Crown (New release on Steam, Aug. 8)
  • Ratten Reich (New release on Steam, Aug. 9)
  • Nine Sols (Steam)
  • Visions of Mana Demo (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Collaborators: AI and the economy with Brendan Lucier and Mert Demirer

Headshots of Brendan Lucier and Mert Demirer for the Microsoft Research Podcast

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

What can the breakdown of jobs into their specific tasks tell us about the long-term impact of AI on the economy? Microsoft Senior Principal Researcher Brendan Lucier and MIT Assistant Professor Mert Demirer are combining their expertise in micro- and macroeconomics, respectively, to build a framework for answering the question and ultimately helping the world prepare for and responsibly steer the course of disruption accompanying the technology. In this episode, they share how their work fits into the Microsoft research initiative AI, Cognition, and the Economy, or AICE; how the evolution of the internet may indicate the best is yet to come for AI; and their advice for budding AI researchers.

Transcript 

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

BRENDAN LUCIER: What we’re doing here is a prediction problem. And when we were trying to look into the future this way, one way we do that is we try to get as much information as we can about where we are right now. And so we were lucky to have, like, a ton of information about the current state of the economy and the labor market and some short-term indicators on how generative AI seems to be, sort of, affecting things right now, in this moment. And then the idea is to layer some theory models on top of that to try to extrapolate forward, right, in terms of what might be happening, sort of get a glimpse of this future point. 

MERT DEMIRER: So this is a prediction problem that we cannot use machine learning, AI. Otherwise, it would have been a very easy problem to solve. So what you need instead is a model or, like, framework that will take, for example, inputs of the productivity gains or for, like, microfoundation as an input and then generate predictions for the entire economy. 


[TEASER ENDS] 

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC FADES] 

On today’s episode, I’m talking to Dr. Brendan Lucier, a senior principal researcher in the economics and computation group at Microsoft Research, and Dr. Mert Demirer, an assistant professor of applied economics at the MIT Sloan School of Management. Brendan and Mert are exploring the economic impact of job automation and generative AI as part of Microsoft’s AI, Cognition, and the Economy, or AICE, research initiative. And since they’re part of the AICE Accelerator Pilot collaborations, let’s get to know our collaborators. Brendan, let’s start with you and your “business address,” if you will. Your research lives at the intersection of microeconomic theory and theoretical computer science. So tell us what people—shall we call them theorists?—who live there do and why they do it! 

BRENDAN LUCIER: Thank you so much for having me. Yeah, so this is a very interdisciplinary area of research that really gets at, sort of, this intersection of computation and economics. And what it does is it combines the ideas from algorithm design and computational complexity that we think of when we’re building algorithmic systems with, sort of, the microeconomic theory of how humans will use those systems and how individuals make decisions, right. How their goals inform their actions and how they interact with each other. And where this really comes into play is in the digital economy and platforms that we, sort of, see online that we work with on an everyday basis, right. So we’re increasingly interacting with algorithms as part of our day-to-day life. So we use them to search for information; we use them to find rides and find jobs and have recommendations on what products we purchase. And as we do these things online, you know, some of the algorithms that go into this, like, help them grow into these huge-scale, you know, internet-sized global platforms. But fundamentally, these are still markets, right. So even though there’s a lot of algorithms and a lot of computational ideas that go into these, really what they’re doing is connecting human users to the goods and the services and to each other over the course of what they need to do in their day-to-day life, right. And so this is where this microeconomic view really comes into play. So what we know is that when people are interacting with these platforms to get at what they want, they’re going to be strategic about this, right. So people are always going to use tools in the ways that, sort of, work best for them, right, even if that’s not what the designer has in mind. And so when we’re designing algorithms, in a big way, we’re not necessarily designing solutions; we’re designing the rules of a game that people are going to end up playing with the platform or with each other.

HUIZINGA: Wow. 

LUCIER: And so a big part of, sort of, what we do in this area is that if we’re trying to understand the impact of, like, a technology change or a new platform that we’re going to design, we need to understand what it is that the users want and how they’re going to respond to that change when they interact with it. 

HUIZINGA: Right.

LUCIER: When we think about, sort of, microeconomic theory, a lot of this is, you know, ideas from game theory, ideas about how it is that humans make decisions, either on their own or in interaction with each other, right.

HUIZINGA: Yeah.

LUCIER: So when I’m in the marketplace, maybe I’m thinking not only about what’s best for me, but, sort of, I’m anticipating maybe what other people are going to be doing, as well. And I really need to be thinking about how the algorithms that make up the fundamentals of those marketplaces are going to influence the way people are thinking about not only what they’re doing but what other people are doing. 

HUIZINGA: Yeah, this is so fascinating because even as you started to list the things that we use algorithms—and we don’t even think about it—but we look for a ride, a job, a date. All of these things that are part of our lives have become algorithmic! 

LUCIER: Absolutely. And it’s fascinating that, you know, when we think about, you know, someone might launch a new algorithm, a new advance to these platforms, that looks on paper like it’s going to be a great improvement, assuming that people keep behaving the way they were behaving before. But of course, people will naturally respond, and so there’s always this moving target of trying to anticipate what it is that people actually are really trying to do and how they will adapt. 

HUIZINGA: We’re going to get into that so deep in a few minutes. But first, Mert, you are an assistant professor of economics at MIT’s famous Sloan School of Management, and your homepage tells us your research interests include industrial organization and econometrics. So unpack those interests for our listeners and tell us what you spend most of your time doing at the Sloan School. 

MERT DEMIRER: Thank you so much for having me. My name is Mert Demirer. I am an assistant professor at MIT Sloan, and I spend most of my time doing research and teaching MBAs. And in my research, I’m an economist, so I do research in a field called industrial organization. And the overarching theme of my research is firms and firm productivity. So in my research, I ask questions like, what makes firms more productive? What are the determinants of firm growth, or how do industries evolve over time? So what I do is I typically collect data from firms, and I use some econometric model or sometimes a model of industrial or the firm model and then I answer questions like these. And more recently, my research focused on new emerging technologies and how firms use these emerging technologies and what are the productivity effect of these new technologies. And I, more specifically, I did research on cloud computing, which is a really important technology … 

HUIZINGA: Yeah … 

DEMIRER: … transforming firms and industries. And more recently, my research focuses on AI, both, like, the adoption of AI and the productivity impact of AI. 

HUIZINGA: Right, right. You know, even as you say it, I’m thinking, what’s available data? What’s good data? And how much data do you need to make informed analysis or decisions? 

DEMIRER: So finding good data is a challenge in this research. In general, there are, like, official data sources like census or, like, census of manufacturers, which have been commonly used in productivity research. That data is very comprehensive and very useful. But of course, if you want to get into the details of, like, new technologies and, like, granular firm analysis, that’s not enough. So what I have been trying to do more recently is to find industry partners which have lots of good data on other firms. 

HUIZINGA: Gotcha. 

DEMIRER: So these are typically the main data sources I use. 

HUIZINGA: You know, this episode is part of a little series within a series we’re doing on AI, Cognition, and the Economy, and we started out with Abi Sellen from the Cambridge, UK, lab, who gave us an overview of the big ideas behind the initiative. And you’re going to give us some discussion today on AI and, specifically, the economy. But before we get into your current collaboration, let’s take a minute to “geolocate” ourselves in the world of economics and how your work fits into the larger AICE research framework. So, Brendan, why don’t you situate us with the “micro” view and its importance to this initiative, and then Mert can zoom out and talk about the “macro” view and why we need him, too. 

LUCIER: Yeah, sure. Yeah, I just, I just love this AICE program and the way that it puts all this emphasis on how human users are interacting with AI systems and tools, and this is really, like, a focal point of a lot of this, sort of, micro view, also. So, like, from this econ starting point of microeconomics, one place I think of is imagining how users would want to integrate AI tools into their day-to-day, right—into both their workflow as part of their jobs; in terms of, sort of, what they’re doing in their personal lives. And when we think about how new tools like AI tech, sort of, comes into those workflows, an even earlier question is, how is it that users are organizing what they do into individual tasks and, like, why are they doing them that way in the first place, right? So when we want to think about, you know, how it is that AI might come in and help them with pain points that they’re dealing with, we, sort of, need to understand, like, what it is they’re trying to accomplish and what are the goals that they have in mind. And this is super important when we’re trying to build effective tools because we need to understand how they’ll change their behavior or adjust to incorporate this new technology and trying to zoom into that view. 

HUIZINGA: Yeah. Mert, tell us a little bit more about the macro view and why that’s important in this initiative, as well. 

DEMIRER: Macro view is very complementary to micro view, and it takes a more holistic approach and analyzes the economy with its components rather than focusing on individual components. So instead of focusing on one component and analyze the collectivity effect of AI on a particular, like, occupation or sector, you just analyze this whole economy and you model the interactions between these components. And this holistic view is really essential if you want to understand AI because this is going to allow you to make, like, long-term projections and it’s going to help you understand how AI is going to affect, like, the entire economy. And to make things, like, more concrete—and going back to what Brendan said—that suppose you analyze a particular task or you figured out how AI saw the pain point and it increased the productivity by like x amount, so that impact on that occupation or, let’s say, the industry won’t be limited to that industry, right? The wage is going to change in this industry, but it’s going to affect other industries, potentially, like, labor from one industry which is affected significantly by AI to other industries, and, like, maybe new firms are going to emerge, some firms are going to exit, and so on. So this holistic view, it essentially models all of these components in just one system and also tries to understand the interactions between those. And as I said, this is really helpful because first of all, this helps you to make long-term projections about AI, how AI is going to impact the economy. And second, this is going to let you go beyond the first-order impact. Because you can essentially look at what’s going on and analyze or measure the first-order impact, but if you want to get the second- or third-order impact, then you need a framework or you need a bigger model. And typically, those, like, second- or third-order effects are typically the unintended effects or the hidden effects. 

HUIZINGA: Right. 

DEMIRER: And that’s why this, like, more holistic approach is useful, particularly for AI. 

HUIZINGA: Yeah, I got to just say right now, I feel like I wanted to sit down with you guys for, like, a couple hours—not with a microphone—but just talking because this is so fascinating. And Abi Sellen mentioned this term “line of sight” into future projections, which was sort of an AICE overview goal. Interestingly, Mert, when you mentioned the term productivity, is that the metric? Is productivity the metric that we’re looking to in terms of this economic impact? It seems to be a buzzword, that we need to be “more productive.” Is that, kind of, a framework for your thinking? 

DEMIRER: I think it is an important component. It’s an important component how we should analyze and think about AI because again, like, when you zoom into, like, the micro view of, like, how AI is going to affect my day-to-day work, that is, like, very natural to think that in terms of, like, productivity—oh, I saved, like, half an hour yesterday by using, like, AI. And, OK, that’s the productivity, right. That’s very visible. Like, that’s something you can see, something you can easily measure. But that’s only one component. So you need to understand how that productivity effect is going to change other things. 

HUIZINGA: Right! 

LUCIER: Like how I am going to spend the additional time, whether I’m going to spend that time for leisure or I’m going to do something else. 

HUIZINGA: Right. 

DEMIRER: In that sense, I think productivity is an important component, and maybe it is, like, the initial point to analyze these technologies. But we will definitely go beyond the productivity effect and understand how these, like, potential productivity effects are going to affect, like, the other parts of the economy and how the agents—like firms, people—are going to react to that potential productivity increase. 

HUIZINGA: Yeah, yeah, in a couple questions I’ll ask Brendan specifically about that. But in the meantime, let’s talk about how you two got together on this project. I’m always interested in that story. This question is also known as “how I met your mother.” And the meetup stories are often quite fun and sometimes surprising. In fact, last week, one person told his side of the story, and the other guy said, hey, I didn’t even know that! [LAUGHS] So, Brendan, tell us your side of who called who and how it went down, and then Mert can add his perspective. 

LUCIER: Great. So, yeah, so I’ve known Mert for quite some time! Mert joined our lab as a—the Microsoft Research New England lab—as an intern some years ago and then as a postdoc in, sort of, 2020, 2021. And so over that time, we got to know each other quite well, and I knew a lot about the macroeconomic work that Mert was doing. And so then, fast-forward to more recently, you know, this particular project initially started as discussions between myself and my colleague Nicole Immorlica at Microsoft Research and John Horton, who’s an economist at MIT who was visiting us as a visiting researcher, and we were discussing how the structure of different jobs and how those jobs break down into tasks might have an impact on how they might be affected by AI. And then very early on in that conversation, we, sort of, realized that, you know, this was really a … not just, like, a microeconomic question; it’s not just a market design question. The, sort of, the macroeconomic forces were super important. And then immediately, we knew, OK, like, Mert’s top of our list; we need, [LAUGHTER] we need, you know, to get Mert in here and talking to us about it. And so we reached out to him. 

HUIZINGA: Mert, how did you come to be involved in this from your perspective? 

DEMIRER: As Brendan mentioned, I spent quite a bit of time at Microsoft Research, both as an intern and as a postdoc, and Microsoft Research is a very, like, fun place to be as an economist and a really productive place to be as an economist because it’s very, like, interdisciplinary. It is a lot different from a typical academic department and especially an economics academic department. So my time at Microsoft Research has already led to a bunch of, like, papers and collaborations. And then when Brendan, like, emailed me with the research question, I thought it’s, like, no-brainer. It’s an interesting research question, like part of Microsoft Research. So I said, yeah, let’s do it! 

HUIZINGA: Brendan, let’s get into this current project on the economic impact of automation and generative AI. Such a timely and fascinating line of inquiry. Part of your research involves looking at a lot of current occupational data. So from the vantage point of microeconomic theory and your work, tell us what you’re looking at, how you’re looking at it, and what it can tell us about the AI future. 

LUCIER: Fantastic. Yeah, so in some sense, the idea of this project and the thing that we’re hoping to do is, sort of, get our hands on the long-term economic impact of generative AI. But it’s fundamentally, like, a super-hard problem, right? For a lot of reasons. And one of those reasons is that, you know, some of the effects could be quite far in the future, right. So this is things where the effects themselves but especially, like, the data we might look at to measure them could be years or decades away. And so, fundamentally, what we’re doing here is a prediction problem. And when we were trying to, sort of, look into the future this way, one way we do that is we try to get as much information as we can about where we are right now, right. And so we were lucky to have, like, a ton of information about the current state of the economy and the labor market and some short-term indicators on how generative AI seems to be, sort of, affecting things right now in this moment. And then the idea is to, sort of, layer some theory models on top of that to try to extrapolate forward, right, in terms of what might be happening, sort of get a glimpse of this future point. So in terms of the data we’re looking at right now, there’s this absolutely fantastic dataset that comes from the Department of Labor. It’s the O*NET database. This is the, you know, Occupational Information Network—publicly available, available online—and what it does is basically it breaks down all occupations across the United States, gives a ton of information about them, including—and, sort of, importantly for us—a very detailed breakdown of the individual tasks that make up the day-to-day in terms of those occupations, right. So, for example, if you’re curious to know what, like, a wind energy engineer does day-to-day, you could just go online and look it up, and so it basically gives you the entire breakdown. Which is fantastic. I mean, it’s, you know, I love, sort of, browsing it. It’s an interesting thing to do with an afternoon. [LAUGHTER] But from our perspective, the fact that we have these tasks—and it actually gives really detailed information about what they are—lets us do a lot of analysis on things like how AI tools and generative AI might help with different tasks. There’s a lot of analysis that we and, like, a lot of other papers coming out the last year have done in looking at which tasks do we think generative AI can have a big influence on and which ones less so in the present moment, right. And there’s been work by, you know, OpenAI and LinkedIn and other groups, sort of, really leaning into that. We can actually take that one step further and actually look also at the structure between tasks, right. So we can see not only, like, what fraction of the time I spend are things that can be influenced by generative AI but also how they relate to, like, my actual, sort of, daily goals. Like, when I look at the tasks I have to do, do I have flexibility in when and where I do them, or are things in, sort of, a very rigid structure? Are there groups of interrelated tasks that all happen to be really exposed to generative AI? And, you know, what does that say about how workers might reorganize their work as they integrate AI tools in and how that might change the nature of what it is they’re actually trying to do on a day-to-day basis? 

HUIZINGA: Right. 

LUCIER: So just to give an example, so, like, one of the earliest examples we looked at as we started digging into the data and testing this out was radiology. And so radiology is—you know, this is medical doctors that specialized in using medical imaging technology—and it happens to be an interesting example for this type of work because you know there are lots of tasks that make that up and they have a lot of structure to them. And it turns out when you look at those tasks, there’s interestingly, like, a big group of tasks that all, sort of, are prerequisites for an important, sort of, core part of the job, … 

HUIZINGA: Right … 

LUCIER: … which is, sort of, recommending a plan of which tests to, sort of, perform, right. So these are things like analyzing medical history and analyzing procedure requests, summarizing information, forming reports. And these are all things that we, sort of, expect that generative AI can be quite effective at, sort of, assisting with, right. And so the fact that these are all, sort of, grouped together and feed into something that’s a core part of the job really is suggestive that there’s an opportunity here to delegate some of those, sort of, prerequisite tasks out to, sort of, AI tools so that the radiologist can then focus on the important part, which is the actual recommendations that they can make. 

HUIZINGA: Right. 

LUCIER: And so the takeaway here is that it matters, like, how these tasks are related to each other, right. Sort of, the structure of, you know, what it is that I’m doing and when I’m doing them, right. So this situation would perhaps be very different if, as I was doing these tasks where AI is very helpful, I was going back and forth doing consulting with patients or something like this, where in that, sort of, scenario, I might imagine that, yeah, like an AI tool can help me, like, on a task-by-task basis but maybe I’m less likely to try to, like, organize all those together and automate them away. 

HUIZINGA: Right. Yeah, let me focus a little bit more on this idea of you in the lab with all this data, kind of, parsing out and teasing out the tasks and seeing which ones are targets for AI, which ones are threatened by AI, which ones would be wonderful with AI. Do you have buy-in from these exemplar-type occupations that they say, yes, we would like you to do this to help us? I mean, is there any of that collaboration going on with these kinds of occupations at the task level? 

LUCIER: So the answer is not yet. [LAUGHTER] But this is definitely an important part of the workflow. So I would say that, you know, ultimately, the goal here is that, you know, as we’re looking for these patterns across, like, individual exemplar occupations, that, sort of, what we’re looking for is relationships between tasks that extrapolate out, right. Across lots of different industries, right. So, you know, it’s one thing to be able to say, you know, a lot of very deep things about how AI might influence a particular job or a particular industry. But in some sense, the goal here is to see patterns of tasks that are repeated across lots of different occupations, across lots of different sectors that say, sort of, these are the types of patterns that are really amenable to, sort of, AI being integrated well into the workforce, whereas these are scenarios where it’s much more of an augmenting story as opposed to an automating story. But I think one of the things that’s really interesting about generative AI as a technology here, as opposed to other types of automated technology, is that while there are lots of aspects of a person’s job that can be affected by generative AI, there’s this relationship between the types of work that I might use an AI for versus the types of things that are, sort of, like the core feature of what I’m doing on a day-to-day. 

HUIZINGA: Right. Gotcha … 

LUCIER: And so, maybe it’s, like, at least in the short term, it actually looks quite helpful to say that, you know, there are certain aspects of my work, like going out and summarizing a bunch of heavy data reports, that I’m very happy to have an AI, sort of, do that part of my work. So then I can go and use those things forward in, sort of, the other half of my day. 

HUIZINGA: Yeah. And that’s to Mert’s point: look how much time I just saved! Or I got a half hour back! We’ll get to that in a second. But I really now am eager, Mert, to have you explain your side of this. Brendan just gave us a wonderful task-centric view of AI’s impact on specific jobs. I want you to zoom out and talk about the holistic, as you mentioned before, or macroeconomic view in this collaboration. How are you looking at the impact of AI beyond job tasks, and what role does your work play in helping us understand how these advances in AI might affect job markets and the economy writ large? 

DEMIRER: One thing Brendan mentioned a few minutes ago is this is a prediction task. Like, we need to predict what will be the effect of AI, how AI is going to affect the economy, especially in the long run. So this is a prediction problem that we cannot use machine learning, AI. Otherwise, it would have been a very easy problem to solve. 

HUIZINGA: Right … [LAUGHS] 

DEMIRER: So what you need instead is a model or, like, framework that will take, for example, inputs of, like, the productivity gains, for example, like Brendan talked about, or for, like, microfoundation as an input and then generate predictions for the entire economy. To do that, what I do in my research is I develop and use models of industries and firms. So these models essentially incorporate a bunch of economic agents. Like, this could be labor; this could be firms; this could be [a] policymaker who is trying to regulate the industry. And then you write down the incentives of these, like, different agents in the economy, and then you write down this model, you solve this model with the available data, and then this model gives you predictions. So you can, once you have a model like this, you can ask what would be the effect of a change in the economic environment on like wages, on productivity, on industry concentration, let’s say. So this is what I do in my research. So, like, I briefly mentioned my research on cloud computing. I think this is a very good example. When you think about cloud computing, always … everyone always, like, thinks about it helps you, like, scale very rapidly, which is true, and, like, which is the actual, like, the firm-level effect of cloud computing. But then the question is, like, how that is going to affect the entire industry, whether the industry is going to be more concentrated or less concentrated, it’s going to grow, like, faster, or which industry is going to grow faster, and so on. So essentially, in my research, I develop models like this to answer questions—these, like, high-level questions. And when it comes to AI, we have these, like, very detailed micro-level studies, like these exposure measures Brendan already mentioned, and the framework, the micro framework, we developed is a task view of AI. What you do is, essentially, you take the output of that micro model and then you feed it into a bigger economy-level model, and you develop a higher-level prediction. So, for example, you can apply this, like, task-based model on many different occupations. You can get a number for every occupation, like for occupation A, productivity will be 5 percent; for occupation B, it’s going to be like 10 percent; and so on. You can aggregate them at the industry level—you can get some industry-level numbers—you feed those numbers into a more, like, general equilibrium model and then you solve the model and then you answer questions like, what will be the effect of AI on wage on average? Or, like, what will be the effect of AI on, like, total output in the economy? So my research is, like, more on this answering, like, bigger industry-level or economic-level questions. 

HUIZINGA: Well, Brendan, one of our biggest fears about AI is that it’s going to “steal our jobs.” I just made air quotes on a podcast again. But this isn’t our first disruptive technology rodeo, to use a phrase. So that said, it’s the first of its kind. What sets AI apart from disruptive technologies of the past, and how can looking at the history of technological revolutions help us manage our expectations, both good and bad? 

LUCIER: Fantastic. Such an important question. Yeah, like there’s been, you know, just so much discussion and “negativity versus optimism” debates in the world in the public sphere … 

HUIZINGA: Hope versus hype … 

LUCIER: … and in the academic sphere … yeah, exactly. Hope versus hype. But as you say, yeah, it’s not our first rodeo. And we have a lot of historical examples of these, you know, disruptive, like, so-called general-purpose technologies that have swept through the economy and made a lot of changes and enabled things like electricity and the computer and robotics. Going back further, steam engine and the industrial revolution. You know, these things are revolutions in the sense that, you know, they sort of rearrange work, right. They’re not just changing how we do things. They change what it is that we even do, like just the nature of work that’s being done. And going back to this point of automation versus augmentation, you know, what that looks like can vary quite a bit from revolution to revolution, right. So sometimes this looks like fully automating away certain types of work. But in other cases, it’s just a matter of, sort of, augmenting workers that are still doing, in some terms, what they were doing before but with a new technology that, like, substantially helps them and either takes part of their job and makes it redundant so they can focus on something that’s, you know, more core or just makes them do what they were doing before much, much faster. 

HUIZINGA: Right. 

LUCIER: And either way, you know, this can have a huge impact on the economy and especially, sort of, the labor market. But that impact can be ambiguous, right. So, you know, if I make, you know, a huge segment of workers twice as productive, then companies have a choice. They can keep all the workers and have twice the output, or they can get the same output with half as many workers or something in between, and, you know, which one of those things happens depends not even so much on the technology but on, sort of, the broader economic forces, right. The, you know, the supply and demand and how things are going to come together in equilibrium, which is why this macroeconomic viewpoint is so important to actually give the predictions on, you know, how companies might respond to these changes that are coming through the new technology. Now, you know, where GenAI is, sort of, interesting as an example is the way that, you know, what types of work it impacts, right. So generative AI is particularly notable in that it impacts, you know, high-skill, you know, knowledge-, information-based work directly, right[1]. And it cuts across so many different industries. We think of all the different types of occupations that involve, you know, summarizing data or writing a report or writing emails. There’s so many different types of occupations where this might not be the majority of what they do, but it’s a substantial fraction of what they do. And so in many cases, you know, this technology—as we were saying before—can, sort of, come in and has the potential to automate out or at least really help heavily assist with parts of the job but, in some cases, sort of, leave some other part of the job, which is a core function. And so these are the places where we really expect this human-AI collaboration view to be especially impactful and important, right. Where we’re going to have lots of different workers in lots of different occupations who are going to be making choices on which parts of their work they might delegate to, sort of, AI agents and which parts of the work, you know, they really want to keep their own hands on. 

HUIZINGA: Right, right. Brendan, talk a little more in detail about this idea of low-skill work and high-skill work, maybe physical labor and robotics kind of replacements versus knowledge worker and mental work replacements, and maybe shade it a little bit with the idea of inequalities and how that’s going to play out. I mean, I imagine this project, this collaboration, is looking at some of those issues, as well? 

LUCIER: Absolutely. So, yeah, when we think about, you know, what types of work get affected by some new technology—and especially, sort of, automation technology—a lot of the times in the past, the sorts of work that have been automated out are what we’d call low-skill or, like, at least, sort of, more physical types of labor being replaced or automated by, you know, robotics. We think about the potential of manufacturing and how that displaces, like, large groups of workers who are, sort of, working in the factory manually. And so there’s a sense when this, sort of, happens and a new technology comes through and really disrupts work, there’s this transition period where certain people, you know, even if at the end of the day, the economy will eventually reach sort of new equilibrium which is generally more productive or good overall, there’s a big question of who’s winning and who’s losing both in the long term but especially in that short term, … 

HUIZINGA: Yeah! 

LUCIER: … sort of intermediate, you know, potentially very chaotic and disruptive period. And so very often in these stories of automation historically, it’s largely marginalized low-skill workers who are really getting affected by that transition period. AI—and generative AI in particular—is, sort of, interesting in the potential to be really hitting different types of workers, right. 

HUIZINGA: Right. 

LUCIER: Really this sort of, you know, middle sort of white-collar, information-work class. And so, you know, really a big part of this project and trying to, sort of, get this glimpse into the future is getting, sort of, this—again, as you said—line of sight on which industries we expect to be, sort of, most impacted by this, and is it as we might expect, sort of, those types of work that are most directly affected, or are there second- or third-order effects that might do things that are unanticipated? 

HUIZINGA: Right, and we’ll talk about that in a second. So, Mert, along those same lines, it’s interesting to note how new technologies often start out simply by imitating old technologies. Early movies were stage plays on film. Email was a regular letter sent over a computer. [LAUGHS] Video killed the radio star … But eventually, we realized that these new technologies can do more than we thought. And so when we talked before, you said something really interesting. You said, “If a technology only saves time, it’s boring technology.” What do you mean by that? And if you mean what I think you mean, how does the evolution—not revolution but evolution—of previous technologies serve as a lens for the affordances that we may yet get from AI? 

DEMIRER: Let me say first, technology that saves time is still very useful technology! [LAUGHTER] Who wouldn’t want a technology that will save time? 

HUIZINGA: Sure … 

DEMIRER: But it is less interesting for us, like, to study and maybe it’s, like, less interesting in terms of, like, the broader implications. And so why is that? Because if a technology saves time, then, OK, so I am going to have maybe more time, and the question is, like, how I’m going to spend that time. Maybe I’m going to have more leisure or maybe I’m going to have to produce more. It’s, like, relatively straightforward to analyze and quantify. So however, like, the really impactful technologies could allow us to accomplish new tasks that were previously impossible, and they should open up new opportunities for creativity. And I think here, this knowledge-worker impact of AI is particularly important because I think as a technology, the more it affects knowledge worker, the more likely it’s going to allow us to achieve new things; it’s going to allow us to create more things. So I think in that sense, I think generative AI has a huge potential in terms of making us accomplish new things. And to give you an example from my personal experience, so I’m a knowledge worker, so I do research, I teach, and generative AI is going to help my work, as well. So it’s already affecting … so it’s already saving me time. It’s making me more productive. So suppose that generative AI just, like, makes me 50 percent more productive, let’s say, like five years from now, and that’s it. That’s the only effect. So what’s going to happen to my job? Either I’m going to maybe, like, take more time off or maybe I’m going to write more of the same kind of papers I am writing in economics. But … so imagine, like, generative AI is helping me writing a different kind of paper. How is that possible? So I have a PhD in econ, and if I try really hard, maybe I can do another PhD. But that’s it. Like, I can specialize only one or, like, two topics. But imagine generative AI as an, like, agent or collaborator having PhD in, like, hundreds of different fields, and then you can, like, collaborate and, like, communicate and get information through generative AI on really different fields. That will allow me to do different kinds of research, like more interdisciplinary kinds of research. In that sense, I think the really … the most important part of generative AI is going to be this … what it will allow us to achieve new things, like what creative new things we are going to do. And I can give you a simple example. Like, we were talking about previous technologies. Let’s think of internet. So what was the first application of internet? It’s sending an email. It saves you time. Instead of writing things on a paper and, like, mailing it, you just, like, send it immediately, and it’s a clear time-saving technology. But what are the major implications for internet, like, today? It’s not email. It is like e-commerce, or it is like social media. It allows us to access infinite number of products beyond a few stores in our neighborhood, or it allows us to communicate or connect with people all around the world … 

HUIZINGA: Yeah … 

DEMIRER: … instead of, again, like limiting ourselves to our, like, social circle. So in that sense, I think we are currently in the “email phase” of AI, … 

HUIZINGA: Right … 

DEMIRER: … and we are going to … like, I think AI is going to unlock so many other new capabilities and opportunities, and that is the most exciting part. 

HUIZINGA: Clearly, one of the drivers behind the whole AICE research initiative is the question of what could possibly go wrong if we got everything right, and I want to anchor this question on the common premise that if we get AI right, it will free us from drudgery—we’ve kind of alluded to that—and free us to spend our time on more meaningful or “human”—more air quotes there—pursuits. So, Brendan, have you and your team given any thought to this idea of unintended consequences and what such a society might actually look like? What will we do when AI purportedly gives us back our time? And will we really apply ourselves to making the world better? Or will we end up like those floating people in the movie WALL-E?

LUCIER: [LAUGHS] I love that framing, and I love that movie, so this is great. Yeah. And I think this is one of these questions about, sort of, the possible futures that I think is super important to be tackling. In the past, people, sort of, haven’t stopped working; they’ve shifted to doing different types of work. And as you’re saying, there’s this ideal future in which what’s happening is that people are shifting to doing more meaningful work, right, and the AI is, sort of, taking over parts of the, sort of, the drudgery, you know. These, sort of, annoying tasks that, sort of, I need to do as just, sort of, side effects of my job. I would say that where the economic theory comes in and predicts something that’s slightly different is that I would say that the economic theory predicts that people will do more valuable work in the sense that people will tend to be shifted in equilibrium towards doing things that complement what it is that the AI can do or doing things that the AI systems can’t do as well. And, you know, this is really important in the sense that, like, we’re building these partnerships with these AI systems, right. There’s this human-AI collaboration where human people are doing the things that they’re best at and the AI systems are doing the things that they’re best at. And while we’d love to imagine that, like, that more valuable work will ultimately be more meaningful work in that it’s, sort of, fundamentally more human work, that doesn’t necessarily have to be the case. You know, we can imagine scenarios in which I personally enjoy … there are certain, you know, types of routine work that I happen to personally enjoy and find meaningful. But even in that world, if we get this right and, sort of, the, you know, the economy comes at equilibrium to a place where people are being more productive, they’re doing more valuable work, and we can effectively distribute those gains to everybody, there’s a world in which, you know, this has the potential to be the rising tide that lifts all boats. 

HUIZINGA: Right. 

LUCIER: And so that what we end up with is, you know, we get this extra time, but through this different sort of indirect path of the increased standard of living that comes with an improved economy, right. And so that’s the sort of situation where that source of free time I think really has the potential to be somewhere where we can use it for meaningful pursuits, right. But there are a lot of steps to take to, sort of, get there, and this is why it’s, I think, super important to get this line of sight on what could possibly be happening in terms of these disruptions. 

HUIZINGA: Right. Brendan, something you said reminded me that I’ve been watching a show called Dark Matter, and the premise is that there’s many possible lives we could live, all determined by the choices we make. And you two are looking at possible futures in labor markets and the economy and trying to make models for them. So how do existing hypotheses inform where AI is currently headed, and how might your research help predict them into a more optimal direction? 

LUCIER: Yeah, that’s a really big question. Again, you know, as we’ve said a few times already, there’s this goal here of getting this heads-up on which segments of the economy can be most impacted. And we can envision these better futures as the economy stabilizes, and maybe we can even envision pathways towards getting there by trying to address, sort of, the potential effects of inequality and the distribution of those gains across people. But even in a world where we get all those things right, that transition is necessarily going to be disruptive, right. 

HUIZINGA: Right. 

LUCIER: And so even if we think that things are going to work out well in the long term, in the short term, there’s certainly going to be things that we would hope to invest in to, sort of, improve for everyone. And so even in a world where we believe, sort of, the technology is out there and we really think that people are going to be using it in the ways that make most sense to them, as we get hints about where these impacts can be largest, I think that an important value there is that it lets us anticipate opportunities for responsible stewardship, right. So if we can see where there’s going to be impact, I think we can get a hint as to where we should be focusing our efforts, and that might look like getting ahead of demand for certain use cases or anticipating extra need for, you know, responsible AI guardrails, or even just, like, understanding, you know, [how] labor market impacts can help us inform policy interventions, right. And I think that this is one of the things that gets me really excited about doing this work at Microsoft specifically. Because of how much Microsoft has been investing in responsible AI, and, sort of, the fundamentals that underlie those guardrails and those possible actions means that we, sort of, in this company, we have the ability to actually act on those opportunities, right. And so I think it’s important to really, sort of, try to shine as much light as possible on where we think those will be most effective. 

HUIZINGA: Yeah. Mert, I usually ask my guests on Collaborators where their research is on the spectrum from “lab to life,” but this isn’t that kind of research. We might think of it more in terms of “lab for life” research, where your findings could actually help shape the direction of the product research in this field. So that said, where are you on the timeline of this project, and do you have any learnings yet that you could share with us? 

DEMIRER: I think the first thing I learned about this project is it is difficult to study AI! [LAUGHTER] So we are still in, like, the early stages of the project. So we developed this framework we talked about earlier in the podcast, and now what we are doing is we are applying that framework to a few particular occupations. And the challenge we had is these occupations, when you just describe them, it’s like very simple, but when you go to this, like, task view, it’s actually very complex, the number of tasks. Sometimes we see in the data, like, 20, 30 tasks they do, and the relationship between those tasks. So it turned out to be more difficult than I expected. So what we are currently doing is we are applying the framework to a few specific tasks which help us understand how the model works and whether the model needs any adjustment. And then the goal is once we understand the model on any few specific cases, we’ll scale that up. And then we are going to develop these big predictions on the economy. So we are currently not there yet, but we are hoping to get there pretty soon. 

HUIZINGA: And just to, kind of, follow up on that, what would you say your successful outcome of this research would be? What’s your artifact that you would deliver from this project as collaboration? 

DEMIRER: So ultimately, our goal is to develop predictions that will inform the trajectory the AI is taking, that’s going to inform, like, the policy. That’s our goal, and if we generate that output, and especially if it informs policy of how firms or different agents of the economy adopt AI, I think that will be the ideal output for this project. 

HUIZINGA: Yeah. And what you’ve just differentiated is that there are different end users of your research. Some of them might be governmental. Some of them might be corporate. Some of them might even be individuals or even just layers of management that try to understand how this is working and how they’re working. So wow. Well, I usually close each episode with some future casting. But that basically is what we’ve been talking about this whole episode. So I want to end instead by asking each of you to give some advice to researchers who might be just getting started in AI research, whether that’s the fields that develop the technology itself or the fields that help define its uses and the guardrails we put around it. So what is it important for us to pay attention to right now, and what words of wisdom could you offer to aspiring researchers? I’ll give you each the last word. Mert, why don’t you go first? 

DEMIRER: My first advice will be use AI yourself as much as possible. Because the great thing about AI is that everyone can access this technology even though it’s a very early stage, so there’s a huge opportunity. So I think if you want to study AI, like, you should use it as much as possible. That personally allows me to understand the technology better and also develop research questions. And the second advice would be to stay up to date with what’s happening. This is a very rapidly evolving technology. There is a new product, new use case, new model every day, and it’s hard to keep up. And it is actually important to distinguish between questions that won’t be relevant two months from now versus questions that’s going to be important five years from now. And that requires understanding how the technology is evolving. So I personally find it useful to stay up to date with what’s going on. 

HUIZINGA: Brendan, what would you add to that? 

LUCIER: So definitely fully agree with all of that. And so I guess I would just add something extra for people who are more on the design side, which is that when we build, you know, these systems, these AI tools and guardrails, we oftentimes will have some anticipated, you know, usage or ideas in our head of how this is going to land, and then there’ll always be this moment where it, sort of, meets the real users, you know, the humans who are going to use those things in, you know, possibly unanticipated ways. And, you know, this can be oftentimes a very frustrating moment, but this can be a feature, not a bug, very often, right. So the combined insight and effort of all the users of a product can be this, like, amazing strong force. And so, you know, this is something where we can try to fight against it or we can really try to, sort of, harness it and work with it, and this is why it’s really critical when we’re building especially, sort of, user-facing AI systems, that we design them from the ground up to be, sort of, collaborating, you know, with our users and guiding towards, sort of, good outcomes in the long term, you know, as people jointly, sort of, decide how best to use these products and guide towards, sort of, good usage patterns. 

[MUSIC] 

HUIZINGA: Hmmm. Well, Brendan and Mert, as I said before, this is timely and important research. It’s a wonderful contribution to the AICE research initiative, and I’m thrilled that you came on the podcast today to talk about it. Thanks for joining us. 

LUCIER: Thank you so much. 

DEMIRER: Thank you so much. 

[MUSIC FADES] 


[1] For more information, Lucier notes two resources about the economic impact of GenAI: GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models and Preparing the Workforce for Generative AI.

The post Collaborators: AI and the economy with Brendan Lucier and Mert Demirer appeared first on Microsoft Research.

Read More

Improve AI assistant response accuracy using Knowledge Bases for Amazon Bedrock and a reranking model

Improve AI assistant response accuracy using Knowledge Bases for Amazon Bedrock and a reranking model

AI chatbots and virtual assistants have become increasingly popular in recent years thanks to breakthroughs in large language models (LLMs). Trained on large volumes of data, these models incorporate memory components in their architectural design, allowing them to understand textual context.

Most common use cases for chatbot assistants focus on a few key areas, including enhancing customer experiences, boosting employee productivity and creativity, and optimizing business processes. Examples include customer support, troubleshooting, and internal and external knowledge base search.

Despite these capabilities, a key challenge with chatbots is generating high-quality and accurate responses. One way of solving this challenge is to use Retrieval Augmented Generation (RAG). RAG is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. Reranking seeks to improve search relevance by reordering the result set returned by a retriever with a different model. In this post, we explain how two techniques—RAG and reranking—can help improve chatbot responses using Knowledge Bases for Amazon Bedrock.

Solution overview

RAG is a technique that combines the strengths of knowledge base retrieval and generative models for text generation. It works by first retrieving relevant responses from a database, then using those responses as context to feed the generative model to produce a final output. Using a RAG approach for building a chatbot has many advantages. For example, retrieving responses from its database before generating a response could provide more relevant and coherent responses. This helps improve the conversational flow. RAG also scales better with more data compared to pure generative models, and it doesn’t require fine-tuning of the model when new data is added to the knowledge base. Additionally, the retrieval component enables the model to incorporate external knowledge by retrieving relevant background information from its database. This approach helps provide factual, in-depth, and knowledgeable responses.

To find an answer, RAG takes an approach that uses vector search across the documents. The advantage of using vector search is speed and scalability. Rather than scanning every single document to find the answer, with the RAG approach, you turn the texts (knowledge base) into embeddings and store these embeddings in the database. The embeddings are a compressed version of the documents, represented by an array of numerical values. After the embeddings are stored, the vector search queries the vector database to find the similarity based on the vectors associated with the documents. Typically, a vector search returns the top k most relevant documents based on the user question. However, because the similarity algorithm in a vector database works on vectors and not documents, vector search doesn’t always return the most relevant information in the top k results. This directly impacts the accuracy of the response if the most relevant contexts aren’t available to the LLM.
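As an illustration of this retrieval step, the following minimal sketch (not part of the original solution) computes cosine similarity between a query embedding and document embeddings with NumPy and returns the top k matches; the random vectors stand in for embeddings produced by a model such as Amazon Titan Text Embeddings and stored in a vector database.

import numpy as np

def top_k_by_cosine_similarity(query_vec, doc_vecs, k=5):
    # Normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the k most similar document vectors, best first
    top_idx = np.argsort(scores)[::-1][:k]
    return top_idx, scores[top_idx]

# Example with random stand-in embeddings; a real system would call an
# embeddings model and query a vector database instead
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(100, 1536))
query_vec = rng.normal(size=1536)
indices, scores = top_k_by_cosine_similarity(query_vec, doc_vecs, k=5)
print(indices, scores)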

Reranking is a technique that can further improve the responses by selecting the best option out of several candidate responses. The following architecture illustrates how a reranking solution could work.

Bedrock KB Reranking model architecture

Architecture diagram for Reranking model integration with Knowledge Bases for Bedrock

Let’s create a question answering solution, where we ingest The Great Gatsby, a 1925 novel by American writer F. Scott Fitzgerald. This book is publicly available through Project Gutenberg. We use Knowledge Bases for Amazon Bedrock to implement the end-to-end RAG workflow and ingest the embeddings into an Amazon OpenSearch Serverless vector search collection. We then retrieve answers using standard RAG and a two-stage RAG, which involves a reranking API. We then compare results from these two methods.

The code sample is available in this GitHub repo.

In the following sections, we walk through the high-level steps:

  1. Prepare the dataset.
  2. Generate questions from the document using an Amazon Bedrock LLM.
  3. Create a knowledge base that contains this book.
  4. Retrieve answers using the knowledge base retrieve API.
  5. Evaluate the response using the RAGAS framework.
  6. Retrieve answers again by running a two-stage RAG, using the knowledge base retrieve API and then applying reranking on the context.
  7. Evaluate the two-stage RAG response using the RAGAS framework.
  8. Compare the results and the performance of each RAG approach.

For efficiency purposes, we provide sample code in a notebook to generate a set of questions and answers. These Q&A pairs are used in the RAG evaluation process. We highly recommend having a human validate each question and answer for accuracy.

The following sections explain the major steps with the help of code blocks.

Prerequisites

To clone the GitHub repository to your local machine, open a terminal window and run the following commands:

git clone https://github.com/aws-samples/amazon-bedrock-samples
cd knowledge-bases/features-examples/03-advanced-concepts/reranking

Prepare the dataset

Download the book from the Project Gutenberg website. For this post, we create 10 large documents from this book and upload them to Amazon Simple Storage Service (Amazon S3):

import math
import urllib.request

import boto3
import sagemaker

target_url = "https://www.gutenberg.org/ebooks/64317.txt.utf-8" # The Great Gatsby
data = urllib.request.urlopen(target_url)
my_texts = []
for line in data:
    my_texts.append(line.decode())

doc_size = 700 # lines per document, used to determine the number of batches
batches = math.ceil(len(my_texts) / doc_size)

sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
s3_prefix = "bedrock/knowledgebase/datasource"

start = 0
s3 = boto3.client("s3")
for batch in range(batches):
    batch_text_arr = my_texts[start:start+doc_size]
    batch_text = "".join(batch_text_arr)
    s3.put_object(
        Body=batch_text,
        Bucket=default_bucket,
        Key=f"{s3_prefix}/{start}.txt"
    )
    start += doc_size

Create Knowledge Base for Bedrock

If you’re new to using Knowledge Bases for Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models, where we described how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.

In this step, you create a knowledge base using a Boto3 client. You use Amazon Titan Text Embeddings V2 to convert the documents into embeddings (‘embeddingModelArn’) and point to the S3 bucket you created earlier as the data source (dataSourceConfiguration):

bedrock_agent = boto3.client("bedrock-agent")
response = bedrock_agent.create_knowledge_base(
    name=knowledge_base_name,
    description='Knowledge Base for Bedrock',
    roleArn=role_arn,
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': embedding_model_arn
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': collection_arn,
            'vectorIndexName': index_name,
            'fieldMapping': {
                'vectorField':  "bedrock-knowledge-base-default-vector",
                'textField': 'AMAZON_BEDROCK_TEXT_CHUNK',
                'metadataField': 'AMAZON_BEDROCK_METADATA'
            }
        }
    }
)
knowledge_base_id = response['knowledgeBase']['knowledgeBaseId']
knowledge_base_name = response['knowledgeBase']['name']

response = bedrock_agent.create_data_source(
    knowledgeBaseId=knowledge_base_id,
    name=f"{knowledge_base_name}-ds",
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': f"arn:aws:s3:::{bucket}",
            'inclusionPrefixes': [
                f"{s3_prefix}/",
            ]
        }
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 300,
                'overlapPercentage': 10
            }
        }
    }
)
data_source_id = response['dataSource']['dataSourceId']

response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id,
)
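The ingestion job runs asynchronously, so the notebook typically waits for it to finish before querying the knowledge base. The following is a minimal sketch using the bedrock-agent get_ingestion_job API; the job ID comes from the start_ingestion_job response above.

import time

ingestion_job_id = response['ingestionJob']['ingestionJobId']
while True:
    job = bedrock_agent.get_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        ingestionJobId=ingestion_job_id
    )
    status = job['ingestionJob']['status']
    if status in ('COMPLETE', 'FAILED'):
        break
    time.sleep(30)  # poll every 30 seconds until the ingestion finishes
print(f"Ingestion job finished with status: {status}")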

Generate questions from the document

We use Anthropic Claude on Amazon Bedrock to generate a list of 10 questions and the corresponding answers. The Q&A data serves as the foundation for the RAG evaluation based on the approaches that we are going to implement. We define the generated answers from this step as ground truth data. See the following code:

prompt_template = """The question should be diverse in nature 
across the document. The question should not contain options, not start with Q1/ Q2. 
Restrict the question to the context information provided.

<document>
{{document}}
</document>

Think step by step and pay attention to the number of question to create.

Your response should follow the format as followed:

Question: question
Answer: answer

"""
system_prompt = """You are a professor. Your task is to setup 1 question for an upcoming 
quiz/examination based on the given document wrapped in <document></document> XML tag."""

import re

import boto3

# bedrock_runtime is the Bedrock runtime client; model_id identifies the Anthropic
# Claude model on Amazon Bedrock, and documents holds the text chunks created earlier
bedrock_runtime = boto3.client("bedrock-runtime")

prompt = prompt_template.replace("{{document}}", documents)
temperature = 0.9
top_k = 250
messages = [{"role": "user", "content": [{"text": prompt}]}]
# Base inference parameters to use.
inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
# Additional inference parameters to use.
additional_model_fields = {"top_k": top_k}

# Send the message.
response = bedrock_runtime.converse(
    modelId=model_id,
    messages=messages,
    system=[{"text": system_prompt}],
    inferenceConfig=inference_config,
    additionalModelRequestFields=additional_model_fields
)
print(response['output']['message']['content'][0]['text'])
result = response['output']['message']['content'][0]['text']
q_pos = [(a.start(), a.end()) for a in list(re.finditer("Question:", result))]
a_pos = [(a.start(), a.end()) for a in list(re.finditer("Answer:", result))]
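The q_pos and a_pos offsets mark where each question and answer begins in the model output. The following is a minimal sketch (assuming questions and answers alternate, as requested in the prompt) that slices the text into parallel lists of questions and ground truth answers:

questions, ground_truths = [], []
for i, (q_start, q_end) in enumerate(q_pos):
    a_start, a_end = a_pos[i]
    # Text between "Question:" and the next "Answer:" is the question
    questions.append(result[q_end:a_start].strip())
    # Text between "Answer:" and the next "Question:" (or the end) is the answer
    next_q = q_pos[i + 1][0] if i + 1 < len(q_pos) else len(result)
    ground_truths.append(result[a_end:next_q].strip())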

Retrieve answers using the knowledge base APIs

We use the generated questions and retrieve answers from the knowledge base using the retrieve and converse APIs:

# agent_runtime is the Bedrock agent runtime client used for knowledge base retrieval;
# topk and llm_prompt_template are assumed to be set earlier in the notebook
agent_runtime = boto3.client("bedrock-agent-runtime")

contexts = []
answers = []

for question in questions:
    response = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={
            'text': question
        },
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': topk
            }
        }
    )
    
    retrieval_results = response['retrievalResults']
    local_contexts = []
    for result in retrieval_results:
        local_contexts.append(result['content']['text'])
    contexts.append(local_contexts)
    combined_docs = "\n".join(local_contexts)
    prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
    prompt = prompt.replace("{{query}}", question)
    temperature = 0.9
    top_k = 250
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    # Base inference parameters to use.
    inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
    # Additional inference parameters to use.
    additional_model_fields = {"top_k": top_k}

    # Send the message.
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )
    answers.append(response['output']['message']['content'][0]['text'])

Evaluate the RAG response using the RAGAS framework

We now evaluate the effectiveness of the RAG approach using a framework called RAGAS. The framework provides a suite of metrics to evaluate different dimensions; a minimal evaluation sketch follows the list below. In our example, we evaluate responses based on the following dimensions:

  • Answer relevancy – This metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0–1, where higher scores indicate better relevancy.
  • Answer similarity – This assesses the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0–1. A higher score signifies a better alignment between the generated answer and the ground truth.
  • Context relevancy – This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of 0–1, with higher values indicating better relevancy.
  • Answer correctness – The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0–1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
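As referenced above, the following is a minimal sketch of how such an evaluation might be assembled with the open source ragas library. Exact metric names and the expected dataset columns vary across ragas versions; the questions, answers, contexts, and ground truth answers come from the previous steps.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, answer_similarity, answer_correctness, context_relevancy

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,                          # retrieved passages per question
    "ground_truths": [[gt] for gt in ground_truths],  # column name may differ by version
})
results = evaluate(
    eval_dataset,
    metrics=[answer_relevancy, answer_similarity, answer_correctness, context_relevancy],
)
print(results)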

The following is a summarized report for the standard RAG approach based on the RAGAS evaluation:

answer_relevancy: 0.9006225160334027

answer_similarity: 0.7400904157096762

answer_correctness: 0.32703043056663855

context_relevancy: 0.024797687553157175

Two-stage RAG: Retrieve and rerank

Now that you have the results from the standard RAG approach, let’s explore a two-stage retrieval approach that extends it to integrate with a reranking model. In the context of RAG, reranking models are used after an initial set of contexts is retrieved by the retriever. The reranking model takes the list of results and reranks each one based on the similarity between the context and the user query. In our example, we use a powerful reranking model called bge-reranker-large. The model is available on the Hugging Face Hub and is free for commercial use. In the following code, we use the knowledge base’s retrieve API so we can get a handle on the context, and then rerank it using the reranking model deployed as an Amazon SageMaker endpoint. We provide sample code for deploying the reranking model on SageMaker in the GitHub repository. The following code snippet demonstrates the two-stage retrieval process (a sketch of the two_stage_retrieval helper it calls follows the block):

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

def generate_two_stage_context_answers(bedrock_runtime, 
                                       agent_runtime, 
                                       model_id, 
                                       knowledge_base_id, 
                                       retrieval_topk, 
                                       reranking_model, 
                                       questions, 
                                       rerank_top_k=3):
    contexts = []
    answers = []
    predictor = Predictor(endpoint_name=reranking_model, serializer=JSONSerializer(), deserializer=JSONDeserializer())
    for question in questions:
        retrieval_results = two_stage_retrieval(agent_runtime, knowledge_base_id, question, retrieval_topk, predictor, rerank_top_k)
        local_contexts = []
        documents = []
        for result in retrieval_results:
            local_contexts.append(result)

        contexts.append(local_contexts)
        combined_docs = "\n".join(local_contexts)
        prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
        prompt = prompt.replace("{{query}}", question)
        temperature = 0.9
        top_k = 250
        messages = [{"role": "user", "content": [{"text": prompt}]}]
        inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
        additional_model_fields = {"top_k": top_k}
        
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig=inference_config,
            additionalModelRequestFields=additional_model_fields
        )
        answers.append(response['output']['message']['content'][0]['text'])
    return contexts, answers
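The two_stage_retrieval helper called above isn’t shown in the snippet; the following is a minimal sketch of what it might look like. It assumes the reranking endpoint wraps the bge-reranker-large cross-encoder and accepts query/passage pairs in a Hugging Face text-classification style payload, returning a score per pair; adapt the payload and response parsing to match how you deployed the model.

def two_stage_retrieval(agent_runtime, knowledge_base_id, question,
                        retrieval_topk, predictor, rerank_top_k):
    # Stage 1: vector search against the knowledge base
    response = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={'text': question},
        retrievalConfiguration={
            'vectorSearchConfiguration': {'numberOfResults': retrieval_topk}
        }
    )
    passages = [r['content']['text'] for r in response['retrievalResults']]

    # Stage 2: score each (question, passage) pair with the reranking endpoint.
    # The payload below assumes a Hugging Face text-classification style deployment
    # of the cross-encoder; adjust it to your endpoint's contract.
    payload = {"inputs": [{"text": question, "text_pair": p} for p in passages]}
    scores = predictor.predict(payload)
    ranked = sorted(zip(passages, scores), key=lambda x: x[1]["score"], reverse=True)
    return [p for p, _ in ranked[:rerank_top_k]]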

Evaluate the two-stage RAG response using the RAGAS framework

We evaluate the answers generated by the two-stage retrieval process. The following is a summarized report based on RAGAS evaluation:

answer_relevancy: 0.841581671275458

answer_similarity: 0.7961827348349313

answer_correctness: 0.43361356731293665

context_relevancy: 0.06049484724216884

Compare the results

Let’s compare the results from our tests. As shown in the following figure, the reranking API improves context relevancy, answer correctness, and answer similarity, which are important for improving the accuracy of the RAG process.

2 stage RAG evaluation metrics

RAG vs Two Stage Retrieval evaluation metrics

Similarly, we also measured the RAG latency for both approaches. The results are shown in the following metrics and the corresponding chart:

Standard RAG latency: 76.59s

Two Stage Retrieval latency: 312.12s

reranking model speed comparison

Latency metric for RAG and Two Stage Retrieval process

In summary, using a reranking model (bge-reranker-large) hosted on an ml.m5.xlarge instance yields approximately four times the latency compared to the standard RAG approach. We recommend testing with different reranking model variants and instance types to obtain the optimal performance for your use case.

Conclusion

In this post, we demonstrated how to implement a two-stage retrieval process by integrating a reranking model. We explored how integrating a reranking model with Knowledge Bases for Amazon Bedrock can provide better performance. Finally, we used RAGAS, an open source framework, to provide context relevancy, answer relevancy, answer similarity, and answer correctness metrics for both approaches.

Try out this retrieval process today, and share your feedback in the comments.


About the Authors

Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.

Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and gives prescriptive guidance to achieve their objectives with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling and hiking.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Read More

Automate the machine learning model approval process with Amazon SageMaker Model Registry and Amazon SageMaker Pipelines

Automate the machine learning model approval process with Amazon SageMaker Model Registry and Amazon SageMaker Pipelines

Innovations in artificial intelligence (AI) and machine learning (ML) are causing organizations to take a fresh look at the possibilities these technologies can offer. As you aim to bring your proofs of concept to production at an enterprise scale, you may experience challenges aligning with the strict security compliance requirements of your organization. In the face of these challenges, MLOps offers an important path to shorten your time to production while increasing confidence in the quality of deployed workloads by automating governance processes.

ML models in production are not static artifacts. They reflect the environment where they are deployed and, therefore, require comprehensive monitoring mechanisms for model quality, bias, and feature importance. Organizations often want to introduce additional compliance checks that validate that the model aligns with their organizational standards before it is deployed. These checks, which are frequently manual, can create long lead times to deliver value to customers. Automating these checks allows them to be repeated regularly and consistently, rather than organizations having to rely on infrequent manual point-in-time checks.

This post illustrates how to use common architecture principles to transition from a manual monitoring process to one that is automated. You can use these principles and existing AWS services such as Amazon SageMaker Model Registry and Amazon SageMaker Pipelines to deliver innovative solutions to your customers while maintaining compliance for your ML workloads.

Challenge

As AI becomes ubiquitous, it’s increasingly used to process information and interact with customers in a sensitive context. Suppose a tax agency is interacting with its users through a chatbot. It’s important that this new system aligns with organizational guidelines by allowing developers to have a high degree of confidence that it responds accurately and without bias. At maturity, an organization may have tens or even hundreds of models in production. How can you make sure every model is properly vetted before it’s deployed, and vetted again on each subsequent deployment?

Traditionally, organizations have created manual review processes to keep updated code from becoming available to the public through mechanisms such as an Enterprise Review Committee (ERC), Enterprise Review Board (ERB), or a Change Advisory Board (CAB).

Just as mechanisms have evolved with the rise of continuous integration and continuous delivery (CI/CD), MLOps can reduce the need for manual processes while increasing the frequency and thoroughness of quality checks. Through automation, you can scale in-demand skill sets, such as model and data analysis, and introduce and enforce in-depth analysis of your models at scale across diverse product teams.

In this post, we use SageMaker Pipelines to define the required compliance checks as code. This allows you to introduce analysis of arbitrary complexity while not being limited by the busy schedules of highly technical individuals. Because the automation takes care of repetitive analytics tasks, technical resources can focus on relentlessly improving the quality and thoroughness of the MLOps pipeline to improve compliance posture, and make sure checks are performing as expected.

Deployment of an ML model to production generally requires at least two artifacts to be approved: the model and the endpoint. In our example, the organization is willing to approve a model for deployment if it passes its checks for model quality, bias, and feature importance prior to deployment. Second, the endpoint can be approved for production if it performs as expected when deployed into a production-like environment. In a subsequent post, we walk you through how to deploy a model and implement sample compliance checks. In this post, we discuss how you can extend this process to large language models (LLMs), which produce a varied set of outputs and introduce complexities regarding automated quality assurance checks.

Aligning with AWS multi-account best practices

The solution outlined in this post spans across several accounts in a given AWS organization. For a deeper look at the various components required for an AWS organization multi-account enterprise ML environment, see MLOps foundation roadmap for enterprises with Amazon SageMaker. In this post, we refer to the advanced analytics governance account as the AI/ML governance account. We focus on the development of the enforcement mechanism for the centralized automated model approval within this account.

This account houses centralized components such as a model registry on SageMaker Model Registry, ML project templates on SageMaker Projects, model cards on Amazon SageMaker Model Cards, and container images on Amazon Elastic Container Registry (Amazon ECR).

We use an isolated environment (in this case, a separate AWS environment) to deploy and promote across various environments. You can modify the strategies discussed in this post along the spectrum of centralized vs. decentralized depending on the posture of your organization. For this example, we provide a centralized model. You can also extend this model to align with strict compliance requirements. For example, the AI/ML governance team trusts the development teams are sending the correct bias and explainability reports for a given model. Additional checks could be included to “trust but verify” to further bolster the posture of this organization. Additional complexities such as this are not addressed in this post. To dive further into the topic of MLOps secure implementations, refer to Amazon SageMaker MLOps: from idea to production in six steps.

Solution overview

The following diagram illustrates the solution architecture using SageMaker Pipelines to automate model approval.

Architecture diagram: automated model approval workflow across the development and AI/ML governance accounts

The workflow comprises a comprehensive process for model building, training, evaluation, and approval within an organization containing different AWS accounts, integrating various AWS services. The detailed steps are as follows:

  1. Data scientists from the product team use Amazon SageMaker Studio to create Jupyter notebooks used to facilitate data preprocessing and model pre-building. The code is committed to AWS CodeCommit, a managed source control service. Optionally, you can commit to third-party version control systems such as GitHub, GitLab, or Enterprise Git.
  2. The commit to CodeCommit invokes the SageMaker pipeline, which runs several steps, including model building and training, and running processing jobs using Amazon SageMaker Clarify to generate bias and explainability reports.
    • SageMaker Clarify processes and stores its outputs, including model artifacts and reports in JSON format, in an Amazon Simple Storage Service (Amazon S3) bucket.
    • A model is registered in the SageMaker model registry with a model version.
  3. The Amazon S3 PUT action invokes an AWS Lambda function.
  4. This Lambda function copies all the artifacts from the S3 bucket in the development account to another S3 bucket in the AI/ML governance account, providing restricted access and data integrity. This post assumes your accounts and S3 buckets are in the same AWS Region. For cross-Region copying, see Copy data from an S3 bucket to another account and Region by using the AWS CLI.
  5. Registering the model invokes a default Amazon CloudWatch event associated with SageMaker model registry actions.
  6. The CloudWatch event is consumed by Amazon EventBridge, which invokes another Lambda function.
  7. This Lambda function is tasked with starting the SageMaker approval pipeline.
  8. The SageMaker approval pipeline evaluates the artifacts against predefined benchmarks to determine if they meet the approval criteria.
  9. Based on the evaluation, the pipeline updates the model status to approved or rejected accordingly.

This workflow provides a robust, automated process for model approval using AWS’s secure, scalable infrastructure and services. Each step is designed to make sure that only models meeting the set criteria are approved, maintaining high standards for model performance and fairness.

Prerequisites

To implement this solution, you need to first create and register an ML model in the SageMaker model registry with the necessary SageMaker Clarify artifacts. You can create and run the pipeline by following the example provided in the following GitHub repository.

The following sections assume that a model package version has been registered with status Pending Manual Approval. This status allows you to build an approval workflow. You can either have a manual approver or set up an automated approval workflow based on metrics checks in the aforementioned reports.
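For example, you can confirm that the latest version in your model package group is awaiting approval with a quick check like the following (model_package_group_name is assumed to hold your group name):

import boto3

client = boto3.client("sagemaker")
pending = client.list_model_packages(
    ModelPackageGroupName=model_package_group_name,
    ModelApprovalStatus="PendingManualApproval",
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]
print(f"{len(pending)} model package version(s) pending approval")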

Build your pipeline

SageMaker Pipelines allows you to define a series of interconnected steps defined as code using the Pipelines SDK. You can extend the pipeline to help meet your organizational needs with both automated and manual approval steps. In this example, we build the pipeline to include two major steps. The first step evaluates artifacts uploaded to the AI/ML governance account by the model build pipeline against threshold values set by model registry administrators for model quality, bias, and feature importance. The second step receives the evaluation and updates the model’s status and metadata based on the values received. The pipeline is represented in SageMaker Pipelines by the following DAG.

A node, RegisteredModelValidationStep on top pointing to the second node below, UpdateModelStatusStep

Next, we dive into the code required for the pipeline and its steps. First, we define a pipeline session to help manage AWS service integration as we define our pipeline. This can be done as follows:

from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

Each step runs as a SageMaker Processor for which we specify a small instance type due to the minimal compute requirements of our pipeline. The processor can be defined as follows:

from sagemaker.processing import Processor

step_processor = Processor(
    image_uri=image_uri,
    role=role,
    instance_type="ml.t3.medium",
    base_job_name=base_job_name,
    instance_count=1,
    sagemaker_session=pipeline_session,
)

We then define the pipeline steps using step_processor.run(…) as the input parameter to run our custom script inside the defined environment.

Validate model package artifacts

The first step takes two arguments: default_bucket and model_package_group_name. It outputs the results of the checks in JSON format stored in Amazon S3. The step is defined as follows:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

process_step = ProcessingStep(
    name="RegisteredModelValidationStep",
    step_args= step_processor.run(
        code="automated-model-approval/model-approval-checks.py",
        inputs=[],
        outputs=[
            ProcessingOutput(
                output_name="checks",
                destination=f"s3://{default_bucket}/governance-pipeline/processor/",
                source="/opt/ml/processing/output"
        )],
        arguments=[
            "--default_bucket", default_bucket_s3, 
            "--model_package_group_name", model_package_group_name
        ]
    )
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail.

Values passed to arguments can be parsed using standard methods like argparse and will be used throughout the script. We use these values to retrieve the model package. We then parse the model package’s metadata to find the location of the model quality, bias, and explainability reports. See the following code:
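For example, a minimal sketch of parsing those arguments at the top of the script might look like the following:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--default_bucket", type=str, required=True)
parser.add_argument("--model_package_group_name", type=str, required=True)
args, _ = parser.parse_known_args()

default_bucket = args.default_bucket
model_package_group_name = args.model_package_group_name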

import boto3

client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

model_package_arn = client.list_model_packages(
    ModelPackageGroupName=model_package_group_name
)["ModelPackageSummaryList"][0]["ModelPackageArn"]
model_package_metrics = client.describe_model_package(
    ModelPackageName=model_package_arn
)["ModelMetrics"]
model_quality_s3_key = model_package_metrics["ModelQuality"]["Statistics"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality_bias = model_package_metrics["Bias"]
model_quality_pretrain_bias_key = model_quality_bias["PreTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality__post_train_bias_key = model_quality_bias["PostTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_explainability_s3_key = model_package_metrics["Explainability"]["Report"]["S3Uri"].split(f"{default_bucket}/")[1]

The reports retrieved are simple JSON files we can then parse. In the following example, we retrieve the treatment equity and compare it to our threshold in order to return a True or False result. Treatment equity is defined as the difference in the ratio of false negatives to false positives for the advantaged vs. disadvantaged group. We arbitrarily set the threshold to 0.8.

import json

s3_obj = s3_client.get_object(Bucket=default_bucket, Key=model_quality__post_train_bias_key)
s3_obj_data = s3_obj['Body'].read().decode('utf-8')
model_quality__post_train_bias_json = json.loads(s3_obj_data)
treatment_equity = model_quality__post_train_bias_json["post_training_bias_metrics"][
    "facets"]["column_8"][0]["metrics"][-1]["value"]
treatment_equity_check_threshold = 0.8
treatment_equity_check = treatment_equity < treatment_equity_check_threshold

After running through the measures of interest, we return the true/false checks to a JSON file that will be copied to Amazon S3 as per the output variable of the ProcessingStep.
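As a minimal sketch (the check names are illustrative), the script can write those boolean results to the processing output directory, which the ProcessingStep then uploads to the S3 destination defined earlier:

import json

checks = {
    "treatment_equity_check": treatment_equity_check,
    # additional model quality, bias, and explainability checks go here
}
with open("/opt/ml/processing/output/checks.json", "w") as f:
    json.dump(checks, f)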

Update the model package status in the model registry

When the initial step is complete, we use the JSON file created in Amazon S3 as input to update the model package’s status and metadata. See the following code:

update_model_status_step = ProcessingStep(
    name="UpdateModelStatusStep",
    step_args=step_processor.run(
        code="automated-model-approval/validate-model.py",
        inputs=[
            ProcessingInput(
                source=process_step.properties.ProcessingOutputConfig.Outputs[
                    "checks"
                ].S3Output.S3Uri,
                destination="/opt/ml/processing/input",
            ),
        ],
        outputs=[],
        arguments=[
            "--model_package_group_name", model_package_group_name
        ]
    ),
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail. First, parse the values in checks.json to evaluate whether the model passed all checks and to collect the reasons for any failures:

import json

is_approved = True
reasons = []
with open('/opt/ml/processing/input/checks.json') as f:
    checks = json.load(f)
    print(f"checks: {checks}")
    for key, value in checks.items():
        if not value:
            is_approved = False
            reasons.append(key)

After we know if the model should be approved or rejected, we update the model status and metadata as follows:

# client (a SageMaker Boto3 client) and model_package_arn are obtained the same way
# as in the first script
if is_approved:
    approval_description = "Model package meets organisational guidelines"
else:
    approval_description = "Model values for the following checks do not meet the threshold: "

for reason in reasons:
    approval_description += f"{reason} "

model_package_update_input_dict = {
    "ModelPackageArn": model_package_arn,
    "ApprovalDescription": approval_description,
    "ModelApprovalStatus": "Approved" if is_approved else "Rejected"
}

model_package_update_response = client.update_model_package(**model_package_update_input_dict)

This step produces a model with a status of Approved or Rejected based on the set of checks specified in the first step.

Orchestrate the steps as a SageMaker pipeline

We orchestrate the previous steps as a SageMaker pipeline with two parameter inputs passed as arguments to the various steps:

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterString

model_package_group_name = ParameterString(
    name="ModelPackageGroupName", default_value="ModelPackageGroupName is required variable."
)

default_bucket_s3 = ParameterString(
    name="Bucket", default_value="Bucket is required variable"
)

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[model_package_group_name, default_bucket_s3],
    steps=[process_step, update_model_status_step],
)

It’s straightforward to extend this pipeline by adding elements into the list passed to the steps parameter. In the next section, we explore how to run this pipeline as new model packages are registered to our model registry.
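As a sketch, you can register (or update) the pipeline definition and start a run manually before wiring up the event-driven trigger described next; the role and parameter values below are placeholders for your environment.

# Create or update the pipeline definition in SageMaker, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start(
    parameters={
        "ModelPackageGroupName": "my-model-package-group",  # placeholder value
        "Bucket": "my-governance-artifacts-bucket",         # placeholder value
    }
)
execution.wait()  # block until the approval checks finish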

Run the event-driven pipeline

In this section, we outline how to invoke the pipeline using an EventBridge rule and Lambda function.

Create a Lambda function and select the Python 3.9 runtime. The following function retrieves the model package ARN, the model package group name, and the S3 bucket where the artifacts are stored based on the event. It then starts running the pipeline using these values:

import json
import boto3
sagemaker_client = boto3.client('sagemaker')

def lambda_handler(event, context):
    model_arn = event.get('detail', {}).get('ModelPackageArn', 'Unknown')
    model_package_group_name = event.get('detail', {}).get('ModelPackageGroupName', 'Unknown') 
    model_package_name = event.get('detail', {}).get('ModelPackageName', 'Unknown') 
    model_data_url = event.get('detail', {}).get('InferenceSpecification', {}).get('ModelDataUrl', 'Unknown')
    
    # Specify the name of your SageMaker pipeline
    pipeline_name = 'model-governance-pipeline'
    
    # Define multiple parameters
    pipeline_parameters = [
        {'Name': "ModelPackageGroupName", 'Value': model_package_group_name},
        {'Name': "Bucket", 'Value': model_data_url},
    ]
    # Start the pipeline execution
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDisplayName=pipeline_name,
        PipelineParameters=pipeline_parameters
    )
    
    # Return the response
    return response

After defining the Lambda function, we create the EventBridge rule to automatically invoke the function when a new model package is registered in the model registry with the PendingManualApproval status. You can use AWS CloudFormation and the following template to create the rule:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "CloudFormation template for EventBridge rule 'invoke-model-approval-checks'",
  "Resources": {
    "EventRule0": {
      "Type": "AWS::Events::Rule",
      "Properties": {
        "EventBusName": "default",
        "EventPattern": {
          "source": ["aws.sagemaker"],
          "detail-type": ["SageMaker Model Package State Change"],
          "detail": {
            "ModelApprovalStatus": ["PendingManualApproval"]
          }
        },
        "Name": "invoke-model-approval-checks",
        "State": "ENABLED",
        "Targets": [{
          "Id": "Id403a084c-2837-4408-940f-b808389653d1",
          "Arn": "<Your Lambda function ARN>"
        }]
      }
    }
  }
}

We now have a two-step SageMaker pipeline that is invoked whenever a new model is registered. It evaluates model quality, bias, and feature importance metrics and updates the model status accordingly.
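
To exercise the end-to-end flow, you can register a test model package with a PendingManualApproval status, which emits the state change event that the rule matches. A hedged sketch (the image URI, artifact path, and group name are placeholders):

import boto3

sm = boto3.client("sagemaker")
sm.create_model_package(
    ModelPackageGroupName="<your model package group name>",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [{
            "Image": "<inference image URI>",
            "ModelDataUrl": "s3://<your artifact bucket>/model.tar.gz",
        }],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)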

Applying this approach to generative AI models

In this section, we explore how the complexities introduced by LLMs change the automated monitoring workflow.

Traditional ML models typically produce concise outputs with obvious ground truths in their training dataset. In contrast, LLMs can generate long, nuanced sequences that may have little to no ground truth because of the autoregressive way these models are trained. This strongly influences several components of the governance pipeline we’ve described.

For instance, in traditional ML models, bias is detected by looking at the distributions of labels over different population subsets (for example, male vs. female). The labels (often a single number or a few numbers) are a clear and simple signal used to measure bias. In contrast, generative models produce lengthy and complex answers, which don’t provide an obvious signal to be used for monitoring. HELM (a holistic framework for evaluating foundation models) allows you to simplify monitoring by untangling the evaluation process into metrics of concern. This includes accuracy, calibration and uncertainty, robustness, fairness, bias and stereotypes, toxicity, and efficiency. We then apply downstream processes to measure for these metrics independently. This is generally done using standardized datasets composed of examples and a variety of accepted responses.

We concretely evaluate four metrics of interest to any governance pipeline for LLMs: memorization and copyright, disinformation, bias, and toxicity, as described in HELM. This is done by collecting inference results from the model pushed to the model registry. The benchmarks include:

  • Memorization and copyright with books from BookCorpus, which uses popular books from a bestseller list and source code from the Linux kernel. This can be quickly extended to include additional copyrighted works.
  • Disinformation with headlines from the MisinfoReactionFrames dataset, which has false headlines across a number of topics.
  • Bias with Bias Benchmark for Question Answering (BBQ). This QA dataset works to highlight biases affecting various social groups.
  • Toxicity with Bias in Open-ended Language Generation Dataset (BOLD), which benchmarks across profession, gender, race, religion, and political ideology.

Each of these datasets is publicly available. They each allow complex aspects of a generative model’s behavior to be isolated and distilled down to a single number. This flow is described in the following architecture.
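
As a rough illustration of the pattern (not the exact implementation), each benchmark reduces to iterating over prompts, scoring the model’s responses, and averaging into a single number; invoke_model and score_response below are hypothetical helpers:

def evaluate_benchmark(prompts, references, invoke_model, score_response):
    """Distill one aspect of model behavior (e.g., toxicity) into a single number."""
    scores = []
    for prompt, reference in zip(prompts, references):
        response = invoke_model(prompt)                      # hypothetical: call the registered model
        scores.append(score_response(response, reference))   # hypothetical: metric-specific scorer
    return sum(scores) / len(scores)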

A flow from the benchmark dataset going into the large language model, which gets saved into Requests & Responses, which gets sent to the processing job. The processing job has additional input from the ground truth benchmark datasets. Ultimately, metrics are sent to a Metrics & Results bucket.

For a detailed view of this topic along with important mechanisms to scale in production, refer to Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services.

Conclusion

In this post, we discussed a sample solution to begin automating your compliance checks for models going into production. As AI/ML becomes increasingly common, organizations require new tools to codify the expertise of their highly skilled employees in the AI/ML space. By embedding your expertise as code and running these automated checks against models using event-driven architectures, you can increase both the speed and quality of models, because the checks run as needed rather than relying on the availability of individuals for manual compliance or quality assurance reviews. By using well-known CI/CD techniques from the application development lifecycle and applying them to the ML modeling lifecycle, organizations can scale in the era of generative AI.

If you have any thoughts or questions, please leave them in the comments section.


About the Authors

Jayson Sizer McIntosh is a Senior Solutions Architect at Amazon Web Services (AWS) in the World Wide Public Sector (WWPS) based in Ottawa (Canada) where he primarily works with public sector customers as an IT generalist with a focus on Dev(Sec)Ops/CICD. Bringing his experience implementing cloud solutions in high compliance environments, he is passionate about helping customers successfully deliver modern cloud-based services to their users.

Nicolas Bernier is an AI/ML Solutions Architect, part of the Canadian Public Sector team at AWS. He is currently conducting research in Federated Learning and holds five AWS certifications, including the ML Specialty Certification. Nicolas is passionate about helping customers deepen their knowledge of AWS by working with them to translate their business challenges into technical solutions.

Pooja Ayre is a seasoned IT professional with over 9 years of experience in product development, having worn multiple hats throughout her career. For the past two years, she has been with AWS as a Solutions Architect, specializing in AI/ML. Pooja is passionate about technology and dedicated to finding innovative solutions that help customers overcome their roadblocks and achieve their business goals through the strategic use of technology. Her deep expertise and commitment to excellence make her a trusted advisor in the IT industry.

Read More

Recursion CEO Chris Gibson on Accelerating the Biopharmaceutical Industry With AI

Recursion CEO Chris Gibson on Accelerating the Biopharmaceutical Industry With AI

Techbio is a field combining data, technology and biology to enhance scientific processes — and AI has the potential to supercharge the biopharmaceutical industry further.  In this episode of NVIDIA’s AI Podcast, host Noah Kravitz speaks with Chris Gibson, cofounder and CEO of Recursion, about how the company uses AI and machine learning to accelerate drug discovery and development at scale. Tune in to hear Gibson discuss how AI is transforming the biopharmaceutical industry by increasing efficiency and lowering discovery costs.

Time Stamps

0:58: Background on Recursion
6:23: Recursion’s approach to drug discovery
12:06: Empirical data generation and generative AI prediction
17:46: How supercomputing is accelerating drug discovery
22:32: What is techbio?
29:15: The future — using natural language prompts to work with AI systems
31:44: Recursion’s plans for the future

You Might Also Like:

Cardiac Clarity: Dr. Keith Channon Talks Revolutionizing Heart Health With AI – Ep. 212

Caristo Diagnostics has developed an AI-powered solution for detecting coronary inflammation in cardiac CT scans. Dr. Keith Channon, cofounder and chief medical officer of the company, discusses how Caristo uses AI to improve treatment plans and risk predictions by providing patient-specific readouts.

Cofounder of Annalise.ai Aengus Tran on Using AI as a Spell Check for Health Checks – Ep. 207

Clinician-led healthcare AI company Harrison.ai has built annalise.ai. This AI solution serves as a “spell checker” for radiologists — flagging critical findings to improve the speed and accuracy of radiology image analysis, reducing misdiagnoses. Harrison.ai CEO and cofounder Aengus Tran discusses the potential of autonomous AI systems to scale global healthcare capacity.

Matice Founder Jessica Whited on Harnessing Regenerative Species for Medical Breakthroughs – Ep. 198

Matice Biosciences is using AI to study the regeneration of tissues in animal species known as super-regenerators, such as salamanders and planarians. Jessica Whited, a regenerative biologist at Harvard and cofounder of Matice Biosciences, discusses the company’s goal to harness regenerative species and AI to develop new treatments that help humans heal from injuries without scarring.

Bojan Tunguz, Johnny Israeli on How AI and Crowdsourcing Can Advance Vaccine Distribution – Ep. 195

Artificial intelligence is teaming up with crowdsourcing to improve the thermo-stability of mRNA vaccines, making distribution more accessible worldwide. Bojan Tunguz, a physicist and senior system software engineer at NVIDIA, and Johnny Israeli, senior manager of AI and cloud software at NVIDIA, discuss the fusion of AI, crowdsourcing and machine learning and its potential in drug discovery.


Read More

Problem Solved: STEM Studies Supercharged With RTX and AI Technologies

Problem Solved: STEM Studies Supercharged With RTX and AI Technologies

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX PC users.

AI powered by NVIDIA GPUs is accelerating nearly every industry, creating high demand for graduates, especially from STEM fields, who are proficient in using the technology. Millions of students worldwide are participating in university STEM programs to learn skills that will set them up for career success.

To prepare students for the future job market, NVIDIA has worked with top universities to develop a GPU-accelerated AI curriculum that’s now taught in more than 5,000 schools globally. Students can get a jumpstart outside of class with NVIDIA’s AI Learning Essentials, a set of resources that equips individuals with the necessary knowledge, skills and certifications for the rapidly evolving AI workforce.

NVIDIA GPUs — whether running in university data centers, GeForce RTX laptops or NVIDIA RTX workstations — are accelerating studies, helping enhance the learning experience and enabling students to gain hands-on experience with hardware used widely in real-world applications.

Supercharged AI Studies

NVIDIA provides several tools to help students accelerate their studies.

The RTX AI Toolkit is a powerful resource for students looking to develop and customize AI models for projects in computer science, data science, and other STEM fields. It allows students to train and fine-tune the latest generative AI models, including Gemma, Llama 3 and Phi 3, up to 30x faster — enabling them to iterate and innovate more efficiently, advancing their studies and research projects.

Students studying data science and economics can use NVIDIA RAPIDS AI and data science software libraries to run traditional machine learning models up to 25x faster than conventional methods, helping them handle large datasets more efficiently, perform complex analyses in record time and gain deeper insights from data.

AI-deal for Robotics, Architecture and Design

Students studying robotics can tap the NVIDIA Isaac platform for developing, testing and deploying AI-powered robotics applications. Powered by NVIDIA GPUs, the platform consists of NVIDIA-accelerated libraries, applications frameworks and AI models that supercharge the development of AI-powered robots like autonomous mobile robots, arms and manipulators, and humanoids.

While GPUs have long been used for 3D design, modeling and simulation, their role has significantly expanded with the advancement of AI. GPUs are today used to run AI models that dramatically accelerate rendering processes.

Some industry-standard design tools powered by NVIDIA GPUs and AI include:

  • SOLIDWORKS Visualize: This 3D computer-aided design rendering software uses NVIDIA Optix AI-powered denoising to produce high-quality ray-traced visuals, streamlining the design process by providing faster, more accurate visual feedback.
  • Blender: This popular 3D creation suite uses NVIDIA Optix AI-powered denoising to deliver stunning ray-traced visuals, significantly accelerating content creation workflows.
  • D5 Render: Commonly used by architects, interior designers and engineers, D5 Render incorporates NVIDIA DLSS technology for real-time viewport rendering, enabling smoother, more detailed visualizations without sacrificing performance. Powered by fourth-generation Tensor Cores and the NVIDIA Optical Flow Accelerator on GeForce RTX 40 Series GPUs and NVIDIA RTX Ada Generation GPUs, DLSS uses AI to create additional frames and improve image quality.
  • Enscape: Enscape makes it possible to ray trace more geometry at a higher resolution, at exactly the same frame rate. It uses DLSS to enhance real-time rendering capabilities, providing architects and designers with seamless, high-fidelity visual previews of their projects.

Beyond STEM

Students, hobbyists and aspiring artists use the NVIDIA Studio platform to supercharge their creative processes with RTX and AI. RTX GPUs power creative apps such as Adobe Creative Cloud, Autodesk, Unity and more, accelerating a variety of processes such as exporting videos and rendering art.

ChatRTX is a demo app that lets students create a personalized GPT large language model connected to their own content and study materials, including text, images or other data. Powered by advanced AI, ChatRTX functions like a personalized chatbot that can quickly provide students relevant answers to questions based on their connected content. The app runs locally on a Windows RTX PC or workstation, meaning students can get fast, secure results personalized to their needs.

NVIDIA ChatRTX user interface.

Schools are increasingly adopting remote learning as a teaching modality. NVIDIA Broadcast — a free application that delivers professional-level audio and video with AI-powered features on RTX PCs and workstations — integrates seamlessly with remote learning applications including BlueJeans, Discord, Google Meet, Microsoft Teams, Webex and Zoom. It uses AI to enhance remote learning experiences by removing background noise, improving image quality in low-light scenarios, and enabling background blur and background replacement.

NVIDIA Broadcast.

From Data Centers to School Laptops


NVIDIA RTX-powered mobile workstations and GeForce RTX and Studio RTX 40 Series laptops offer supercharged development, learning, gaming and creating experiences with AI-enabled tools and apps. They also include exclusive access to the NVIDIA Studio platform of creative tools and technologies, and Max-Q technologies that optimize battery life and acoustics — giving students an ideal platform for all aspects of campus life.

Say goodbye to late nights in the computer lab — GeForce RTX laptops and NVIDIA RTX workstations share the same architecture as the NVIDIA GPUs powering many university labs and data centers. That means students can study, create and play — all on the same PC.

STEM Application Performance for GeForce RTX 4060 Laptop GPU versus Laptop without GeForce RTX GPU.

Learn more about GeForce RTX laptops and NVIDIA RTX workstations.

Read More

FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention

FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention

a cartoon chart flexing his muscles

In theory, Attention is All You Need. In practice, however, we also need optimized attention implementations like FlashAttention.

Although these fused attention implementations have substantially improved performance and enabled long contexts, this efficiency has come with a loss of flexibility. You can no longer try out a new attention variant by writing a few PyTorch operators – you often need to write a new custom kernel! This operates as a sort of “software lottery” for ML researchers – if your attention variant doesn’t fit into one of the existing optimized kernels, you’re doomed to slow runtime and CUDA OOMs.

For some examples of attention variants, we have Causal, Relative Positional Embeddings, Alibi, Sliding Window Attention, PrefixLM, Document Masking/Sample Packing/Jagged Tensors, Tanh Soft-Capping, PagedAttention, etc. Even worse, folks often want combinations of these! Sliding Window Attention + Document Masking + Causal + Context Parallelism? Or what about PagedAttention + Sliding Window + Tanh Soft-Capping?

The left picture below represents the state of the world today – some combinations of masking + biases + setting have existing kernels implemented. But the various options lead to an exponential number of settings, and so overall we end up with fairly spotty support. Even worse, new attention variants researchers come up with will have zero support.

Attention variant support diagram

To solve this hypercube problem once and for all, we introduce FlexAttention, a new PyTorch API.

  1. We provide a flexible API that allows implementing many attention variants (including all the ones mentioned in the blog post so far) in a few lines of idiomatic PyTorch code.
  2. We lower this into a fused FlashAttention kernel through torch.compile, generating a FlashAttention kernel that doesn’t materialize any extra memory and has performance competitive with handwritten ones.
  3. We also automatically generate the backwards pass, leveraging PyTorch’s autograd machinery.
  4. Finally, we can also take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.

With FlexAttention, we hope that trying new attention variants will only be limited by your imagination.

You can find many FlexAttention examples at the Attention Gym: https://github.com/pytorch-labs/attention-gym. If you have any cool applications, feel free to submit an example!

PS: We also find this API very exciting since it leverages a lot of existing PyTorch infra in a fun way – more on that in the end.

FlexAttention

Here is the classic attention equation:

Attention(Q, K, V) = softmax(QKᵀ / √head_dim) V

In code form:

Q, K, V: Tensor[batch_size, num_heads, sequence_length, head_dim]
score: Tensor[batch_size, num_heads, sequence_length, sequence_length] = (Q @ K) / sqrt(head_dim)
probabilities = softmax(score, dim=-1)
output: Tensor[batch_size, num_heads, sequence_length, head_dim] = probabilities @ V
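
The pseudocode above maps directly onto a few tensor operations; a runnable sketch for reference (shapes are illustrative):

import math
import torch

B, H, S, D = 2, 8, 128, 64
Q = torch.randn(B, H, S, D)
K = torch.randn(B, H, S, D)
V = torch.randn(B, H, S, D)

score = (Q @ K.transpose(-2, -1)) / math.sqrt(D)   # [B, H, S, S]
probabilities = torch.softmax(score, dim=-1)
output = probabilities @ V                          # [B, H, S, D]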

FlexAttention allows for a user-defined function score_mod:

FlexAttention(Q, K, V) = softmax(score_mod(QKᵀ / √head_dim)) V

In code form:

Q, K, V: Tensor[batch_size, num_heads, sequence_length, head_dim]
score: Tensor[batch_size, num_heads, sequence_length, sequence_length] = (Q @ K) / sqrt(head_dim)
modified_scores: Tensor[batch_size, num_heads, sequence_length, sequence_length] = score_mod(score)
probabilities = softmax(modified_scores, dim=-1)
output: Tensor[batch_size, num_heads, sequence_length, head_dim] = probabilities @ V

This function allows you to modify the attention scores prior to softmax. Surprisingly, this ends up being sufficient for the vast majority of attention variants (examples below)!

Concretely, the expected signature for score_mod is somewhat unique.

def score_mod(score: f32[], b: i32[], h: i32[], q_idx: i32[], kv_idx: i32[]):
    return score # noop - standard attention

In other words, score is a scalar pytorch tensor that represents the dot product of a query token and a key token. The rest of the arguments tell you which dot product you’re currently computing – b (current element in batch), h (current head), q_idx (position in query), kv_idx (position in key/value tensors).

To apply this function, we could implement it as

for b in range(batch_size):
    for h in range(num_heads):
        for q_idx in range(sequence_length):
            for kv_idx in range(sequence_length):
                modified_scores[b, h, q_idx, kv_idx] = score_mod(scores[b, h, q_idx, kv_idx], b, h, q_idx, kv_idx)

Of course, this is not how FlexAttention is implemented under the hood. Leveraging torch.compile, we automatically lower your function into a single fused FlexAttention kernel – guaranteed or your money back!

This API ends up being surprisingly expressive. Let’s look at some examples.

Score Mod Examples

Full Attention

Let’s first do “full attention”, or standard bidirectional attention. In this case, score_mod is a no-op – it takes as input the scores and then returns them as is.

def noop(score, b, h, q_idx, kv_idx):
    return score

And to use it end to end (including both forwards and backwards):

from torch.nn.attention.flex_attention import flex_attention

flex_attention(query, key, value, score_mod=noop).sum().backward()
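
For a self-contained run, the inputs just need the [batch, heads, sequence, head_dim] layout described earlier; a minimal sketch (tensor sizes are illustrative and a CUDA device is assumed):

import torch
from torch.nn.attention.flex_attention import flex_attention

def noop(score, b, h, q_idx, kv_idx):
    return score

B, H, S, D = 2, 8, 1024, 64
query, key, value = (
    torch.randn(B, H, S, D, device="cuda", dtype=torch.float16, requires_grad=True)
    for _ in range(3)
)
out = flex_attention(query, key, value, score_mod=noop)
out.sum().backward()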

Relative Position Encodings

One common attention variant is the “relative position encoding”. Instead of encoding the absolute distance in the queries and keys, relative position encoding adjusts scores based on the “distance” between the queries and keys.

def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

Note that unlike typical implementations, this does not need to materialize a SxS tensor. Instead, FlexAttention computes the bias values “on the fly” within the kernel, leading to significant memory and performance improvements.

relative position encoding

ALiBi Bias

alibi bias

Source: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

ALiBi was introduced in Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, and claims to have beneficial properties for length extrapolation at inference. Notably, MosaicML has pointed to “lack of kernel support” as the main reason why they eventually switched from ALiBi to rotary embeddings.

Alibi is similar to relative positional encodings with one exception – it has a per-head factor that is typically precomputed.

alibi_bias = generate_alibi_bias() # [num_heads]

def alibi(score, b, h, q_idx, kv_idx):
    bias = alibi_bias[h] * (q_idx - kv_idx)
    return score + bias

This demonstrates one interesting piece of flexibility torch.compile provides – we can load from alibi_bias even though it wasn’t explicitly passed in as an input! The generated Triton kernel will calculate the correct loads from the alibi_bias tensor and fuse it. Note that you could regenerate alibi_bias and we still wouldn’t need to recompile.
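
The snippet above treats generate_alibi_bias as given; one way to write it, following the geometric per-head slope scheme from the ALiBi paper (a sketch, not necessarily the exact helper used here):

import torch

def generate_alibi_bias(num_heads: int = 8) -> torch.Tensor:
    # One negative slope per head, spaced geometrically: -2^(-8/n), -2^(-16/n), ...
    exponents = torch.arange(1, num_heads + 1, dtype=torch.float32)
    return -torch.exp2(-exponents * (8.0 / num_heads))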

Soft-capping

Soft-capping is a technique used in Gemma2 and Grok-1 that prevents logits from growing excessively large. In FlexAttention, it looks like:

softcap = 20
def soft_cap(score, b, h, q_idx, kv_idx):
    score = score / softcap
    score = torch.tanh(score)
    score = score * softcap
    return score

Note that we also automatically generate the backwards pass from the forwards pass here. Also, although this implementation is semantically correct, we likely want to use a tanh approximation in this case for performance reasons. See attention-gym for more details.

Causal Mask

Although bidirectional attention is the simplest, the original Attention is All You Need paper and the vast majority of LLMs use attention in a decoder-only setting where each token can only attend to the tokens prior to it. Folks often think of this as a lower-triangular mask, but with the score_mod API it can be expressed as:

def causal_mask(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

Basically, if the query token is “after” the key token, we keep the score. Otherwise, we mask it out by setting it to -inf, thus ensuring it won’t participate in the softmax calculation.

However, masking is special compared to other modifications – if something is masked out, we can completely skip its computation! In this case, a causal mask has about 50% sparsity, so not taking advantage of the sparsity would result in a 2x slowdown. Although this score_mod is sufficient to implement causal masking correctly, getting the performance benefits of sparsity requires another concept – mask_mod.

Mask Mods

To take advantage of sparsity from masking, we need to do some more work. Specifically, by passing a mask_mod to create_block_mask, we can create a BlockMask. FlexAttention can then use BlockMask to take advantage of the sparsity!

The signature of mask_mod is very similar to score_mod – just without the score. In particular

# returns True if this position should participate in the computation
mask_mod(b, h, q_idx, kv_idx) => bool

Note that score_mod is strictly more expressive than mask_mod. However, for masking, it’s recommended to use mask_mod and create_block_mask, as it’s more performant. See the FAQ on why score_mod and mask_mod are separate.

Now, let’s take a look at how we might implement causal mask with mask_mod.

Causal Mask

from torch.nn.attention.flex_attention import create_block_mask

def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Because the sparsity pattern is independent of batch and heads, we'll set them to None (which broadcasts them)
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=1024, KV_LEN=1024)
# In this case, we don't need a score_mod, so we won't pass any in.
# However, score_mod can still be combined with block_mask if you need the additional flexibility.
flex_attention(query, key, value, block_mask=block_mask)

Note that create_block_mask is a relatively expensive operation! Although FlexAttention will not need to recompile when it changes, if you aren’t careful about caching it, it can lead to significant slowdowns (check out the FAQ for suggestions on best practices).

flexattention performance charts

While the TFlops are roughly the same, the execution time is 2x faster for the mask_mod version! This demonstrates that we can leverage the sparsity that BlockMask provides us without losing hardware efficiency.

Sliding Window + Causal

Sliding Window Causal diagrams

Source: Mistral 7B

Popularized by Mistral, sliding window attention (also known as local attention) takes advantage of the intuition that the most recent tokens are the most useful. In particular, it allows the query token to only attend to, say, the 1024 most recent tokens. This is often used together with causal attention.

SLIDING_WINDOW = 1024

def sliding_window_causal(b, h, q_idx, kv_idx):
    causal_mask = q_idx >= kv_idx
    window_mask = q_idx - kv_idx <= SLIDING_WINDOW 
    return causal_mask & window_mask

# If you want to be cute...
from torch.nn.attention.flex_attention import and_masks

def sliding_window(b, h, q_idx, kv_idx):
    return q_idx - kv_idx <= SLIDING_WINDOW

sliding_window_causal = and_masks(causal_mask, sliding_window)
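
As with the earlier causal example, the composed mask mod is turned into a BlockMask and passed to flex_attention; a brief usage sketch (assuming query, key, and value with sequence length 1024, matching SLIDING_WINDOW):

block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=1024, KV_LEN=1024)
out = flex_attention(query, key, value, block_mask=block_mask)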

We benchmark it against F.scaled_dot_product_attention with a sliding window mask as well as FA2 with a causal mask (as a reference point for performance). Not only are we significantly faster than F.scaled_dot_product_attention, we’re also significantly faster than FA2 with a causal mask as this mask has significantly more sparsity.

execution time charts

PrefixLM

PrefixLM diagram

Source: PaliGemma: A versatile 3B VLM for transfer

The T5 architecture, proposed in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, describes an attention variant that performs full bidirectional attention on a “prefix”, and causal attention on the rest. We again compose two mask functions to accomplish this, one for causal masking and one that is based off of the prefix length.

prefix_length: [B]
def prefix_mask(b, h, q_idx, kv_idx):
    return kv_idx <= prefix_length[b]

prefix_lm_causal = or_masks(prefix_mask, causal_mask)
# In this case, our mask is different per sequence so we set B equal to our batch size
block_mask = create_block_mask(prefix_lm_causal, B=B, H=None, Q_LEN=S, KV_LEN=S)

Just like with score_mod, mask_mod allows us to refer to additional tensors that aren’t explicitly an input to the function! However, with prefixLM, the sparsity pattern changes per input. This means that for each new input batch, we’ll need to recompute the BlockMask. One common pattern is to call create_block_mask at the beginning of your model and reuse that block_mask for all attention calls in your model. See Recomputing Block Masks vs. Recompilation.

However, in exchange for that, we’re not only able to have an efficient attention kernel for prefixLM, we’re also able to take advantage of however much sparsity exists in the input! FlexAttention will dynamically adjust its performance based off of the BlockMask data, without needing to recompile the kernel.

Document Masking/Jagged Sequences

Another common attention variant is document masking/jagged sequences. Imagine that you have a number of sequences of varying length. You want to train on all of them together, but unfortunately, most operators only accept rectangular tensors.

Through BlockMask, we can support this efficiently in FlexAttention as well!

  1. First, we flatten all sequences into a single sequence with sum(sequence lengths) tokens.
  2. Then, we compute the document_id that each token belongs to.
  3. Finally, in our mask_mod, we simply check whether the query and kv token belong to the same document!

# The document that each token belongs to.
# e.g. [0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2] corresponds to sequence lengths 3, 2, and 6.
document_id: [SEQ_LEN]

def document_masking(b, h, q_idx, kv_idx):
    return document_id[q_idx] == document_id[kv_idx]
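
Constructing document_id from the original sequence lengths is a one-liner; a minimal sketch (lengths is assumed to hold the per-sequence lengths):

import torch

lengths = [3, 2, 6]  # per-sequence lengths, matching the example above
document_id = torch.repeat_interleave(torch.arange(len(lengths)), torch.tensor(lengths))
# tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2])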

And that’s it! In this case, we see that we end up with a blockdiagonal mask.

blockdiagonal mask

One interesting aspect about document masking is that it’s easy to see how it might combine with arbitrary other masks. For example, we already defined prefix_lm_causal in the previous section. Do we now need to define a prefixlm_document_mask function as well?

In these cases, one pattern we’ve found quite useful is what we call a “higher level modification”. In this case, we can take an existing mask_mod and automatically transform it into one that works with jagged sequences!

def generate_doc_mask_mod(mask_mod, document_id):
    # Get unique document IDs and their counts
    _, counts = torch.unique_consecutive(document_id, return_counts=True)
    # Create cumulative counts (offsets)
    offsets = torch.cat([torch.tensor([0], device=document_id.device), counts.cumsum(0)[:-1]])
    def doc_mask_wrapper(b, h, q_idx, kv_idx):
        same_doc = document_id[q_idx] == document_id[kv_idx]
        q_logical = q_idx - offsets[document_id[q_idx]]
        kv_logical = kv_idx - offsets[document_id[kv_idx]]
        inner_mask = mask_mod(b, h, q_logical, kv_logical)
        return same_doc & inner_mask
    return doc_mask_wrapper

For example, given the prefix_lm_causal mask from above, we can transform it into one that works on packed documents like so:

prefix_length = torch.tensor(2, dtype=torch.int32, device="cuda")
def prefix_mask(b, h, q_idx, kv_idx):
    return kv_idx < prefix_length
prefix_lm_causal = or_masks(prefix_mask, causal_mask)
doc_prefix_lm_causal_mask = generate_doc_mask_mod(prefix_lm_causal, document_id)

blockdiagonal mask

Now, this mask is “block-prefixLM-diagonal” shaped. 🙂

That’s all of our examples! There are far more attention variants than we have space to list, so check out Attention Gym for more examples. We hope that the community will contribute some of their favorite applications of FlexAttention as well.

FAQ

Q: When does FlexAttention need to recompile?

As FlexAttention leverages torch.compile for graph capture, it can actually avoid recompilation in a broad spectrum of cases. Notably, it does not need to recompile even if captured tensors change values!

flex_attention = torch.compile(flex_attention)
def create_bias_mod(bias):
    def bias_mod(score, b, h, q_idx, kv_idx):
        return score + bias
    return bias_mod
bias_mod1 = create_bias_mod(torch.tensor(0))
flex_attention(..., score_mod=bias_mod1) # Compiles the kernel here 

bias_mod2 = create_bias_mod(torch.tensor(2))
flex_attention(..., score_mod=bias_mod2) # Doesn't need to recompile! 

Even changing the block-sparsity doesn’t require a recompile. However, if the block-sparsity changes, we do need to recompute the BlockMask.

Q: When should we recompute the BlockMask?

We need to recompute the BlockMask whenever the block-sparsity changes. Although computing the BlockMask is much cheaper than recompilation (on the order of hundreds of microseconds as opposed to seconds), you should still take care to not excessively recompute the BlockMask.

Here are some common patterns and some recommendations on how you might approach them.

Mask never changes (e.g. causal mask)
In this case, you can simply precompute the block mask and cache it globally, reusing it for all attention calls.

import functools

block_mask = create_block_mask(causal_mask, 1, 1, S, S)
causal_attention = functools.partial(flex_attention, block_mask=block_mask)

Mask changes every batch (e.g. document masking)
In this case, we would suggest computing the BlockMask at the beginning of the model and threading it through the model – reusing the BlockMask for all layers.

def forward(self, x, doc_mask):
    # Compute block mask at beginning of forwards
    block_mask = create_block_mask(doc_mask, None, None, S, S)    
    x = self.layer1(x, block_mask)
    x = self.layer2(x, block_mask)
    ...
    # amortize block mask construction cost across all layers
    x = self.layer3(x, block_mask) 
    return x

Mask changes every layer (e.g. data-dependent sparsity)
This is the hardest setting, since we’re unable to amortize the block mask computation across multiple FlexAttention invocations. Although FlexAttention can certainly still benefit this case, the actual benefits from BlockMask depend on how sparse your attention mask is and how fast we can construct the BlockMask. That leads us to…

Q: How can we compute BlockMask quicker?

create_block_mask is unfortunately fairly expensive, both from a memory and compute perspective, as determining whether a block is completely sparse requires evaluating mask_mod at every single point in the block. There are a couple ways to address this:

  1. If your mask is the same across batch size or heads, make sure that you’re broadcasting over those (i.e. set them to None in create_block_mask).
  2. Compile create_block_mask. Unfortunately, today, torch.compile does not work directly on create_block_mask due to some unfortunate limitations. However, you can set _compile=True, which will significantly reduce the peak memory and runtime (often an order of magnitude in our testing).
  3. Write a custom constructor for BlockMask. The metadata for BlockMask is quite simple (see the documentation). It’s essentially two tensors.
    a. num_blocks: The number of KV blocks computed for each query block.
    b. indices: The positions of the KV blocks computed for each query block.

    For example, here’s a custom BlockMask constructor for causal_mask.

def create_causal_mask(S):
    BLOCK_SIZE = 128
    # The first query block computes one block, the second query block computes 2 blocks, etc.
    num_blocks = torch.arange(S // BLOCK_SIZE, device="cuda") + 1
    # Since we're always computing from the left to the right,
    # we can use the indices [0, 1, 2, ...] for every query block.
    indices = torch.arange(S // BLOCK_SIZE, device="cuda").expand(
        S // BLOCK_SIZE, S // BLOCK_SIZE
    )
    num_blocks = num_blocks[None, None, :]
    indices = indices[None, None, :]
    return BlockMask(num_blocks, indices, BLOCK_SIZE=BLOCK_SIZE, mask_mod=causal_mask)
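
A brief usage sketch for the custom constructor above (it assumes BlockMask is imported from torch.nn.attention.flex_attention, reuses the causal_mask mask_mod defined earlier, and that query/key/value are [B, H, 4096, D] CUDA tensors):

from torch.nn.attention.flex_attention import BlockMask, flex_attention

block_mask = create_causal_mask(4096)
out = flex_attention(query, key, value, block_mask=block_mask)
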
Q: Why are score_mod and mask_mod different? Isn’t mask_mod just a special case of score_mod?

Very astute question, hypothetical audience member! In fact, any mask_mod can be easily converted to a score_mod (we do not recommend using this function in practice!)

def mask_mod_as_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(mask_mod(b, h, q_idx, kv_idx), score, -float("inf"))

So, if score_mod can implement everything mask_mod can, what’s the point of having mask_mod?

One immediate challenge: a score_mod requires the actual score value as an input, but when we’re precomputing the BlockMask, we don’t have the actual score value. We can perhaps fake the values by passing in all zeros, and if the score_mod returns -inf, then we consider it to be masked (in fact, we originally did this!).

However, there are two issues. The first is that this is hacky – what if the user’s score_mod returned -inf when the input is 0? Or what if the user’s score_mod masked out with a large negative value instead of -inf? It seems we’re trying to cram a round peg into a square hole. However, there’s a more important reason to separate out mask_mod from score_mod – it’s fundamentally more efficient!

As it turns out, applying masking to every single computed element is actually quite expensive – our benchmarks see about a 15-20% degradation in performance! So, although we can get significant speedups by skipping half the computation, we lose a meaningful part of that speedup from needing to mask out every element!

Luckily, if we visualize the causal mask, we notice that the vast majority of blocks do not require a “causal mask” at all – they’re fully computed! It is only the blocks on the diagonal, partially computed and partially masked, that require masking to be applied.

blockdiagonal mask

The BlockMask previously told us which blocks we need to compute and which blocks we can skip. Now, we further augment this data structure to also tell us which blocks are “fully computed” (i.e. masking can be skipped) vs. “partially computed” (i.e. a mask needs to be applied). Note, however, that although masks can be skipped on “fully computed” blocks, other score_mods like relative positional embeddings still need to be applied.

Given just a score_mod, there’s no sound way for us to tell which parts of it are “masking”. Hence, the user must separate these out themselves into mask_mod.

Q: How much additional memory does the BlockMask need?

The BlockMask metadata is of size [BATCH_SIZE, NUM_HEADS, QUERY_LEN//BLOCK_SIZE, KV_LEN//BLOCK_SIZE]. If the mask is the same across the batch or heads dimension it can be broadcasted over that dimension to save memory.

At the default BLOCK_SIZE of 128, we expect that the memory usage will be fairly negligible for most use cases. For example, for a sequence length of 1 million, the BlockMask would only use 60MB of additional memory. If this is a problem, you can increase the block size: create_block_mask(..., BLOCK_SIZE=1024). For example, increasing BLOCK_SIZE to 1024 would result in this metadata dropping to under a megabyte.
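
As a sanity check on those figures (assuming roughly one byte per block entry): with BLOCK_SIZE=128 and a sequence length of 1,000,000, the metadata grid has (1,000,000 / 128)² ≈ 6.1 × 10⁷ entries, on the order of 60MB, while BLOCK_SIZE=1024 shrinks it to (1,000,000 / 1024)² ≈ 9.5 × 10⁵ entries, well under a megabyte.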

Q: How do the numerics compare?

Although the results are not bitwise identical, we are confident that FlexAttention is as numerically accurate as FlashAttention. We generate the following distribution of differences comparing FlashAttention versus FlexAttention over a large range of inputs on both causal and non causal attention variants. The errors are nearly identical.

distribution chart

Performance

Generally speaking, FlexAttention is nearly as performant as a handwritten Triton kernel, which is unsurprising, as we heavily leverage a handwritten Triton kernel. However, due to its generality, we do incur a small performance penalty. For example, we must incur some additional latency to determine which block to compute next. In some cases, we provide some kernel options that can affect the performance of the kernel while changing its behavior. They can be found here: performance knobs

As a case study, let’s explore how the knobs affect the performance of causal attention. We will compare performance of the triton kernel versus FlashAttentionv2 on A100. The script can be found here.

FlexAttention achieves 90% of FlashAttention2’s performance in the forward pass and 85% in the backward pass. FlexAttention is currently utilizing a deterministic algorithm that recomputes more intermediates than FAv2, but we have plans to improve FlexAttention’s backward algorithm and hope to close this gap!

flexattention speed chart

flexattention speed chart

Conclusion

We hope you have as much fun using FlexAttention as we did developing it! While working on this, we ended up finding way more applications of this API than we could have expected. We’ve already seen it accelerate torchtune’s sample packing throughput by 71%, replace the need for a researcher to spend over a week writing their own custom Triton kernel, and deliver competitive performance with custom handwritten attention variants.

One final thing that made implementing FlexAttention quite fun is that we were able to leverage a lot of existing PyTorch infra in an interesting way. For example, one of the unique aspects about TorchDynamo (torch.compile’s frontend) is that it does not require tensors used in the compiled function to be explicitly passed in as inputs. This allows us to compile mods like document masking, which need to access global variables whose values change between calls!

bias = torch.randn(1024, 1024)
def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[q_idx][kv_idx] # The bias tensor can change!

Furthermore, the fact that torch.compile is a generic graph-capture mechanism also allows it to support more “advanced” transformations, such as the higher order transform that transforms any mask_mod into one that works with jagged tensors.

We also leverage TorchInductor (torch.compile’s backend) infrastructure for Triton templates. Not only did this make it easy to support codegening FlexAttention – it also automatically gave us support for dynamic shapes as well as epilogue fusion (i.e. fusing an operator onto the end of attention)! In the future, we plan on extending this support to allow for quantized versions of attention or things like RadixAttention as well.

In addition, we also leveraged higher order ops, PyTorch’s autograd to automatically generate the backwards pass, as well as vmap to automatically apply score_mod for creating the BlockMask.

And, of course, this project wouldn’t have been possible without Triton and TorchInductor’s ability to generate Triton code.

We look forward to leveraging the approach we used here to more applications in the future!

Limitations and Future Work

  • FlexAttention is currently available in PyTorch nightly releases; we plan to release it as a prototype feature in 2.5.0.
  • We did not cover how to use FlexAttention for inference here (or how to implement PagedAttention) – we will cover those in a later post.
  • We are working to improve the performance of FlexAttention to match FlashAttention3 on H100 GPUs.
  • FlexAttention requires that all sequence lengths be a multiple of 128 – this will be addressed soon.
  • We plan on adding GQA support soon – for now, you can just replicate the kv heads.

Acknowledgements

We want to highlight some prior work (and people) that have inspired FlexAttention.

  • Tri Dao’s work on FlashAttention
  • Francisco Massa and the Xformers team for BlockSparseAttention in Triton
  • The Jax team’s work on SplashAttention
  • Philippe Tillet and Keren Zhou for helping us with Triton
  • Ali Hassani for discussions on neighborhood attention
  • Everybody who’s complained about attention kernels not supporting their favorite attention variant 🙂

Read More

Generating Gender Alternatives in Machine Translation

This paper was accepted at the 5th Workshop on Gender Bias in Natural Language Processing 2024.
Machine translation (MT) systems often translate terms with ambiguous gender (e.g., English term “the nurse”) into the gendered form that is most prevalent in the systems’ training data (e.g., “enfermera”, the Spanish term for a female nurse). This often reflects and perpetuates harmful stereotypes present in society. With MT user interfaces in mind that allow for resolving gender ambiguity in a frictionless manner, we study the problem of generating all grammatically correct gendered translation…

Apple Machine Learning Research