Set up a custom plugin on Amazon Q Business and authenticate with Amazon Cognito to interact with backend systems

Businesses are constantly evolving, and leaders are challenged every day to meet new requirements while seeking ways to optimize their operations and gain a competitive edge. One of the key challenges they face is managing the complexity of disparate business systems and workflows, which leads to inefficiencies, data silos, and missed opportunities.

Generative AI can play an important role in integrating these disparate systems in a secure and seamless manner, addressing these challenges in a cost-effective way. This integration allows for secure and efficient data exchange, action triggering, and enhanced productivity across the organization. Amazon Q Business plays an important role in making this happen: it enables organizations to quickly and effortlessly analyze their data, uncover insights, and make data-driven decisions. With its intuitive interface and seamless integration with other AWS services, Amazon Q Business empowers businesses of different sizes to transform their data into actionable intelligence and drive innovation across their operations.

In this post, we demonstrate how to build a custom plugin with Amazon Q Business for backend integration. This plugin can integrate existing systems, including third-party systems, with little to no custom development and in just weeks, and automate critical workflows. Additionally, we show how to safeguard the solution using Amazon Cognito and AWS IAM Identity Center, maintaining the safety and integrity of sensitive data and workflows. Amazon Q Business also offers application environment guardrails, or chat controls, that you can configure to control the end-user chat experience and add an additional layer of safety. Lastly, we show how to expose your backend APIs through Amazon API Gateway, backed by serverless AWS Lambda functions and Amazon DynamoDB.

Solution overview

Amazon Q Business is a fully managed, generative AI-powered assistant that helps enterprises unlock the value of their data and knowledge. With Amazon Q Business, you can quickly find answers to questions, generate summaries and content, and complete tasks by using the information and expertise stored across your company’s various data sources and enterprise systems. At the core of this capability are built-in data source connectors and custom plugins that seamlessly integrate and index content from multiple repositories into a unified index. This enables the Amazon Q Business large language model (LLM) to provide accurate, well-written answers by drawing from the consolidated data and information. The data source connectors act as a bridge, synchronizing content from disparate systems like Salesforce, Jira, and SharePoint into a centralized index that powers the natural language understanding and generative abilities of Amazon Q Business. Amazon Q Business also provides the capability to create custom plugins to integrate with your organization’s backend system and third-party applications.

After you integrate Amazon Q Business with your backend system using a custom plugin, users can ask questions about documents that are uploaded to Amazon Simple Storage Service (Amazon S3). For this post, we use a simple document that contains product names, descriptions, and other related information. Some of the questions you can ask Amazon Q Business might include the following:

  • “Give me the name of the products.”
  • “Now list all the products along with the description in tabular format.”
  • “Now create one of the products <product name>.” (At this stage, Amazon Q Business will require you to authenticate against Amazon Cognito to make sure you have the right permission to work on that application.)
  • “List all the products along with ID and price in tabular format.”
  • “Update the price of product with ID <product ID>.”

The following diagram illustrates the solution architecture.


The workflow consists of the following steps:

  1. The user asks a question using the Amazon Q Business chat interface.
  2. Amazon Q Business searches the indexed document in Amazon S3 for relevant information and presents it to the user.
  3. The user can use the plugin to perform actions (API calls) in the system exposed to Amazon Q Business using OpenAPI 3.x standards.
  4. Because the API is secured with Amazon Cognito, Amazon Q Business requires the user to authenticate against the user credentials available in Amazon Cognito.
  5. On successful authentication, API Gateway forwards the request to Lambda.
  6. The API response is returned to the user through the Amazon Q Business chat interface.

Prerequisites

Before you begin the walkthrough, you must have an AWS account. If you don’t have one, sign up for one. Additionally, you must have access to the following services:

  • Amazon API Gateway
  • AWS CloudFormation
  • Amazon Cognito
  • Amazon DynamoDB
  • AWS IAM Identity Center
  • AWS Lambda
  • Amazon Q Business Pro (this incurs an additional monthly cost)
  • Amazon S3

Launch the CloudFormation template

Launch the following CloudFormation template to set up Amazon Cognito, API Gateway, DynamoDB, and Lambda resources.

After you deploy the stack, navigate to the Outputs tab for the stack on the AWS CloudFormation console and note the resource details. We use those values later in this post.

If you’re running the CloudFormation template multiple times, make sure to choose a unique name for the stack each time.

Create an Amazon Q Business application

Complete the following steps to create an Amazon Q Business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Choose Create application.

  3. Provide an application name (for example, product-mgmt-app).
  4. Leave the other settings as default and choose Create.

The application will be created in a few seconds.

  5. On the application details page, choose Data source.
  6. Choose Add an index.
  7. For Index name, enter a name for the index.
  8. For Index provisioning, select Enterprise or Starter.
  9. For Number of units, leave as the default 1.
  10. Choose Add an index.

  11. On the Data source page, choose Add a data source.
  12. Choose Amazon S3 as your data source and enter a unique name.
  13. Enter the data source location as the value of BucketName from the CloudFormation stack outputs in the format s3://<name_here>.

In a later step, we upload a file to this S3 bucket.

  14. For IAM role, choose Create a new service role (recommended).
  15. For Sync scope, select Full sync.
  16. For Frequency, select Run on demand.
  17. Choose Add data source.
  18. On the application details page, choose Manage user access.
  19. Choose Add groups and users.
  20. You can use existing users or groups in IAM Identity Center or create new users and groups, then choose Confirm.

Only these groups and users have access to the Amazon Q Business application for their subscriptions.

  21. Take note of the deployed URL of the application to use in a later step.
  22. On the Amazon S3 console, locate the S3 bucket you noted earlier and upload the sample document.
  23. On the Amazon Q Business console, navigate to the application details page and sync the Amazon S3 data source.

Configure Amazon Cognito

Complete the following steps to set up Amazon Cognito:

  1. On the Amazon Cognito console, navigate to the user pool created using the CloudFormation template (ending with -ProductUserPool).
  2. Under Branding in the navigation pane, choose Domain.
  3. On the Actions menu, choose Create Cognito domain.

We did not create a domain when we created the user pool using the CloudFormation template.

  4. For Cognito domain, enter a domain prefix.
  5. For Version, select Hosted UI.
  6. Choose Create Cognito domain.

  7. Under Applications in the navigation pane, choose App clients.
  8. Choose your app client.

  9. On the app client detail page, choose Login pages and then choose Edit the managed login pages configuration.
  10. For URL, enter the deployed URL you noted earlier, followed by /oauth/callback. For example, https://xxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback.
  11. Specify your identity provider, OAuth 2.0 grant type, OpenID Connect scopes, and custom scopes.

Custom scopes are defined as part of the API configuration in API Gateway. This will help Amazon Q Business determine what action a user is allowed to take. In this case, we are allowing the user to read, write, and delete. However, you can change this based on what you want your users to do using the Amazon Q Business chat.

  12. Choose Save changes.

  13. Take note of the Client ID and Client secret values in the App client information section to use in a later step.

Amazon Cognito doesn’t support changing the client secret after you have created the app client; a new app client is needed if you want to change the client secret.

Lastly, you have to add at least one user to the Amazon Cognito user pool.

  14. Choose Users under User management in the navigation pane and choose Create user.
  15. Create a user to add to your Amazon Cognito user pool.

We will use this user to authenticate before we can chat and ask questions to the backend system using Amazon Q Business.
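
Optionally, you can confirm that this user can sign in before wiring up the plugin. The following is a minimal sketch using boto3; it assumes the app client allows the USER_PASSWORD_AUTH flow and, because the app client has a secret, it computes the required SECRET_HASH. The IDs and credentials are placeholders:

# Minimal sketch (not part of the CloudFormation stack): verify the test user can
# authenticate against the user pool. Assumes the app client allows USER_PASSWORD_AUTH.
import base64
import hashlib
import hmac

import boto3

CLIENT_ID = "<app-client-id>"          # from the App client information section
CLIENT_SECRET = "<app-client-secret>"
USERNAME = "<test-user>"
PASSWORD = "<test-password>"

def secret_hash(username: str, client_id: str, client_secret: str) -> str:
    # SECRET_HASH = Base64(HMAC-SHA256(key=client_secret, msg=username + client_id))
    digest = hmac.new(
        client_secret.encode("utf-8"),
        (username + client_id).encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.b64encode(digest).decode()

cognito = boto3.client("cognito-idp", region_name="us-east-1")
response = cognito.initiate_auth(
    ClientId=CLIENT_ID,
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={
        "USERNAME": USERNAME,
        "PASSWORD": PASSWORD,
        "SECRET_HASH": secret_hash(USERNAME, CLIENT_ID, CLIENT_SECRET),
    },
)
# A successful call returns ID, access, and refresh tokens
print("Access token prefix:", response["AuthenticationResult"]["AccessToken"][:20])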

Create an Amazon Q Business custom plugin

Complete the following steps to create your custom plugin:

  1. On the Amazon Q Business console, navigate to the application you created.
  2. Under Actions in the navigation pane, choose Plugins.
  3. Choose Add plugin.

  4. Select Create custom plugin.
  5. Provide a plugin name (for example, Products).
  6. Under API schema source, select Define with in-line OpenAPI schema editor and enter the following code:
openapi: 3.0.0
info:
  title: CRUD API
  version: 1.0.0
  description: API for performing CRUD operations
servers:
  - url: put api gateway endpoint url here, copy it from cloudformation output
    
paths:
  /products:
    get:
      summary: List all products
      security:
        - OAuth2:
            - products/read
      description: Returns a list of all available products
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/Product'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    post:
      summary: Create a new product
      security:
        - OAuth2:
            - products/write
      description: Creates a new product
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Product'
      responses:
        '201':
          description: Created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
  /products/{id}:
    get:
      summary: Get a product
      security:
        - OAuth2:
            - products/read
      description: Retrieves a specific product by its ID
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to retrieve
          schema:
            type: string
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    put:
      summary: Update a product
      security:
        - OAuth2:
            - products/write
      description: Updates an existing product
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to update
          schema:
            type: string
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Product'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    delete:
      summary: Delete a product
      security:
        - OAuth2:
            - products/delete
      description: Deletes a specific product by its ID
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to delete
          schema:
            type: string
      responses:
        '204':
          description: Successful response
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
components:
  securitySchemes:
    OAuth2:
      type: oauth2
      flows:
        authorizationCode:
          authorizationUrl: <Cognito domain>/oauth2/authorize
          tokenUrl: <Cognito domain>/oauth2/token
          scopes:
            products/read: read product
            products/write: write product
            products/delete: delete product
  schemas:
    Product:
      type: object
      required:
        - id
        - name
        - description
      properties:
        id:
          type: string
        name:
          type: string
        description:
          type: string
    Error:
      type: object
      properties:
        error:
          type: string
  7. In the YAML file, replace the URL value with the value of ProductAPIEndpoint from the CloudFormation stack outputs:

servers url: https://<<xxxx>>.execute-api.us-east-1.amazonaws.com/dev

  8. Replace the Amazon Cognito domain URL with the domain you created earlier:

authorizationCode:
  authorizationUrl: https://xxxx.auth.us-east-1.amazoncognito.com/oauth2/authorize
  tokenUrl: https://xxxx.auth.us-east-1.amazoncognito.com/oauth2/token
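
If you want to verify the Cognito configuration independently of Amazon Q Business, the following sketch exchanges an authorization code (obtained by completing the hosted UI login in a browser) for tokens at the Cognito token endpoint. The domain, client ID, client secret, and redirect URI are the placeholder values from the earlier steps:

# Sketch: exchange an authorization code for tokens at the Cognito token endpoint.
# All values below are placeholders from the earlier configuration steps.
import requests

COGNITO_DOMAIN = "https://xxxx.auth.us-east-1.amazoncognito.com"
CLIENT_ID = "<app-client-id>"
CLIENT_SECRET = "<app-client-secret>"
REDIRECT_URI = "https://xxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback"

response = requests.post(
    f"{COGNITO_DOMAIN}/oauth2/token",
    data={
        "grant_type": "authorization_code",
        "client_id": CLIENT_ID,
        "code": "<authorization-code>",   # returned to the callback URL after hosted UI login
        "redirect_uri": REDIRECT_URI,
    },
    auth=(CLIENT_ID, CLIENT_SECRET),      # client secret sent via HTTP Basic auth
)
tokens = response.json()
print(tokens.get("token_type"), list(tokens.keys()))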

The YAML file contains the OpenAPI 3.x schema that Amazon Q Business uses to decide which API to call based on each operation's description. For example, the GET /products operation is described as Returns a list of all available products, which instructs Amazon Q Business to call this API whenever a user asks to list all products.

  9. For authentication, select Authentication required.
  10. For AWS Secrets Manager secret, choose Create and add new secret, enter the client ID and client secret you saved earlier, and enter the same callback URL you used for the Amazon Cognito hosted UI (https://<>.chat.qbusiness.<<region>>.on.aws/oauth/callback).
  11. For Choose a method to authorize Amazon Q Business, choose Create and use a new service role.
  12. Choose Create plugin.

The last step is to enable the chat orchestration feature so Amazon Q Business can select the plugin automatically.

  13. On the custom plugin details page, choose Admin controls and guardrails under Enhancements in the navigation pane.
  14. In the Global controls section, choose Edit.

  15. Select Allow Amazon Q Business to automatically orchestrate chat queries across plugins and data sources, then choose Save.

Configure API Gateway, Lambda, and DynamoDB resources

Everything related to API Gateway, Lambda, and DynamoDB is already configured using the CloudFormation template. Details are available on the Outputs tab of the stack details page. You can also review the details of the Lambda function and DynamoDB table on their respective service consoles. To learn how the Lambda function is exposed as an API through API Gateway, review the details on the API Gateway console.
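
For illustration, the following is a condensed sketch of what a DynamoDB-backed handler for the /products API could look like. It assumes a Lambda proxy integration and a table named products; the function actually deployed by the CloudFormation stack may differ:

# Illustrative sketch only; the Lambda function deployed by the CloudFormation stack
# may differ. Assumes an API Gateway Lambda proxy integration and a "products" table.
import json

import boto3

table = boto3.resource("dynamodb").Table("products")

def lambda_handler(event, context):
    method = event["httpMethod"]
    path_params = event.get("pathParameters") or {}

    if method == "GET" and "id" in path_params:
        item = table.get_item(Key={"id": path_params["id"]}).get("Item")
        return _response(200 if item else 404, item or {"error": "Product not found"})
    if method == "GET":
        return _response(200, table.scan().get("Items", []))
    if method == "POST":
        product = json.loads(event["body"])
        table.put_item(Item=product)
        return _response(201, product)
    if method == "PUT":
        product = json.loads(event["body"])
        product["id"] = path_params["id"]
        table.put_item(Item=product)
        return _response(200, product)
    if method == "DELETE":
        table.delete_item(Key={"id": path_params["id"]})
        return _response(204, {})
    return _response(400, {"error": "Unsupported operation"})

def _response(status, body):
    # default=str handles DynamoDB Decimal values during JSON serialization
    return {"statusCode": status, "body": json.dumps(body, default=str)}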

Chat with Amazon Q Business

Now you’re ready to chat with Amazon Q Business.

  1. On the Amazon Q Business console, navigate to your application.
  2. Choose the link for Deployed URL.
  3. Authenticate using IAM Identity Center (this is to make sure you have access to Amazon Q Business Pro).

You can now ask questions in natural language.

In the following example, we check if Amazon Q Business is able to access the data from the S3 bucket by asking “List all the products and their description in a table.”

After the product descriptions are available, start chatting and ask questions like "Can you create product <product name> with the same description please?" Alternatively, you can create a new product that isn't listed in the sample document uploaded in Amazon S3. Amazon Q Business will automatically pick the right plugin (in this case, Products).

Subsequent requests for API calls to go through the custom plugin will ask you to authorize your access. Choose Authorize and authenticate with the user credentials created in Amazon Cognito earlier. After you’re authenticated, Amazon Q Business will cache the session token for subsequent API calls and complete the request.

You can query on the products that are available in the backend by asking questions like the following:

  • Can you please list all the products?
  • Delete a product by ID or by name.
  • Create a new product with the name 'Gloves' and the description 'Football gloves with automatic in-built cooling'

Based on the preceding prompt, a product has been created in the products table in DynamoDB.

Cost considerations

The cost of setting up this solution is based on the price of the individual AWS services being used. Prices of those services are available on the individual service pages. The only mandatory cost is the Amazon Q Business Pro license. For more information, see Amazon Q Business pricing.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
  2. Delete the Amazon Q Business application.
  3. Delete the Amazon Cognito user pool domain.
  4. Empty and delete the S3 bucket. For instructions, refer to Deleting a general purpose bucket.

Conclusion

In this post, we explored how Amazon Q Business can seamlessly integrate with enterprise systems using a custom plugin to help enterprises unlock the value of their data. We walked you through the process of setting up the custom plugin, including configuring the necessary Amazon Cognito and authentication mechanisms.

With this custom plugin, organizations can empower their employees to work efficiently, find answers quickly, accelerate reporting, automate workflows, and enhance collaboration. You can ask Amazon Q Business questions in natural language and watch as it surfaces the most relevant information from your company's backend systems and acts on your requests.

Don’t miss out on the transformative power of generative AI and Amazon Q Business. Sign up today and experience the difference that Amazon Q Business can make for your organization’s workflow automation and the efficiency it brings.


About the Authors

Shubhankar Sumar is a Senior Solutions Architect at Amazon Web Services (AWS), working with enterprise software and SaaS customers across the UK to help architect secure, scalable, efficient, and cost-effective systems. He is an experienced software engineer, having built many SaaS solutions powered by generative AI. Shubhankar specializes in building multi-tenant systems on the cloud. He also works closely with customers to bring generative AI capabilities to their SaaS applications.

Dr. Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Ankur Agarwal is a Principal Enterprise Architect at Amazon Web Services Professional Services. Ankur works with enterprise clients to help them get the most out of their investment in cloud computing. He advises on using cloud-based applications, data, and AI technologies to deliver maximum business value.

Detect hallucinations for RAG-based systems

With the rise of generative AI and knowledge extraction in AI systems, Retrieval Augmented Generation (RAG) has become a prominent tool for enhancing the accuracy and reliability of AI-generated responses. RAG is a way to incorporate additional data that the large language model (LLM) was not trained on, which can also help reduce the generation of false or misleading information (hallucinations). However, even with RAG’s capabilities, the challenge of AI hallucinations remains a significant concern.

As AI systems become increasingly integrated into our daily lives and critical decision-making processes, the ability to detect and mitigate hallucinations is paramount. Most hallucination detection techniques focus on the prompt and the response alone. However, where additional context is available, such as in RAG-based applications, new techniques can be introduced to better mitigate the hallucination problem.

This post walks you through how to create a basic hallucination detection system for RAG-based applications. We also weigh the pros and cons of different methods in terms of accuracy, precision, recall, and cost.

Although there are currently many new state-of-the-art techniques, the approaches outlined in this post aim to provide simple, user-friendly techniques that you can quickly incorporate into your RAG pipeline to increase the quality of the outputs in your RAG system.

Solution overview

Hallucinations can be categorized into three types, as illustrated in the following graphic.

Scientific literature has come up with multiple hallucination detection techniques. In the following sections, we discuss and implement four prominent approaches to detecting hallucinations: using an LLM prompt-based detector, semantic similarity detector, BERT stochastic checker, and token similarity detector. Finally, we compare approaches in terms of their performance and latency.

Prerequisites

To use the methods presented in this post, you need an AWS account with access to Amazon SageMaker, Amazon Bedrock, and Amazon Simple Storage Service (Amazon S3).

From your RAG system, you will need to store three things:

  • Context – The area of text that is relevant to a user’s query
  • Question – The user’s query
  • Answer – The answer provided by the LLM

The resulting table should look similar to the following example.

question | context | answer
What are cocktails? | Cocktails are alcoholic mixed… | Cocktails are alcoholic mixed…
What are cocktails? | Cocktails are alcoholic mixed… | They have distinct histories…
What is Fortnite? | Fortnite is a popular video… | Fortnite is an online multi…
What is Fortnite? | Fortnite is a popular video… | The average Fortnite player spends…

Approach 1: LLM-based hallucination detection

We can use an LLM to classify the responses from our RAG system into context-conflicting hallucinations and facts. The aim is to identify which responses are based on the context or whether they contain hallucinations.

This approach consists of the following steps:

  1. Create a dataset with questions, context, and the response you want to classify.
  2. Send a call to the LLM with the following information:
    1. Provide the statement (the answer from the LLM that we want to classify).
    2. Provide the context from which the LLM created the answer.
    3. Instruct the LLM to tag sentences in the statement that are directly based on the context.
  3. Parse the outputs and obtain sentence-level numeric scores between 0 and 1.
  4. Make sure to keep the LLM, memory, and parameters independent from the ones used for Q&A. (This is so the LLM can’t access the previous chat history to draw conclusions.)
  5. Tune the decision threshold for the hallucination scores for a specific dataset based on domain, for example.
  6. Use the threshold to classify the statement as hallucination or fact.

Create a prompt template

To use the LLM to classify the answer to your question, you need to set up a prompt. We want the LLM to take in the context and the answer and determine a hallucination score from the given context. The score is encoded between 0 and 1, with 0 being an answer drawn directly from the context and 1 being an answer with no basis in the context.

The following is a prompt with few-shot examples so the LLM knows what the expected format and content of the answer should be:

prompt = """nnHuman: You are an expert assistant helping human to check if statements are based on the context.
 Your task is to read context and statement and indicate which sentences in the statement are based directly on the context.

Provide response as a number, where the number represents a hallucination score, which is a float between 0 and 1.
Set the float to 0 if you are confident that the sentence is directly based on the context.
Set the float to 1 if you are confident that the sentence is not based on the context.
If you are not confident, set the score to a float number between 0 and 1. Higher numbers represent higher confidence that the sentence is not based on the context.

Do not include any other information except for the score in the response. There is no need to explain your thinking.

<example>
Context: Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered, pay-as-you-go basis. Clients will often use this in combination with autoscaling (a process that allows a client to use more computing in times of high application usage, and then scale down to reduce costs when there is less traffic). These cloud computing web services provide various services related to networking, compute, storage, middleware, IoT and other processing capacity, as well as software tools via AWS server farms. This frees clients from managing, scaling, and patching hardware and operating systems. One of the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, with extremely high availability, which can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS's virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-disk/SSD storage; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, and customer relationship management (CRM).
Statement: 'AWS is Amazon subsidiary that provides cloud computing services.'
Assistant: 0.05
</example>

<example>
Context: Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered, pay-as-you-go basis. Clients will often use this in combination with autoscaling (a process that allows a client to use more computing in times of high application usage, and then scale down to reduce costs when there is less traffic). These cloud computing web services provide various services related to networking, compute, storage, middleware, IoT and other processing capacity, as well as software tools via AWS server farms. This frees clients from managing, scaling, and patching hardware and operating systems. One of the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, with extremely high availability, which can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS's virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-disk/SSD storage; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, and customer relationship management (CRM).
Statement: 'AWS revenue in 2022 was $80 billion.'
Assistant: 1
</example>

<example>
Context: Monkey is a common name that may refer to most mammals of the infraorder Simiiformes, also known as the simians. Traditionally, all animals in the group now known as simians are counted as monkeys except the apes, which constitutes an incomplete paraphyletic grouping; however, in the broader sense based on cladistics, apes (Hominoidea) are also included, making the terms monkeys and simians synonyms in regard to their scope. On average, monkeys are 150 cm tall.
Statement:'Average monkey is 2 meters high and weights 100 kilograms.'
Assistant: 0.9
</example>

Context: {context}
Statement: {statement}

\n\nAssistant: [
    """
# LangChain constructs
from langchain.prompts import PromptTemplate

# prompt template
prompt_template = PromptTemplate(
    template=prompt,
    input_variables=["context", "statement"],
)

Configure the LLM

To retrieve a response from the LLM, you need to configure the LLM using Amazon Bedrock, similar to the following code:

import boto3
from langchain_community.llms import Bedrock  # import path may vary by LangChain version


def configure_llm() -> Bedrock:

    model_params = {
        "answer_length": 100,  # max number of tokens in the answer
        "temperature": 0.0,  # temperature during inference
        "top_p": 1,  # cumulative probability of sampled tokens
        "stop_words": ["\n\nHuman:", "]"],  # words after which the generation is stopped
    }
    bedrock_client = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1",
    )

    MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

    llm = Bedrock(
        client=bedrock_client,
        model_id=MODEL_ID,
        model_kwargs=model_params,
    )

    return llm

Get hallucination classifications from the LLM

The next step is to use the prompt, dataset, and LLM to get hallucination scores for each response from your RAG system. Taking this a step further, you can use a threshold to determine whether the response is a hallucination or not. See the following code:

from langchain.chains import LLMChain


def get_response_from_claude(context: str, answer: str, prompt_template: PromptTemplate, llm: Bedrock) -> float:

    llm_chain = LLMChain(llm=llm, prompt=prompt_template, verbose=False)
    # compute scores
    response = llm_chain(
        {"context": context, "statement": str(answer)}
    )
    # LLMChain returns its output under the "text" key
    scores = response["text"].strip()
    try:
        scores = float(scores)
    except Exception:
        print(f"Could not parse LLM response: {scores}")
        scores = 0.0
    return scores
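
To tie the pieces together, the following sketch scores a single (context, answer) pair from your RAG system and applies a decision threshold; the placeholder strings and the 0.5 threshold are assumptions you should replace and tune for your own dataset:

# Sketch: score one (context, answer) pair and apply a tunable threshold.
llm = configure_llm()

context = "<context retrieved by your RAG system>"
answer = "<answer produced by the LLM>"

score = get_response_from_claude(context, answer, prompt_template, llm)

THRESHOLD = 0.5  # assumption: tune per dataset and domain
label = "hallucination" if score >= THRESHOLD else "fact"
print(f"score={score:.2f} -> {label}")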

Approach 2: Semantic similarity-based detection

Under the assumption that if a statement is a fact, then there will be high similarity with the context, you can use semantic similarity as a method to determine whether a statement is an input-conflicting hallucination.

This approach consists of the following steps:

  1. Create embeddings for the answer and the context using an LLM. (In this example, we use the Amazon Titan Embeddings model.)
  2. Use the embeddings to calculate similarity scores between each sentence in the answer and the context. (In this case, we use cosine similarity as a distance metric.) Out-of-context (hallucinated) sentences should have low similarity with the context.
  3. Tune the decision threshold for a specific dataset (such as domain dependent) to classify hallucinating statements.

Create embeddings with LLMs and calculate similarity

You can use LLMs to create embeddings for the context and the initial response to the question. After you have the embeddings, you can calculate the cosine similarity of the two. The cosine similarity score returns a number between 0 and 1, with 1 being perfect similarity and 0 being no similarity. To translate this to a hallucination score, we subtract the cosine similarity from 1. See the following code:

import numpy as np
from langchain_community.embeddings import BedrockEmbeddings
from sklearn.metrics.pairwise import cosine_similarity


def similarity_detector(
    context: str,
    answer: str,
    llm: BedrockEmbeddings,
) -> float:
    """
    Check hallucinations using semantic similarity methods based on embeddings

    Parameters
    ----------
    context : str
        Context provided for RAG
    answer : str
        Answer from an LLM
    llm : BedrockEmbeddings
        Embeddings model

    Returns
    -------
    float
        Semantic similarity score
    """

    if len(context) == 0 or len(answer) == 0:
        return 0.0
    # calculate embeddings
    context_emb = llm.embed_query(context)
    answer_emb = llm.embed_query(answer)
    context_emb = np.array(context_emb).reshape(1, -1)
    answer_emb = np.array(answer_emb).reshape(1, -1)
    sim_score = cosine_similarity(context_emb, answer_emb)
    return 1 - sim_score[0][0]
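
A brief usage sketch follows; it assumes the LangChain BedrockEmbeddings wrapper and the amazon.titan-embed-text-v1 model ID, with placeholder strings for the context and answer:

# Sketch: score an answer against its context with the semantic similarity detector.
import boto3
from langchain_community.embeddings import BedrockEmbeddings

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")
embeddings = BedrockEmbeddings(
    client=bedrock_client,
    model_id="amazon.titan-embed-text-v1",  # assumption: any Titan embeddings model works
)

score = similarity_detector(
    context="<context retrieved by your RAG system>",
    answer="<answer produced by the LLM>",
    llm=embeddings,
)
print(f"semantic hallucination score: {score:.2f}")  # closer to 1 = more likely hallucinated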

Approach 3: BERT stochastic checker

The BERT score uses the pre-trained contextual embeddings from a pre-trained language model such as BERT and matches words in candidate and reference sentences by cosine similarity. One of the traditional metrics for evaluation in natural language processing (NLP) is the BLEU score. The BLEU score primarily measures precision by calculating how many n-grams (consecutive tokens) from the candidate sentence appear in the reference sentences. It focuses on matching these consecutive token sequences between candidate and reference sentences, while incorporating a brevity penalty to prevent overly short translations from receiving artificially high scores. Unlike the BLEU score, which focuses on token-level comparisons, the BERT score uses contextual embeddings to capture semantic similarities between words or full sentences. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, the BERT score computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

In our approach, we use the BERT score as a stochastic checker for hallucination detection. The idea is that if you generate multiple answers from an LLM and there are large variations (inconsistencies) between them, then there is a good chance that these answers are hallucinated. We first generate N random samples (sentences) from the LLM. We then compute BERT scores by comparing each sentence in the original generated paragraph against its corresponding sentence across the N newly generated stochastic samples. This is done by embedding all sentences using an LLM based embedding model and calculating cosine similarity. Our hypothesis is that factual sentences will remain consistent across multiple generations, resulting in high BERT scores (indicating similarity). Conversely, hallucinated content will likely vary across different generations, resulting in low BERT scores between the original sentence and its stochastic variants. By establishing a threshold for these similarity scores, we can flag sentences with consistently low BERT scores as potential hallucinations, because they demonstrate semantic inconsistency across multiple generations from the same model.
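
The code for this approach isn't shown in the post, but a simplified sketch follows. It assumes a generate(question, context) helper (a hypothetical function that re-queries your LLM with a non-zero temperature) and uses the open-source bert_score package; it compares whole answers rather than individual sentences, and the threshold is an assumption to tune:

# Simplified sketch of the stochastic checker, not the exact implementation described above.
from bert_score import score as bert_score


def stochastic_checker(question: str, context: str, original_answer: str,
                       generate, n_samples: int = 3, threshold: float = 0.7) -> dict:
    """Flag an answer as a potential hallucination if it is inconsistent
    across multiple stochastic generations from the same model."""
    # re-generate the answer N times with sampling enabled
    samples = [generate(question, context) for _ in range(n_samples)]

    # BERTScore F1 between the original answer and each stochastic sample
    _, _, f1 = bert_score([original_answer] * n_samples, samples, lang="en")
    mean_f1 = float(f1.mean())

    # low average similarity across samples suggests inconsistency (hallucination)
    return {"bert_score_f1": mean_f1, "hallucination": mean_f1 < threshold}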

Approach 4: Token similarity detection

With the token similarity detector, we extract unique sets of tokens from the answer and the context. Here, we can use one of the LLM tokenizers or simply split the text into individual words. Then, we calculate similarity between each sentence in the answer and the context. There are multiple metrics that can be used for token similarity, including a BLEU score over different n-grams, a ROUGE score (an NLP metric similar to BLEU but calculates recall vs. precision) over different n-grams, or simply the proportion of the shared tokens between the two texts. Out-of-context (hallucinated) sentences should have low similarity with the context.

import re
import evaluate


def intersection_detector(
    context: str,
    answer: str,
    length_cutoff: int = 3,
) -> dict[str, float]:
    """
    Check hallucinations using token intersection metrics

    Parameters
    ----------
    context : str
        Context provided for RAG
    answer : str
        Answer from an LLM
    length_cutoff : int
        If no. tokens in the answer is smaller than length_cutoff, return scores of 1.0

    Returns
    -------
    dict[str, float]
        Token intersection and BLEU scores
    """

    # populate with relevant stopwords such as articles
    stopword_set = set()

    # remove punctuation and lowercase
    context = re.sub(r"[^\w\s]", "", context).lower()
    answer = re.sub(r"[^\w\s]", "", answer).lower()

    # calculate metrics
    if len(answer) >= length_cutoff:
        # calculate token intersection
        context_split = {term for term in re.findall(r"\w+", context) if term not in stopword_set}
        answer_split = re.findall(r"\w+", answer)
        answer_split = {term for term in answer_split if term not in stopword_set}
        intersection = sum([term in context_split for term in answer_split]) / len(answer_split)

        # calculate BLEU score
        bleu = evaluate.load("bleu")
        bleu_score = bleu.compute(predictions=[answer], references=[context])["precisions"]
        bleu_score = sum(bleu_score) / len(bleu_score)

        return {
            "intersection": 1 - intersection,
            "bleu": 1 - bleu_score,
        }

    return {"intersection": 0, "bleu": 0}

Comparing approaches: Evaluation results

In this section, we compare the hallucination detection approaches described in the post. We run an experiment on three RAG datasets, including Wikipedia article data and two synthetically generated datasets. Each example in a dataset includes a context, a user’s question, and an LLM answer labeled as correct or hallucinated. We run each hallucination detection method on all questions and aggregate the accuracy metrics across the datasets.

The highest accuracy (number of sentences correctly classified as hallucination vs. fact) is demonstrated by the BERT stochastic checker and the LLM prompt-based detector. The LLM prompt-based detector outperforms the BERT checker in precision, and the BERT stochastic checker has a higher recall. The semantic similarity and token similarity detectors show very low accuracy and recall but perform well with regards to precision. This indicates that those detectors might only be useful to identify the most evident hallucinations.

Aside from the token similarity detector, the LLM prompt-based detector is the most cost-effective option in terms of the number of LLM calls because it’s constant relative to the size of the context and the response (but cost will vary depending on the number of input tokens). The semantic similarity detector cost is proportional to the number of sentences in the context and the response, so as the context grows, this can become increasingly expensive.

The following table summarizes the metrics for each method. For use cases where precision is the highest priority, we would recommend the token similarity, LLM prompt-based, and semantic similarity methods, whereas for high recall, the BERT stochastic method outperforms the others.

Technique | Accuracy* | Precision* | Recall* | Cost (Number of LLM Calls) | Explainability
Token Similarity Detector | 0.47 | 0.96 | 0.03 | 0 | Yes
Semantic Similarity Detector | 0.48 | 0.90 | 0.02 | K*** | Yes
LLM Prompt-Based Detector | 0.75 | 0.94 | 0.53 | 1 | Yes
BERT Stochastic Checker | 0.76 | 0.72 | 0.90 | N+1** | Yes

*Averaged over Wikipedia dataset and generative AI synthetic datasets
**N = Number of random samples
***K = Number of sentences

These results suggest that an LLM-based detector shows a good trade-off between accuracy and cost (additional answer latency). We recommend using a combination of a token similarity detector to filter out the most evident hallucinations and an LLM-based detector to identify more difficult ones.
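
As a rough sketch of that recommendation, the following combines the detector functions defined earlier; both thresholds are assumptions to tune on a labeled sample of your own data:

# Sketch of the recommended two-stage check, reusing the detectors defined earlier.
def two_stage_hallucination_check(context, answer, prompt_template, llm,
                                  token_threshold=0.8, llm_threshold=0.5):
    # Stage 1: cheap token-intersection screen catches the most evident hallucinations
    token_scores = intersection_detector(context, answer)
    if token_scores["intersection"] >= token_threshold:
        return {"hallucination": True, "stage": "token_similarity", **token_scores}

    # Stage 2: LLM prompt-based detector handles the harder, more subtle cases
    llm_score = get_response_from_claude(context, answer, prompt_template, llm)
    return {"hallucination": llm_score >= llm_threshold,
            "stage": "llm_prompt", "llm_score": llm_score}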

Conclusion

As RAG systems continue to evolve and play an increasingly important role in AI applications, the ability to detect and prevent hallucinations remains crucial. Through our exploration of four different approaches—LLM prompt-based detection, semantic similarity detection, BERT stochastic checking, and token similarity detection—we’ve demonstrated various methods to address this challenge. Although each approach has its strengths and trade-offs in terms of accuracy, precision, recall, and cost, the LLM prompt-based detector shows particularly promising results with accuracy rates above 75% and a relatively low additional cost. Organizations can choose the most suitable method based on their specific needs, considering factors such as computational resources, accuracy requirements, and cost constraints. As the field continues to advance, these foundational techniques provide a starting point for building more reliable and trustworthy RAG systems.


About the Authors

 Zainab Afolabi is a Senior Data Scientist at the Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over eight years of specialised experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.

Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.

Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center working on the frontier of AI research and business. Nikita builds generative AI solutions to solve real-world business problems for AWS customers across industries and holds PhD in Machine Learning.

Liza (Elizaveta) Zinovyeva is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.

AWS machine learning supports Scuderia Ferrari HP pit stop analysis

In Formula 1® (F1), one of the fastest sports in the world, almost everything is a race, even the pit stops. F1 drivers need to stop to change tires or make repairs to damage sustained during a race. Each precious tenth of a second the car is in the pit is lost time in the race, which can mean the difference between making the podium or missing out on championship points. Pit crews are trained to operate at optimum efficiency, although measuring their performance has been challenging, until now. In this post, we share how Amazon Web Services (AWS) is helping Scuderia Ferrari HP develop more accurate pit stop analysis techniques using machine learning (ML).

Challenges with pit stop performance analysis

Historically, analyzing pit stop performance has required track operations engineers to painstakingly review hours of footage from cameras placed at the front and the rear of the pit, then correlate the video to the car’s telemetry data. For a typical race weekend, engineers receive an average of 22 videos for 11 pit stops (per driver), amounting to around 600 videos per season. Along with being time-consuming, reviewing footage manually is prone to inaccuracies. Since implementing the solution with AWS, track operations engineers can synchronize the data up to 80% faster than manual methods.

Modernizing through partnership with AWS

The partnership with AWS is helping Scuderia Ferrari HP modernize the challenging process of pit stop analysis, by using the cloud and ML.

“Previously, we had to manually analyze multiple video recordings and telemetry data separately, making it difficult to identify inefficiencies and increasing the risk of missing critical details. With this new approach, we can now automate and centralize the analysis, gaining a clearer and more immediate understanding of every pit stop, helping us detect errors faster and refine our processes.”

– Marco Gaudino, Digital Transformation Racing Application Architect

The solution uses object detection deployed in Amazon SageMaker AI to synchronize video capture with telemetry data from pit crew equipment, and the serverless event-driven architecture optimizes the use of compute infrastructure. Because Formula 1 teams must comply with the strict budget and compute resource caps imposed by the FIA, on-demand AWS services help Scuderia Ferrari HP avoid expensive infrastructure overhead.

Driving innovation together

AWS has been a Scuderia Ferrari HP Team Partner as well as the Scuderia Ferrari HP Official Cloud, Machine Learning Cloud, and Artificial Intelligence Cloud Provider since 2021, partnering to power innovation on and off the track. When it comes to performance racing, AWS and Scuderia Ferrari HP regularly work together to identify areas for improvement and build new solutions. For example, these collaborations have helped reduce vehicle weight using ML by implementing a virtual ground speed sensor, streamlined the power unit assembly process, and accelerated the prototyping of new commercial vehicle designs.

After starting development in late 2023, the pit stop solution was first tested in March 2024 at the Australian Grand Prix. It quickly moved into production at the 2024 Japanese Grand Prix, held April 7, and now provides the Scuderia Ferrari HP team with a competitive edge.

Taking the solution a step further, Scuderia Ferrari HP is already working on a prototype to detect anomalies during pit stops automatically, such as difficulties in lifting the car when the trolley fails to lift, or issues during the installation and removal of tires by the pit crew. It’s also deploying a new, more performant camera setup for the 2025 season, with four cameras shooting 120 frames per second instead of the previous two cameras shooting 25 frames per second.

Developing the ML-powered pit stop analysis solution

The new ML-powered pit stop analysis solution automatically correlates video progression with the associated telemetry data. It uses object detection to identify green lights, then precisely synchronizes the video and telemetry data, so engineers can review the synchronized video through a custom visualization tool. This automatic method is more efficient and more accurate than the previous manual approach. The following image shows the object detection of the green light during a pit stop.

“By systematically reviewing every pit stop, we can identify patterns, detect even the smallest inefficiencies, and refine our processes. Over time, this leads to greater consistency and reliability, reducing the risk of errors that could compromise race results,” says Gaudino.

To develop the pit stop analysis solution, the model was trained using videos from the 2023 racing season and the YOLO v8 algorithm for object identification in SageMaker AI through the PyTorch framework. AWS Lambda and SageMaker AI are the core components of the pit stop analysis solution.

The workflow consists of the following steps:

  1. When a driver conducts a pit stop, front and rear videos of the stop are automatically pushed to Amazon Simple Storage Service (Amazon S3).
  2. From there, Amazon EventBridge invokes the entire process through various Lambda functions, triggering video processing through a system of multiple Amazon Simple Queue Service (Amazon SQS) queues and Lambda functions that execute custom code to handle critical business logic.
  3. These Lambda functions retrieve the timestamp from videos, then merge the front and rear videos with the number of video frames containing green lights to ultimately match the merged video with car and racing telemetry (for example, screw gun behavior).

The system also includes the use of Amazon Elastic Container Service (Amazon ECS) with multiple microservices, including one that integrates with its ML model in SageMaker AI. Previously, to manually correlate the data, the process took a few minutes per pit stop. Now, the entire process is completed in 60–90 seconds, producing near real-time insights.
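
To make the event-driven pattern concrete, the following is an illustrative sketch of one step in such a pipeline: a Lambda function that receives an Amazon S3 "Object Created" event routed through EventBridge and enqueues a video-processing job on Amazon SQS. This is not Scuderia Ferrari HP's actual code; the queue URL environment variable and message fields are placeholders:

# Illustrative only; not Scuderia Ferrari HP's implementation.
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["VIDEO_PROCESSING_QUEUE_URL"]  # placeholder environment variable

def lambda_handler(event, context):
    # EventBridge "Object Created" event delivered from Amazon S3
    detail = event["detail"]
    message = {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
    }
    # hand the video off to the processing queue for downstream Lambda workers
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))
    return {"queued": message["key"]}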

The following figure shows the architecture diagram of the solution.

Conclusion

The new pit stop analysis solution allows for a quick and systematic review of every pit stop to identify patterns and refine processes. After five races in the 2025 season, Scuderia Ferrari HP recorded the fastest pit stop in each race, with a season best of 2 seconds flat in Saudi Arabia for Charles Leclerc. Diligent work, coupled with the ML-powered solution, gets drivers back on track faster, helping the team focus on achieving the best possible result.

To learn more about building, training, and deploying ML models with fully managed infrastructure, see Getting started with Amazon SageMaker AI. For more information about how Ferrari uses AWS services, refer to the following additional resources:


About the authors

Alessio Ludovici is a Solutions Architect at AWS.

Accelerate edge AI development with SiMa.ai Edgematic with a seamless AWS integration

This post is co-authored by Manuel Lopez Roldan, SiMa.ai, and Jason Westra, AWS Senior Solutions Architect.

Are you looking to deploy machine learning (ML) models at the edge? With Amazon SageMaker AI and SiMa.ai’s Palette Edgematic platform, you can efficiently build, train, and deploy optimized ML models at the edge for a variety of use cases. Designed to work on SiMa’s MLSoC (Machine Learning System on Chip) hardware, your models will have seamless compatibility across the entire SiMa.ai product family, allowing for effortless scaling, upgrades, transitions, and mix-and-match capabilities—ultimately minimizing your total cost of ownership.

In safety-critical environments like warehouses, construction sites, and manufacturing floors, detecting human presence and safety equipment in restricted areas can prevent accidents and enforce compliance. Cloud-based image recognition often falls short in safety use cases where low latency is essential. However, by deploying an object detection model optimized to detect personal protective equipment (PPE) on SiMa.ai MLSoC, you can achieve high-performance, real-time monitoring directly on edge devices without the latency typically associated with cloud-based inference.

In this post, we demonstrate how to retrain and quantize a model using SageMaker AI and the SiMa.ai Palette software suite. The goal is to accurately detect individuals in environments where visibility and protective equipment detection are essential for compliance and safety. We then show how to create a new application within Palette Edgematic in just a few minutes. This streamlined process enables you to deploy high-performance, real-time monitoring directly on edge devices, providing low latency for fast, accurate safety alerts, and it supports an immediate response to potential hazards, enhancing overall workplace safety.

Solution overview

The solution integrates SiMa.ai Edgematic with SageMaker JupyterLab to deploy an ML model, YOLOv7, to the edge. YOLO models are computer vision and ML models for object detection and image segmentation.

The following diagram shows the solution architecture you will follow to deploy a model to the edge. Edgematic offers a seamless, low-code no-code, end-to-end cloud-based pipeline, from model preparation to edge deployment. This approach provides high performance and accuracy, alleviates the complexity of managing updates or toolchain maintenance on devices, and simplifies inference testing and performance evaluation on edge hardware. This workflow makes sure AI applications run entirely on the edge without needing continuous cloud connectivity, decreasing latency issues, reducing security risks, and keeping data in-house.

The solution workflow comprises two main stages:

  • ML training and exporting – During this phase, you train and validate the model in SageMaker AI so it’s ready for SiMa.ai edge deployment. You then optimize and compile the model using the SiMa.ai SDKs to load, quantize, test, and compile models from frameworks like PyTorch, TensorFlow, and ONNX, producing binaries that run efficiently on the SiMa.ai Machine Learning Accelerator.
  • ML edge evaluation and deployment – Next, you transfer the compiled model artifacts to Edgematic for a streamlined deployment to the edge device. Finally, you validate the model’s real-time performance and accuracy directly on the edge device, making sure it meets the safety monitoring requirements.

The steps to build your solution are as follows:

  1. Create a custom image for SageMaker JupyterLab.
  2. Launch SageMaker JupyterLab with your custom image.
  3. Train the object detection model on the SageMaker JupyterLab notebook.
  4. Perform graph surgery, quantization, and compilation.
  5. Move the edge optimized model to SiMa.ai Edgematic software to evaluate its performance.

Prerequisites

Before you get started, make sure you have the following:

Create a custom image for SageMaker JupyterLab

SageMaker AI provides ML capabilities for data scientists and developers to prepare, build, train, and deploy high-quality ML models efficiently. It has numerous features, including SageMaker JupyterLab, which enables ML developers to rapidly build, train, and deploy models. SageMaker JupyterLab allows you to create a custom image, then access it from within JupyterLab environments. You will access Palette APIs to build, train, and optimize your object detection model for the edge, from within a familiar user experience in the AWS Cloud. To set up SageMaker JupyterLab to integrate with Palette, complete the steps in this section.

Set up SageMaker AI and Amazon ECR

Provision the necessary AWS resources within the us-east-1 AWS Region. Create a SageMaker domain and user to train models and run Jupyter notebooks. Then, create an Amazon Elastic Container Registry (Amazon ECR) private repository to store Docker images.
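
If you prefer to script the repository creation, a minimal boto3 sketch follows; the repository name is a placeholder, and us-east-1 matches the Region used in this post:

# Sketch: create the private ECR repository that will host the Palette image.
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")
repo = ecr.create_repository(
    repositoryName="sima-palette",  # placeholder name; choose your own
    imageScanningConfiguration={"scanOnPush": True},
)
print(repo["repository"]["repositoryUri"])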

Download the SiMa.ai SageMaker Palette Docker image

Palette is a Docker container that contains the necessary tools to quantize and compile ML models for SiMa.ai MLSoC devices. SiMa.ai provides an AWS compatible Palette version that integrates seamlessly with SageMaker JupyterLab. From it, you can attach to the necessary GPUs you need to train, export to ONNX format, optimize, quantize, and compile your model—all within a familiar ML environment on AWS.

Download the Docker image from the Software Downloads page on the SiMa.ai Developer Portal (see the following screenshot) and then download the sample Jupyter notebook from the following SiMa.ai GitHub repository. You can choose to scan the image to maintain a secure posture.

SiMa Developer Portal

Build and tag a custom Docker image ECR URI

The following steps require that you have set up your AWS Management Console credentials, have set up an IAM user with AmazonEC2ContainerRegistryFullAccess permissions, and can successfully perform Docker login to AWS. For more information, see Private registry authentication in Amazon ECR.

Tag the image that you downloaded from the SiMa.ai Developer Access portal using the AWS CLI and then push it to Amazon ECR to make it available to SageMaker JupyterLab. On the Amazon ECR console, navigate to the registry you created to locate the ECR URI of the image. Your console experience will look similar to the following screenshot.

Example ECR Repository

Copy the URI of the repository and use it to set the ECR environment variable in the following command:

# setup variables as per your AWS environment
REGION=<your region here>
AWS_ACCOUNT_ID=<your 12 digit AWS Account ID here>
ECR=$AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/<your ECR repository name here>

Now that you’ve set up your environment variables and with Docker running locally, you can enter the following commands. If you haven’t used SageMaker AI before, you might have to create a new IAM user and attach the AmazonEC2ContainerRegistryPowerUser policy and then run the aws configure command.

# login to the ECR repository
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

Upon receiving a “Login Succeeded” message, you’re logged in to Amazon ECR and can run the following Docker commands to tag the image and push it to Amazon ECR:

# Load the palette.tar image into docker
docker load < palette.tar
docker tag palette/sagemaker $ECR
docker push $ECR

The Palette image is over 25 GB. Therefore, with a 20 Mbps internet connection, the docker push operation can take several hours to upload to AWS.

Configure SageMaker with the custom image

After you upload the custom image to Amazon ECR, you configure SageMaker JupyterLab to use it. We recommend watching the two-minute SageMaker AI/Palette Edgematic video to guide you as you walk through the steps to configure JupyterLab.

  1. On the Amazon ECR console, navigate to the private registry, choose your repository from the list, choose Images, then choose Copy URI.
  2. On the SageMaker AI console, choose Images in the navigation pane, and choose Create Image.
  3. Provide your ECR URI and choose Next.
  4. For Image properties, fill in the following fields. When filling in the fields, make sure that the image name and display name don’t use capital letters or special characters.
    1. For Image name, enter palette.
    2. For Image display name, enter palette.
    3. For Description, enter Custom palette image for SageMaker AI integration.
    4. For IAM role, either choose an existing role or create a new role (recommended).
  5. For Image type, choose JupyterLab image.
  6. Choose Submit.

Verify your custom image looks similar to that in the video example.

  1. If everything matches, navigate to Admin configurations, Domains, and choose your domain.
  2. On the Environment tab, choose Attach image in the Custom images for personal Studio apps section.
  3. Choose Existing Image, select your Palette image with the latest version, and choose Next.

Settings in the Image properties section are defaulted for your convenience, but you can choose a different IAM role and Amazon Elastic File System (Amazon EFS) mount path, if needed.

  1. For this post, leave the defaults and choose the JupyterLab image option.
  2. To finish, choose Submit.
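If you would rather script the image registration than click through the console, the preceding steps roughly map to the following SageMaker API calls; the role ARN and ECR URI are placeholders, and attaching the image to your domain still requires updating the domain's custom image settings as described above.

# Rough sketch of registering the custom image with SageMaker via boto3.
# The role ARN and ECR image URI below are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Register the custom image
sm.create_image(
    ImageName="palette",
    DisplayName="palette",
    Description="Custom palette image for SageMaker AI integration",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
)

# Point an image version at the Docker image pushed to Amazon ECR
sm.create_image_version(
    ImageName="palette",
    BaseImage="111122223333.dkr.ecr.us-east-1.amazonaws.com/palette:latest",  # placeholder
)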

Launch SageMaker JupyterLab with your custom image

With the Palette image configured, you are ready to launch SageMaker JupyterLab in Amazon SageMaker Studio and work in your custom environment.

  1. Following the video as your guide, go to the User profiles section of your SageMaker domain and choose Launch, Studio.
  2. In SageMaker Studio, choose Applications, JupyterLab.
  3. Choose Create JupyterLab space.
  4. For Name, enter a name for your new JupyterLab Space.
  5. Choose Create Space.
  6. For Instance, a GPU-based instance with at least 16 GB memory is recommended for the Model SDK to train efficiently. Both instance types, ml.g4dn.xlarge with Fast Launch and ml.g4dn.2xlarge, work. Allocate at least 30 GB of disk space.

When selecting an instance with a GPU, you might need to request a quota increase for that instance type. For more details, see Requesting a quota increase.

  1. For Image, choose the new custom attached image you created in the prior step.
  2. Choose Run space to start JupyterLab.
  3. Choose Open JupyterLab when the status is Running.

Congratulations! You’ve created a custom image for SageMaker JupyterLab using the Palette image and launched a JupyterLab space.

Train the object detection model on a SageMaker JupyterLab notebook

Now you are able to prepare the model for the edge using the Palette Model SDK. In this section, we walk through the sample SiMa.ai Jupyter notebook so you understand how to work with the YOLOv7 model and prepare it to run on SiMa.ai devices.

To download the notebook from the SiMa.ai GitHub repository, open a terminal in your notebook and run a git clone command. This will clone the repository to your instance and from there you can launch the yolov7.ipynb file.

To run the notebook, change the Amazon Simple Storage Service (Amazon S3) bucket name in the variable s3_bucket in the third cell to an S3 bucket such as the one generated with the SageMaker domain.
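For example, the change might look like the following; the bucket name is a placeholder for a bucket you own.

# Hypothetical example: point the notebook at an S3 bucket you own,
# such as the default bucket created with your SageMaker domain
s3_bucket = "sagemaker-us-east-1-111122223333"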

To run all the cells in the notebook, choose the restart-and-run-all (double arrow) icon at the top of the notebook to restart the kernel and run every cell.

The yolov7.ipynb notebook describes in detail how to prepare the model package and how to optimize and compile the model. The following section covers only the key parts of the notebook as they relate to SiMa.ai Palette and the training of your workplace safety model. Describing every cell is out of scope for this post.

Jupyter notebook walkthrough

To recognize human heads and protective equipment, you will use the notebook to fine-tune the model to recognize these classes of objects. The following Python code defines the classes to detect, and it uses the open source open-images-v7 dataset and the fiftyone library to retrieve a set of 8,000 labeled images per class to train the model effectively. 75% of images are used for training and 25% for validation of the model. This cell also structures the dataset into YOLO format, optimizing it for your training workflow.

classes = ['Person', 'Human head', 'Helmet']
...
     dataset = fiftyone.zoo.load_zoo_dataset(
                "open-images-v7",
                split="train",
                label_types=["detections"],
                classes=classes,
                max_samples=total,
            )
...
    dataset.export(
        dataset_type=fiftyone.types.YOLOv5Dataset,
        labels_path=path,
        classes=classes,
    )

The next important cell configures the dataset and downloads the required weights. This walkthrough uses the yolov7-tiny weights, but you can choose another YOLOv7 variant; each is distributed under the GPL-3.0 license. YOLOv7 achieves better accuracy than YOLOv7-Tiny but takes longer to train. After choosing the variant you prefer, retrain the model by running the following command:

!cd yolov7 && python3 train.py --workers 4 --device 0 --batch-size 16 --data data/custom.yaml --img 640 640 --cfg cfg/training/yolov7-tiny.yaml --weights 'yolov7-tiny.pt' --name sima-yolov7 --hyp data/hyp.scratch.custom.yaml --epochs 10

Training for 10 epochs on the new dataset with the yolov7-tiny weights achieves a mAP of approximately 0.6, which should deliver accurate detection of the new classes. Finally, export the trained model to ONNX format with the following command:

!cd yolov7 && python3 export.py --weights runs/train/sima-yolov7/weights/best.pt --grid --end2end --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640 --max-wh 640

Perform graph surgery, quantization, and compilation

To optimize the architecture, you must make modifications to the YOLOv7 model in ONNX format. In the following figure, the scissors and dotted red line show where graph surgery is performed on a YOLOv7 model. How is graph surgery different from model pruning? Model pruning reduces the overall size and complexity of a neural network by removing less significant weights or entire neurons, whereas graph surgery restructures the computational graph by modifying or replacing specific operations to provide compatibility with target hardware without changing the model’s learned parameters. The net effect is that you replace unsupported operations on the detection heads, such as Reshape, Split, and Concat, with mathematically equivalent operations (point-wise convolutions) that the hardware supports. Afterwards, you remove the postprocessing operations from the ONNX graph; they are handled later in the application’s postprocessing logic.

How Model Surgery Works

See the following code:

# Load the exported ONNX model
model = onnx.load(f"{model_name}.onnx")
...
# Graph surgery: strip unsupported head operations and replace them with
# mathematically equivalent point-wise convolutions, then fix up the outputs
remove_nodes(model)
insert_pointwise_conv(model)
update_elmtwise_const(model)
update_output_nodes(model)
...
# Save the modified graph for quantization and compilation
onnx.save(model, ONNX_MODEL_NAME)

After surgery, you quantize the model. Quantization simplifies AI models by reducing the precision of the data they use from float 32-bit to int 8-bit, making models smaller, faster, and more efficient to run at the edge. Quantized models consume less power and resources, which is critical for deploying on lower-powered devices and optimizing overall efficiency. The following code quantizes your model using the validation dataset. It also runs some inference using the quantized model to provide insight about how well the model is performing after post-training quantization.

...
# Load the model for quantization
loaded_net = _load_model()
# Quantize the model to int8 using MSE-based histogram calibration
quant_configs = default_quantization.with_calibration(HistogramMSEMethod(num_bins=1024))
calibration_data = _make_calibration_data()
quantized_net = loaded_net.quantize(calibration_data=calibration_data, quantization_config=quant_configs)
...
    if QUANTIZED:
        # Run a sample inference with the quantized network to sanity-check accuracy
        preprocessed_image1 = preprocess(img=image, input_shape=(640, 640)).transpose(0, 2, 3, 1)
        inputs = {InputName('images'): preprocessed_image1}
        out = quantized_net.execute(inputs)

Because quantization reduces precision, verify that the model accuracy remains high by testing some predictions. After validation, compile the model to generate files that enable it to run on SiMa.ai MLSoC devices, along with the required configuration for supporting plugins. This compilation produces an .lm file, the binary executable for the ML accelerator in the MLSoC, and a .json file containing configuration details like input image size and quantization type.

saved_mpk_directory = "./compiled_yolov7"
quantized_net.save("yolov7", output_directory=saved_mpk_directory)
quantized_net.compile(output_path=saved_mpk_directory, compress=False)

The notebook uploads the compiled artifacts to the S3 bucket you specified, then generates a pre-signed link that is valid for 30 minutes. If the link expires, rerun the last cell. Copy the generated link at the end of the notebook; you will use it in SiMa.ai Edgematic shortly.

s3.meta.client.upload_file(file_name, S3_BUCKET_NAME, f"models/{name}.tar.gz")
...
presigned_url = s3_client.generate_presigned_url(
    ClientMethod="get_object",
    Params={
        "Bucket": s3_bucket,
        "Key": object_key,
    },
    ExpiresIn=1800,  # 30 minutes
)

Move the model to SiMa.ai Edgematic to evaluate its performance

After you complete your cloud-based model fine-tuning in AWS, transition to Edgematic for building the complete edge application, including plugins for preprocessing and postprocessing. Edgematic integrates the optimized model with essential plugins, like UDP sync for data transmission, video encoders for streaming predictions, and preprocessing tailored for the SiMa.ai MLA. These plugins are provided as drag-and-drop blocks, improving developer productivity by eliminating the need for custom coding. After it’s configured, Edgematic compiles and deploys the application to the edge device, transforming the model into a functional, real-world AI application.

  1. To begin, log in to Edgematic, create a new project, and drag and drop the YoloV7 pipeline under Developer Community.

Edgematic Application Drag n Drop

  1. To run your YOLOv7 workplace safety application, request a device and choose the play icon. The application will be compiled, installed on the remote device assigned upon login, and it will begin running. After 30 seconds, the complete application will be running on the SiMa.ai MLSoC and you will see that it detects people in the video stream.
  2. Choose the Models tab, then choose Add Model.
  3. Choose the Amazon S3 pre-signed link, enter the previously copied link, then choose Add.

Your model will appear under User defined on the Models tab. You can open the model folder and choose Run to get KPIs on the model such as frames per second.

Edgematic Paste S3 Link

Next, you will change the existing people detection pipeline to a PPE use case by replacing the existing YOLOv7 model with your newly trained PPE model.

  1. To change the model, stop the pipeline by choosing the stop icon.
  2. Choose Delete to delete the YOLOv7 block of the application.

Edgematic Delete Plugin Group

  1. Drag and drop your new model, imported under the User defined folder on the Models tab.

Edgematic Get KPIs

Now you connect it back to the blocks that YOLOv7 was connected to.

  1. First, change the tool in canvas to Connect, then choose the connecting points between the respective plugins.
  2. Choose the play icon.

Edgematic Connect Model

After the application is deployed on the SiMa.ai MLSoC, you should see the detections of categories such as “Human head,” “Person,” and “Glasses,” as seen in the following screenshot.

Original versus re-trained model results

Next, you change the application postprocessing logic from people detection to PPE detection. This is done by adding business logic to the postprocessing step that detects whether PPE is present. For this post, the PPE logic has already been written, and you just enable it.

  1. First, stop the previous application by choosing the stop icon.
  2. Next, in the Explorer section, locate the file named YoloV7_Post_Overlay.py under yolov7, plugins, YoloV7_Post_Overlay.
  3. Open the file and change the variable self.PPE on line 36 from False to True.
  4. Rerun the application by choosing the play icon.

Visualization detected unsafe

  1. Finally, you can add a custom video by choosing the gear icon on the first application plugin called rtspsrc_1, and on the Type dropdown menu, choose Custom video, then upload a custom video.

For example, the following video frame illustrates how the model at the edge detects the PPE equipment and labels the workers as safe.

Visualization detected safe

Clean up

To avoid ongoing costs, clean up your resources. In SiMa.ai Edgematic, choose your profile picture at the top right and sign out. To avoid additional costs on AWS, we recommend that you shut down the JupyterLab space by choosing the stop icon for your domain and user. For more details, see Where to shut down resources per SageMaker AI features.

Conclusion

This post demonstrated how to use SageMaker AI and Edgematic to retrain object detection models such as YOLOv7 in the cloud, then optimize these models for edge deployment, and build an entire edge application within minutes without the need for custom coding.

The streamlined workflow using SiMa.ai Palette on SageMaker JupyterLab helps ML applications achieve high performance, low latency, and energy efficiency, while minimizing the complexity of development and deployment. Whether you’re enhancing workplace safety with real-time monitoring or deploying advanced AI applications at the edge, SiMa.ai solutions empower developers to accelerate innovation and bring cutting-edge technology to the real world efficiently and effectively.

Experience firsthand how Palette Edgematic and SageMaker AI can streamline your ML workflow from cloud to edge. Get started today.

Together, let’s accelerate the future of edge AI.

Additional resources


About the Authors

Manuel Lopez Roldan is a Product Manager at SiMa.ai, focused on growing the user base and improving the usability of software platforms for developing and deploying AI. With a strong background in machine learning and performance optimization, he leads cross-functional initiatives to deliver intuitive, high-impact developer experiences that drive adoption and business value. He is also an advocate for industry innovation, sharing insights on how to accelerate AI adoption at the edge through scalable tools and developer-centric design.

Jason Westra is a Senior Solutions Architect at AWS based in Colorado, where he helps startups build innovative products with Generative AI and ML. Outside of work, he is an avid outdoorsman, backcountry skier, climber, and mountain biker.

How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod

This post is co-written with Ken Tsui, Edward Tsoi and Mickey Yip from Apoidea Group.

The banking industry has long struggled with the inefficiencies associated with repetitive processes such as information extraction, document review, and auditing. These tasks, which require significant human resources, slow down critical operations such as Know Your Customer (KYC) procedures, loan applications, and credit analysis. As a result, banks face operational challenges, including limited scalability, slow processing speeds, and high costs associated with staff training and turnover.

To address these inefficiencies, the implementation of advanced information extraction systems is crucial. These systems enable the rapid extraction of data from various financial documents—including bank statements, KYC forms, and loan applications—reducing both manual errors and processing time. As such, information extraction technology is instrumental in accelerating customer onboarding, maintaining regulatory compliance, and driving the digital transformation of the banking sector, particularly in high-volume document processing tasks.

The challenges in document processing are compounded by the need for specialized solutions that maintain high accuracy while handling sensitive financial data such as banking statements, financial statements, and company annual reports. This is where Apoidea Group, a leading AI-focused FinTech independent software vendor (ISV) based in Hong Kong, has made a significant impact. By using cutting-edge generative AI and deep learning technologies, Apoidea has developed innovative AI-powered solutions that address the unique needs of multinational banks. Their flagship product, SuperAcc, is a sophisticated document processing service featuring a set of proprietary document understanding models capable of processing diverse document types such as bank statements, financial statements, and KYC documents.

SuperAcc has demonstrated significant improvements in the banking sector. For instance, the financial spreading process, which previously required 4–6 hours, can now be completed in just 10 minutes, with staff needing less than 30 minutes to review the results. Similarly, in small and medium-sized enterprise (SME) banking, the review process for multiple bank statements spanning 6 months—used to extract critical data such as sales turnover and interbank transactions—has been reduced to just 10 minutes. This substantial reduction in processing time not only accelerates workflows but also minimizes the risk of manual errors. By automating repetitive tasks, SuperAcc enhances both operational efficiency and accuracy, using Apoidea’s self-trained machine learning (ML) models to deliver consistent, high-accuracy results in live production environments. These advancements have led to an impressive return on investment (ROI) of over 80%, showcasing the tangible benefits of implementing SuperAcc in banking operations.

AI transformation in banking faces several challenges, primarily due to stringent security and regulatory requirements. Financial institutions demand banking-grade security, necessitating compliance with standards such as ISO 9001 and ISO 27001. Additionally, AI solutions must align with responsible AI principles to facilitate transparency and fairness. Integration with legacy banking systems further complicates adoption, because these infrastructures are often outdated compared to rapidly evolving tech landscapes. Despite these challenges, SuperAcc has been successfully deployed and trusted by over 10 financial services industry (FSI) clients, demonstrating its reliability, security, and compliance in real-world banking environments.

To further enhance the capabilities of specialized information extraction solutions, advanced ML infrastructure is essential. Amazon SageMaker HyperPod offers an effective solution for provisioning resilient clusters to run ML workloads and develop state-of-the-art models. SageMaker HyperPod accelerates the development of foundation models (FMs) by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 GPUs. Its resiliency features automatically monitor cluster instances, detecting and replacing faulty hardware automatically, allowing developers to focus on running ML workloads without worrying about infrastructure management.

Building on this foundation of specialized information extraction solutions and using the capabilities of SageMaker HyperPod, we collaborate with APOIDEA Group to explore the use of large vision language models (LVLMs) to further improve table structure recognition performance on banking and financial documents. In this post, we present our work and step-by-step code on fine-tuning the Qwen2-VL-7B-Instruct model using LLaMA-Factory on SageMaker HyperPod. Our results demonstrate significant improvements in table structure recognition accuracy and efficiency compared to the original base model and traditional methods, with particular success in handling complex financial tables and multi-page documents. Following the steps described in this post, you can also fine-tune your own model with domain-specific data to solve your information extraction challenges using the open source implementation.

Challenges in banking information extraction systems with multimodal models

Developing information extraction systems for banks presents several challenges, primarily due to the sensitive nature of documents, their complexity, and variety. For example, bank statement formats vary significantly across financial institutions, with each bank using unique layouts, different columns, transaction descriptions, and ways of presenting financial information. In some cases, documents are scanned with low quality and are poorly aligned, blurry, or faded, creating challenges for Optical Character Recognition (OCR) systems attempting to convert them into machine-readable text. Creating robust ML models is challenging due to the scarcity of clean training data. Current solutions rely on orchestrating models for tasks such as layout analysis, entity extraction, and table structure recognition. Although this modular approach addresses the issue of limited resources for training end-to-end ML models, it significantly increases system complexity and fails to fully use available information.

Models developed based on specific document features are inherently limited in their scope, restricting access to diverse and rich training data. This limitation results in upstream models, particularly those responsible for visual representation, lacking robustness. Furthermore, single-modality models fail to use the multi-faceted nature of information, potentially leading to less precise and accurate predictions. For instance, in table structure recognition tasks, models often lack the capability to reason about textual content while inferring row and column structures. Consequently, a common error is the incorrect subdivision of single rows or columns into multiple instances. Additionally, downstream models that heavily depend on upstream model outputs are susceptible to error propagation, potentially compounding inaccuracies introduced in earlier stages of processing.

Moreover, the substantial computational requirements of these multimodal systems present scalability and efficiency challenges. The necessity to maintain and update multiple models increases the operational burden, rendering large-scale document processing both resource-intensive and difficult to manage effectively. This complexity impedes the seamless integration and deployment of such systems in banking environments, where efficiency and accuracy are paramount.

The recent advances in multimodal models have demonstrated remarkable capabilities in processing complex visual and textual information. LVLMs represent a paradigm shift in document understanding, combining the robust textual processing capabilities of traditional language models with advanced visual comprehension. These models excel at tasks requiring simultaneous interpretation of text, visual elements, and their spatial relationships, making them particularly effective for financial document processing. By integrating visual and textual understanding into a unified framework, multimodal models offer a transformative approach to document analysis. Unlike traditional information extraction systems that rely on fragmented processing pipelines, these models can simultaneously analyze document layouts, extract text content, and interpret visual elements. This integrated approach significantly improves accuracy by reducing error propagation between processing stages while maintaining computational efficiency.

Advanced vision language models are typically pre-trained on large-scale multimodal datasets that include both image and text data. The pre-training process typically involves training the model on diverse datasets containing millions of images and associated text descriptions, sourced from publicly available datasets such as image-text pairs LAION-5B, Visual Question Answering (VQAv2.0), DocVQA, and others. These datasets provide a rich variety of visual content paired with textual descriptions, enabling the model to learn meaningful representations of both modalities. During pre-training, these models are trained using auto-regressive loss, where the model predicts the next token in a sequence given the previous tokens and the visual input. This approach allows the model to effectively align visual and textual features and generate coherent text responses based on the visual context. For image data specifically, modern vision-language models use pre-trained vision encoders, such as vision transformers (ViTs), as their backbone to extract visual features. These features are then fused with textual embeddings in a multimodal transformer architecture, allowing the model to understand the relationships between images and text. By pre-training on such diverse and large-scale datasets, these models develop a strong foundational understanding of visual content, which can be fine-tuned for downstream tasks like OCR, image captioning, or visual question answering. This pre-training phase is critical for enabling the model to generalize well across a wide range of vision-language tasks. The model architecture is illustrated in the following diagram.

visual language model
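As a reference for the auto-regressive objective described above (notation ours, not taken from any specific paper), the training loss can be written as:

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, \mathbf{v}\right)

where y_t is the t-th text token and \mathbf{v} denotes the visual features produced by the vision encoder.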

Fine-tuning vision-language models for visual document understanding tasks offers significant advantages due to their advanced architecture and pre-trained capabilities. The model’s ability to understand and process both visual and textual data makes it inherently well-suited for extracting and interpreting text from images. Through fine-tuning on domain-specific datasets, the model can achieve superior performance in recognizing text across diverse fonts, styles, and backgrounds. This is particularly valuable in banking applications, where documents often contain specialized terminology, complex layouts, and varying quality scans.

Moreover, fine-tuning these models for visual document understanding tasks allows for domain-specific adaptation, which is crucial for achieving high precision in specialized applications. The model’s pre-trained knowledge provides a strong foundation, reducing the need for extensive training data and computational resources. Fine-tuning also enables the model to learn domain-specific nuances, such as unique terminologies or formatting conventions, further enhancing its performance. By combining a model’s general-purpose vision-language understanding with task-specific fine-tuning, you can create a highly efficient and accurate information extraction system that outperforms traditional methods, especially in challenging or niche use cases. This makes vision-language models powerful tools for advancing visual document understanding technology in both research and practical applications.

Solution overview

LLaMA-Factory is an open source framework designed for training and fine-tuning large language models (LLMs) efficiently. It supports over 100 popular models, including LLaMA, Mistral, Qwen, Baichuan, and ChatGLM, and integrates advanced techniques such as LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and full-parameter fine-tuning. The framework provides a user-friendly interface, including a web-based tool called LlamaBoard, which allows users to fine-tune models without writing code. LLaMA-Factory also supports various training methods like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO), making it versatile for different tasks and applications.
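For reference, the low-rank adaptation idea behind LoRA (and its quantized variant QLoRA) can be summarized as learning a small update on top of frozen base weights:

W' = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

Only A and B are trained, which is what makes fine-tuning large models feasible on hardware with limited resources.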

The advantage of LLaMA-Factory lies in its efficiency and flexibility. It significantly reduces the computational and memory requirements for fine-tuning large models by using techniques like LoRA and quantization, enabling users to fine-tune models even on hardware with limited resources. Additionally, its modular design and integration of cutting-edge algorithms, such as FlashAttention-2 and GaLore, facilitate high performance and scalability. The framework also simplifies the fine-tuning process, making it accessible to both beginners and experienced developers. This democratization of LLM fine-tuning allows users to adapt models to specific tasks quickly, fostering innovation and application across various domains. The solution architecture is presented in the following diagram.

sagemaker hyperpod architecture

For the training infrastructure, we use SageMaker HyperPod for distributed training. SageMaker HyperPod provides a scalable and flexible environment for training and fine-tuning large-scale models. SageMaker HyperPod offers a comprehensive set of features that significantly enhance the efficiency and effectiveness of ML workflows. Its purpose-built infrastructure simplifies distributed training setup and management, allowing flexible scaling from single-GPU experiments to multi-GPU data parallelism and large model parallelism. The service’s shared file system integration with Amazon FSx for Lustre enables seamless data synchronization across worker nodes and Amazon Simple Storage Service (Amazon S3) buckets, while customizable environments allow tailored installations of frameworks and tools.

SageMaker HyperPod integrates with Slurm, a popular open source cluster management and job scheduling system, to provide efficient job scheduling and resource management, enabling parallel experiments and distributed training. The service also enhances productivity through Visual Studio Code connectivity, offering a familiar development environment for code editing, script execution, and Jupyter notebook experimentation. These features collectively enable ML practitioners to focus on model development while using the power of distributed computing for faster training and innovation.

Refer to our GitHub repo for a step-by-step guide on fine-tuning Qwen2-VL-7B-Instruct on SageMaker HyperPod.

We start with data preprocessing, using table images as input and HTML as output. We choose HTML as the output format because it is the most common way to represent tabular data in web applications. It is straightforward to parse and visualize, and it renders in most web browsers for manual review or modification if needed. The data preprocessing is critical for the model to learn the patterns of the expected output format and adapt to the visual layout of the table. The following is one example of an input image and the output HTML ground truth.

sample financial statement table

<table>
  <tr>
    <td></td>
    <td colspan="5">Payments due by period</td>
  </tr>
  <tr>
    <td></td><td>Total</td><td>Less than 1 year</td><td>1-3 years</td><td>3-5 years</td><td>More than 5 years</td>
  </tr>
  <tr>
    <td>Operating Activities:</td><td></td><td></td><td></td><td></td><td></td>
  </tr>
… … …
… … …
 <tr>
    <td>Capital lease obligations<sup> (6)</sup></td><td>48,771</td><td>8,320</td><td>10,521</td><td>7,371</td><td>22,559</td>
  </tr>
  <tr>
    <td>Other<sup> (7) </sup></td><td>72,734</td><td>20,918</td><td>33,236</td><td>16,466</td><td>2,114</td>
  </tr>
  <tr>
    <td>Total</td><td>$16,516,866</td><td>$3,037,162</td><td>$5,706,285</td><td>$4,727,135</td><td>$3,046,284</td>
  </tr>
</table>
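To make the preprocessing step concrete, the following is a minimal sketch of how an image and its HTML ground truth could be paired into a training record. The field names, paths, and prompt are illustrative only; take the exact dataset schema expected by LLaMA-Factory from the preprocessing code in the accompanying GitHub repository.

# Illustrative sketch: pair a table image with its ground-truth HTML to build
# one training sample. Field names, paths, and the prompt are placeholders;
# follow the repository's preprocessing script for the exact dataset schema.
import json
from pathlib import Path

def build_record(image_path: str, html_path: str) -> dict:
    html = Path(html_path).read_text(encoding="utf-8")
    return {
        "images": [image_path],
        "conversations": [
            {"role": "user",
             "content": "<image>Extract the table in this image as HTML."},
            {"role": "assistant", "content": html},
        ],
    }

records = [build_record("tables/sample_001.png", "tables/sample_001.html")]
Path("train_data.json").write_text(json.dumps(records, indent=2), encoding="utf-8")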

We then use LLaMA-Factory to fine-tune the Qwen2-VL-7B-Instruct model on the preprocessed data. We use Slurm sbatch to orchestrate the distributed training script. An example of the script would be submit_train_multinode.sh. The training script uses QLoRA and data parallel distributed training on SageMaker HyperPod. Following the guidance provided, you can monitor the training log output to confirm the job is progressing.

During inference, we use vLLM for hosting the quantized model, which provides efficient memory management and optimized attention mechanisms for high-throughput inference. vLLM natively supports the Qwen2-VL series model and continues to add support for newer models, making it particularly suitable for large-scale document processing tasks. The deployment process involves applying 4-bit quantization to reduce model size while maintaining accuracy, configuring the vLLM server with optimal parameters for batch processing and memory allocation, and exposing the model through RESTful APIs for quick integration with existing document processing pipelines. For details on model deployment configuration, refer to the hosting script.
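As an illustration of this integration pattern, a minimal client sketch might look like the following. The endpoint, served model name, file name, and prompt are placeholders, and the exact details depend on how you launch the vLLM server, which exposes an OpenAI-compatible API.

# Minimal sketch of calling a vLLM server hosting the fine-tuned Qwen2-VL model
# through its OpenAI-compatible chat completions API. The base URL, model name,
# and image file are hypothetical placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("table_page.png", "rb") as f:  # placeholder document image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2-vl-7b-instruct-table",  # placeholder served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Extract the table in this image as HTML."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)  # generated HTML table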

Results

Our evaluation focused on the FinTabNet dataset, which contains complex tables from S&P 500 annual reports. This dataset presents unique challenges due to its diverse table structures, including merged cells, hierarchical headers, and varying layouts. The following example demonstrates a financial table and its corresponding model-generated HTML output, rendered in a browser for visual comparison.

result

For quantitative evaluation, we employed the Tree Edit Distance-based Similarity (TEDS) metric, which assesses both structural and content similarity between generated HTML tables and the ground truth. TEDS is derived from the minimum number of edit operations required to transform one tree structure into another, normalized by tree size, and TEDS-S focuses specifically on structural similarity. The following table summarizes the results for different models.

Model TEDS TEDS-S
Anthropic’s Claude 3 Haiku 69.9 76.2
Anthropic’s Claude 3.5 Sonnet 86.4 87.1
Qwen2-VL-7B-Instruct (Base) 23.4 25.3
Qwen2-VL-7B-Instruct (Fine-tuned) 81.1 89.7
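For reference, the TEDS score reported above is commonly defined in terms of the tree edit distance between the predicted table tree T_p and the ground-truth tree T_g:

\mathrm{TEDS}(T_p, T_g) = 1 - \frac{\mathrm{EditDist}(T_p, T_g)}{\max\left(|T_p|, |T_g|\right)}

where |T| is the number of nodes in a tree; scores closer to 1 indicate closer agreement with the ground truth (the values in the table are reported on a 0–100 scale).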

The evaluation results reveal significant advancements in our fine-tuned model’s performance. Most notably, the Qwen2-VL-7B-Instruct model demonstrated substantial improvements in both content recognition and structural understanding after fine-tuning. When compared to its base version, the model showed enhanced capabilities in accurately interpreting complex table structures and maintaining content fidelity. The fine-tuned version not only surpassed the performance of Anthropic’s Claude 3 Haiku, but also approached the accuracy levels of Anthropic’s Claude 3.5 Sonnet, while maintaining more efficient computational requirements. Particularly impressive was the model’s improved ability to handle intricate table layouts, suggesting a deeper understanding of document structure and organization. These improvements highlight the effectiveness of our fine-tuning approach in adapting the model to specialized financial document processing tasks.

Best practices

Based on our experiments, we identified several key insights and best practices for fine-tuning multimodal table structure recognition models:

  • Model performance is highly dependent on the quality of fine-tuning data. The closer the fine-tuning data resembles real-world datasets, the better the model performs. Using domain-specific data, we achieved a 5-point improvement in TEDS score with only 10% of the data compared to using general datasets. Notably, fine-tuning doesn’t require massive datasets; we achieved relatively good performance with just a few thousand samples. However, we observed that imbalanced datasets, particularly those lacking sufficient examples of complex elements like long tables and forms with merged cells, can lead to biased performance. Maintaining a balanced distribution of document types during fine-tuning facilitates consistent performance across various formats.
  • The choice of base model significantly impacts performance. More powerful base models yield better results. In our case, Qwen2-VL’s pre-trained visual and linguistic features provided a strong foundation. By freezing most parameters through QLoRA during the initial fine-tuning stages, we achieved faster convergence and better usage of pre-trained knowledge, especially with limited data. Interestingly, the model’s multilingual capabilities were preserved; fine-tuning on English datasets alone still yielded good performance on Chinese evaluation datasets. This highlights the importance of selecting a compatible base model for optimal performance.
  • When real-world annotated data is limited, synthetic data generation (using specific document data synthesizers) can be an effective solution. Combining real and synthetic data during fine-tuning helps mitigate out-of-domain issues, particularly for rare or domain-specific text types. This approach proved especially valuable for handling specialized financial terminology and complex document layouts.

Security

Another important aspect of our project involves addressing the security considerations essential when working with sensitive financial documents. As expected in the financial services industry, robust security measures must be incorporated throughout the ML lifecycle. These typically include data security through encryption at rest using AWS Key Management Service (AWS KMS) and in transit using TLS, implementing strict S3 bucket policies with virtual private cloud (VPC) endpoints, and following least-privilege access controls through AWS Identity and Access Management (IAM) roles. For training environments like SageMaker HyperPod, security considerations involve operating within private subnets in dedicated VPCs using the built-in encryption capabilities of SageMaker. Secure model hosting with vLLM requires deployment in private VPC subnets with proper Amazon API Gateway protections and token-based authentication. These security best practices for financial services make sure that sensitive financial information remains protected throughout the entire ML pipeline while enabling innovative document processing solutions in highly regulated environments.

Conclusion

Our exploration of multi-modality models for table structure recognition in banking documents has demonstrated significant improvements in both accuracy and efficiency. The fine-tuned Qwen2-VL-7B-Instruct model, trained using LLaMA-Factory on SageMaker HyperPod, has shown remarkable capabilities in handling complex financial tables and diverse document formats. These results highlight how multimodal approaches represent a transformative leap forward from traditional multistage and single modality methods, offering an end-to-end solution for modern document processing challenges.

Furthermore, using LLaMA-Factory on SageMaker HyperPod significantly streamlines the fine-tuning process, making it both more efficient and accessible. The scalable infrastructure of SageMaker HyperPod enables rapid experimentation by allowing seamless scaling of training resources. This capability facilitates faster iteration cycles, enabling researchers and developers to test multiple configurations and optimize model performance more effectively.

Explore our GitHub repository to access the implementation and step-by-step guidance, and begin customizing models for your specific requirements. Whether you’re processing financial statements, KYC documents, or complex reports, we encourage you to evaluate its potential for optimizing your document workflows.


About the Authors

Tony Wong is a Solutions Architect at AWS based in Hong Kong, specializing in financial services. He works with FSI customers, particularly in banking, on digital transformation journeys that address security and regulatory compliance. With an entrepreneurial background and experience as a Solutions Architect Manager at a local system integrator, Tony applies problem management skills in enterprise environments. He holds an M.Sc. from The Chinese University of Hong Kong and is passionate about leveraging new technologies like Generative AI to help organizations enhance business capabilities.

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Zhihao Lin is a Deep Learning Architect at the AWS Generative AI Innovation Center. With a Master’s degree from Peking University and publications in top conferences like CVPR and IJCAI, he brings extensive AI/ML research experience to his role. At AWS, he focuses on developing generative AI solutions, leveraging cutting-edge technology for innovative applications. He specializes in solving complex computer vision and natural language processing challenges and advancing the practical use of generative AI in business.

Ken Tsui, VP of Machine Learning at Apoidea Group, is a seasoned machine learning engineer with over a decade of experience in applied research and B2B and B2C AI product development. Specializing in language models, computer vision, data curation, synthetic data generation, and distributed training, he also excels in credit scoring and stress-testing. As an active open-source researcher, he contributes to large language model and vision-language model pretraining and post-training datasets.

Edward Tsoi Po Wa is a Senior Data Scientist at Apoidea Group. Passionate about artificial intelligence, he specializes in machine learning, working on projects such as document intelligence systems, large language model R&D, and Retrieval-Augmented Generation applications. Edward drives impactful AI solutions, optimizing systems for industries like banking. He holds a B.S. in Physics from Hong Kong University of Science and Technology. In his spare time, he loves to explore science, mathematics, and philosophy.

Mickey Yip is the Vice President of Product at Apoidea Group, where he utilizes his expertise to spearhead groundbreaking AI and digital transformation initiatives. With extensive experience, Mickey has successfully led complex projects for multinational banks, property management firms, and global corporations, delivering impactful and measurable outcomes. His expertise lies in designing and launching innovative AI SaaS products tailored for the banking sector, significantly improving operational efficiency and enhancing client success.

How Qualtrics built Socrates: An AI platform powered by Amazon SageMaker and Amazon Bedrock

This post is co-authored by Jay Kshirsagar and Ronald Quan from Qualtrics. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Qualtrics, founded in 2002, is a pioneering software company that has spent over two decades creating exceptional frontline experiences, building high-performing teams, and designing products that people love. As the creators and stewards of the Experience Management (XM) category, Qualtrics serves over 20,000 clients globally, bringing humanity, connection, and empathy back to businesses across various industries, including retail, government, and healthcare.

Qualtrics’s comprehensive XM platform enables organizations to consistently understand, measure, and improve the experiences they deliver for customers, employees, and the broader market. With its three core product suites—XM for Customer Experience, XM for Employee Experience, and XM for Research & Strategy—Qualtrics provides actionable insights and purpose-built solutions that empower companies to deliver exceptional experiences.

Qualtrics harnesses the power of generative AI, cutting-edge machine learning (ML), and the latest in natural language processing (NLP) to provide new purpose-built capabilities that are precision-engineered for experience management (XM). These AI capabilities are purpose-built to help organizations of all sizes deeply understand and address the needs of every customer, employee, and stakeholder—driving stronger connections, increased loyalty, and sustainable growth.

In this post, we share how Qualtrics built an AI platform powered by Amazon SageMaker and Amazon Bedrock.

AI at Qualtrics

Qualtrics has a deep history of using advanced ML to power its industry-leading experience management platform. Early 2020, with the push for deep learning and transformer models, Qualtrics created its first enterprise-level ML platform called Socrates. Built on top of SageMaker, this new platform enabled ML scientists to efficiently build, test, and deliver new AI-powered capabilities for the Qualtrics XM suite. This strong foundation in ML and AI has been a key driver of Qualtrics’s innovation in experience management.

Qualtrics AI, a powerful engine that sits at the heart of the company’s XM platform, harnesses the latest advances in ML, NLP, and AI. Trained on Qualtrics’s expansive database of human sentiment and experience data, Qualtrics AI unlocks richer, more personalized connections between organizations and their customers, employees, and stakeholders. Qualtrics’s unwavering commitment to innovation and customer success has solidified its position as the global leader in experience management.

To learn more about how AI is transforming experience management, visit this blog from Qualtrics.

Socrates platform: Powering AI at Qualtrics

Qualtrics AI is powered by a custom-built  ML platform, a synergistic suite of tools and services designed to enable a diverse set of Qualtrics personae—researchers, scientists, engineers, and knowledge workers—to harness the transformative power of AI and ML.  Qualtrics refers to it internally as the “Socrates” platform. It uses managed AWS services like SageMaker and Amazon Bedrock to enable the entire ML lifecycle. Knowledge workers can source, explore, and analyze Qualtrics data using Socrates’s ML workbenches and AI Data Infrastructure. Scientists and researchers are enabled to conduct research, prototype, develop, and train models using a host of SageMaker features. ML engineers can test, productionize, and monitor a heterogeneous set of ML models possessing a wide range of capabilities, inference modes, and production traffic patterns. Partner application teams are provided with an abstracted model inference interface that makes the integration of an ML model into the Qualtrics product a seamless engineering experience. This holistic approach enables internal teams to seamlessly integrate advanced AI and ML capabilities into their workflows and decision-making processes.

Science Workbench

The Socrates Science Workbench, purpose-built for Qualtrics Data and Knowledge Workers, provides a powerful platform for model training and hyperparameter optimization (HPO) with a JupyterLab interface, support for a range of programming languages, and secure, scalable infrastructure through SageMaker integration, giving users the flexibility and reliability to focus on their core ML tasks. Users can take advantage of the robust and reliable infrastructure of SageMaker to maintain the confidentiality and integrity of their data and models, while also taking advantage of the scalability that SageMaker provides to handle even the most demanding ML workloads.

AI Data Infrastructure

Socrates’s AI Data Infrastructure is a comprehensive and cohesive end-to-end ML data ecosystem. It features a secure and scalable data store integrated with the Socrates Science Workbench, enabling users to effortlessly store, manage, and share datasets with capabilities for anonymization, schematization, and aggregation. The AI Data Infrastructure also provides scientists with interfaces for distributed compute, data pulls and enrichment, and ML processing.

AI Playground

The AI Playground is a user-friendly interface that provides Socrates users with direct access to the powerful language models and other generative AI capabilities hosted on the Socrates platform using backend tools like SageMaker Inference, Amazon Bedrock, and OpenAI GPT, allowing them to experiment and rapidly prototype new ideas without extensive coding or technical expertise. By continuously integrating the latest models, the AI Playground empowers Socrates users to stay at the forefront of advancements in large language models (LLMs) and other cutting-edge generative AI technologies, exploring their potential and discovering new ways to drive innovation.

Model deployment for inference

The Socrates platform features a sophisticated model deployment infrastructure that is essential for the scalable implementation of ML and AI models. This infrastructure allows users to host models across the variety of hardware options available for SageMaker endpoints, providing the flexibility to select a deployment environment that optimally meets their specific needs for inference, whether those needs are related to performance optimization, cost-efficiency, or particular hardware requirements.

One of the defining characteristics of the Socrates model deployment infrastructure is its capability to simplify the complexities of model hosting. This allows users to concentrate on the essential task of deploying their models for inference within the larger Socrates ecosystem. Users benefit from an efficient and user-friendly interface that enables them to effortlessly package their models, adjust deployment settings, and prepare them for inference use.

By offering an adaptable model deployment solution, the Socrates platform makes sure ML models created within the system are smoothly integrated into real-world applications and workflows. This integration not only speeds up the transition to production but also maximizes the usage of Qualtrics’s AI-driven features, fostering innovation and providing significant business value to its customers.

Model capacity management

Model capacity management is a critical component that offers efficient and reliable delivery of ML models to Qualtrics users by providing oversight of model access and the allocation of computing resources across multiple consumers. The Socrates team closely monitors resource usage and sets up rate limiting and auto scaling policies, where applicable, to meet the evolving demands of each use case.

Unified GenAI Gateway

The Socrates platform’s Unified GenAI Gateway simplifies and streamlines access to LLMs and embedding models across the Qualtrics ecosystem. The Unified GenAI Gateway is an API that provides a common interface for consumers to interact with all of the platform-supported LLMs and embedding models, regardless of their underlying providers or hosting environments. This means that Socrates users can use the power of cutting-edge language models without having to worry about the complexities of integrating with multiple vendors or managing self-hosted models.

The standout feature of the Unified GenAI Gateway is its centralized integration with inference platforms like SageMaker Inference and Amazon Bedrock, which allows the Socrates team to handle the intricate details of model access, authentication, and attribution on behalf of users. This not only simplifies the user experience but also enables cost attribution and control mechanisms, making sure the consumption of these powerful AI resources is carefully monitored and aligned with specific use cases and billing codes. Furthermore, the Unified GenAI Gateway boasts capabilities like rate-limiting support, making sure the system’s resources are efficiently allocated, and an upcoming semantic caching feature that will further optimize model inference and enhance overall performance.

Managed Inference APIs (powered by SageMaker Inference)

The Socrates Managed Inference APIs provide a comprehensive suite of services that simplify the integration of advanced ML and AI capabilities into Qualtrics applications. This infrastructure, built on top of SageMaker Inference, handles the complexities of model deployment, scaling, and maintenance, boasting a growing catalog of production-ready models.

Managed Inference APIs offer both asynchronous and synchronous modes to accommodate a wide range of application use cases. Importantly, these managed APIs come with guaranteed production-level SLAs, providing reliable performance and cost-efficiency as usage scales. With readily available pre-trained Qualtrics models for inference, the Socrates platform empowers Qualtrics application teams to focus on delivering exceptional user experiences, without the burden of building and maintaining AI infrastructure.

GenAI Orchestration Framework

Socrates’s GenAI Orchestration Framework is a collection of tools and patterns designed to streamline the development and deployment of LLM-powered applications within the Qualtrics ecosystem. The framework consists of tools such as the following:

  • Socrates Agent Platform, built on top of LangGraph Platform, providing a flexible orchestration framework to develop agents as graphs, which expedites delivery of agentic features while centralizing core infrastructure and observability components
  • A GenAI SDK, providing straightforward coding convenience for interacting with LLMs and third-party orchestration packages
  • Prompt Lifecycle Management Service (PLMS) for maintaining the security and governance of prompts
  • LLM guardrail tooling, enabling LLM consumers to define the protections they want applied to their model inference
  • Synchronous and asynchronous inference gateways

These tools all contribute to the overall reliability, scalability, and performance of the LLM-powered applications built on the framework. Capabilities of the Socrates AI App Framework are anticipated to grow and evolve alongside the rapid advancements in the field of LLMs. This means that Qualtrics users always have access to the latest and most cutting-edge AI capabilities from generative AI inference platforms like SageMaker Inference and Amazon Bedrock, empowering them to harness the transformative power of these technologies with greater ease and confidence.
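
As a minimal sketch of the agents-as-graphs pattern that the Socrates Agent Platform builds on, the following example uses the open source LangGraph library to wire two placeholder nodes into a graph. The state shape, node names, and logic are illustrative assumptions, not Qualtrics’s implementation.

```python
# Minimal agents-as-graphs sketch using the open source LangGraph library.
# Node names, state shape, and logic are illustrative assumptions only.
from typing import TypedDict
from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    question: str
    answer: str


def retrieve(state: AgentState) -> AgentState:
    # Placeholder retrieval step; a real node might query a vector store.
    return {**state, "answer": f"context for: {state['question']}"}


def respond(state: AgentState) -> AgentState:
    # Placeholder generation step; a real node would call an LLM, for example
    # through the Unified GenAI Gateway described earlier.
    return {**state, "answer": f"LLM answer using {state['answer']}"}


graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("respond", respond)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "respond")
graph.add_edge("respond", END)

app = graph.compile()
print(app.invoke({"question": "What drives customer churn?", "answer": ""}))
```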

Ongoing enhancements to the Socrates platform using SageMaker Inference

As the Socrates platform continues to evolve, Qualtrics is continuously integrating the latest advancements in SageMaker Inference to further enhance the capabilities of their AI-powered ecosystem:

  • Improved cost, performance, and usability of generative AI inference – One prominent area of focus is the integration of cost and performance optimizations for generative AI inference. The SageMaker Inference team has launched innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components (a minimal deployment sketch follows this list). Using this feature, we’re working to achieve significant cost savings and performance improvements for Qualtrics customers running their generative AI workloads on the Socrates platform. In addition, SageMaker has streamlined deployment of open source LLMs and FMs with just three clicks. This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more Qualtrics customers to harness the power of generative AI within their workflows and applications.
  • Improved auto scaling speeds – The SageMaker team has developed an advanced auto scaling capability to better handle the scaling requirements of generative AI models. These improvements significantly reduce scaling times (from multiple minutes to under a minute), cutting auto scaling times by up to 40% and speeding auto scaling detection by six times for Meta Llama 3 8B, enabling Socrates users to rapidly scale their generative AI workloads on SageMaker to meet spikes in demand without compromising performance.
  • Straightforward deployment of self-managed OSS LLMs – A new SageMaker Inference capability provides a more streamlined and intuitive process for packaging generative AI models, reducing the technical complexity traditionally associated with this task. This, in turn, empowers a wider range of Socrates users, including application teams and subject matter experts, to use the transformative power of these cutting-edge AI technologies within their workflows and decision-making processes.
  • Generative AI inference optimization toolkit – Qualtrics is also actively using the latest advancements in the SageMaker Inference optimization toolkit within the Socrates platform, which offers up to two times higher throughput while reducing costs by up to 50% for generative AI inference. By integrating these capabilities, Socrates is working on lowering the cost of generative AI inference. This breakthrough is particularly impactful for Qualtrics’s customers, who rely on the Socrates platform to power AI-driven applications and experiences.
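
For reference, the following is a minimal sketch of deploying a model as a SageMaker inference component, the feature referenced in the first bullet that lets several models share accelerators on one endpoint. The component, endpoint, and model names, as well as the resource sizes, are hypothetical placeholders.

```python
# Minimal sketch (assumed names and sizes) of packing a model onto a SageMaker
# endpoint as an inference component so multiple models can share accelerators.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_component(
    InferenceComponentName="llama3-8b-component",    # hypothetical component name
    EndpointName="socrates-shared-gpu-endpoint",     # hypothetical existing endpoint
    VariantName="AllTraffic",
    Specification={
        "ModelName": "llama3-8b-model",              # hypothetical registered model
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    RuntimeConfig={"CopyCount": 1},                  # copies scale independently per component
)
```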

“By seamlessly integrating SageMaker Inference into our Socrates platform, we’re able to deliver inference advancements in AI to our global customer base. The generative AI inference capabilities in SageMaker like inference components, faster auto scaling, easy LLM deployment, and the optimization toolkit have been a game changer for Qualtrics to reduce the cost and improve the performance for our generative AI workloads. The level of sophistication and ease of use that SageMaker Inference brings to the table is remarkable.”

– James Argyropoulos, Sr AI/ML Engineer at Qualtrics.

Partnership with SageMaker Inference

Since adopting SageMaker Inference, the Qualtrics Socrates team has been a key collaborator in the development of AI capabilities in SageMaker Inference. Building on its expertise in serving Socrates users, Qualtrics has worked closely with the SageMaker Inference team to enhance and expand the platform’s generative AI functionalities. From the early stages of generative AI, they offered invaluable insights and expertise to the SageMaker team. This has enabled the introduction of several new features and optimizations that have strengthened the platform’s generative AI offerings, including:

  • Cost and performance optimizations for generative AI inference – Qualtrics helped the SageMaker Inference team build a new inference capability that reduces FM deployment costs by 50% on average and latency by 20% on average with inference components. This feature delivers significant cost savings and performance improvements for customers running generative AI inference on SageMaker.
  • Faster auto scaling for generative AI inference – Qualtrics helped the SageMaker team develop faster auto scaling for generative AI inference. These improvements have reduced auto scaling times by up to 40% for models like Meta Llama 3 and made auto scaling detection six times faster. With this, generative AI inference can scale with changing traffic without compromising performance.
  • Inference optimization toolkit for generative AI inference – Qualtrics has been instrumental in providing the feedback that helped AWS launch the inference optimization toolkit, which increases throughput by up to two times and reduces latency by 50%.
  • Launch of multi-model endpoint (MME) support for GPU – MMEs allow customers to reduce inference costs by up to 90%. Qualtrics was instrumental in helping AWS with the launch of this feature by providing valuable feedback.
  • Launch of asynchronous inference – Qualtrics was a launch partner for asynchronous inference and has played a key role in helping AWS improve the offering to give customers optimal price-performance.

The partnership between Qualtrics and the SageMaker Inference team has been instrumental in advancing the state-of-the-art in generative AI within the AWS ecosystem. Qualtrics’s deep domain knowledge and technical proficiency have played a crucial role in shaping the evolution of this rapidly developing field on SageMaker Inference.

“Our partnership with the SageMaker Inference product team has been instrumental in delivering incredible performance and cost benefits for Socrates platform consumers running AI Inference workloads. By working hand in hand with the SageMaker team, we’ve been able to introduce game changing optimizations that have reduced AI inference costs multiple folds for some of our use cases. We look forward to continued innovation through valuable partnership to improve state-of-the-art AI inference capabilities.”

–  Jay Kshirsagar, Senior Manager, Machine Learning

Conclusion

The Socrates platform underscores Qualtrics’s commitment to advancing innovation in experience management by flawlessly integrating advanced AI and ML technologies. Thanks to a strong partnership with the SageMaker Inference team, the platform has seen enhancements that boost performance, reduce costs, and increase the accessibility of AI-driven features within the Qualtrics XM suite. As AI technology continues to develop rapidly, the Socrates platform is geared to empower Qualtrics’s AI teams to innovate and deliver exceptional customer experiences.


About the Authors

Jay Kshirsagar is a seasoned ML leader driving GenAI innovation and scalable AI infrastructure at Qualtrics. He has built high-impact ML teams and delivered enterprise-grade LLM solutions that power key product features.

Ronald Quan is a Staff Engineering Manager for the Data Intelligence Platform team within Qualtrics. The team’s charter is to enable, expedite and evolve AI and Agentic developments on the Socrates platform. He focuses on the team’s technical roadmap and strategic alignment with the business needs.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Micheal Nguyen is a Senior Startup Solutions Architect at AWS, specializing in using AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Ranga Malaviarachchi is a Sr. Customer Solutions Manager in the ISV Strategic Accounts organization at AWS. He has been closely associated with Qualtrics over the past 4 years in supporting their AI initiatives. Ranga holds a BS in Computer Science and Engineering and an MBA from Imperial College London.

Read More

Vxceed secures transport operations with Amazon Bedrock

Vxceed secures transport operations with Amazon Bedrock

Vxceed delivers SaaS solutions across industries such as consumer packaged goods (CPG), transportation, and logistics. Its modular environments include Lighthouse for CPG demand and supply chains, GroundCentric247 for airline and airport operations, and LimoConnect247 and FleetConnect247 for passenger transport. These solutions support a wide range of customers, including government agencies in Australia and New Zealand.

In 2024, Vxceed launched a strategy to integrate generative AI into its solutions, aiming to enhance customer experiences and boost operational efficiency. As part of this initiative, Vxceed developed LimoConnectQ using Amazon Bedrock and AWS Lambda. This solution enables efficient document searching, simplifies trip booking, and enhances operational decisions while maintaining data security and protection.

The challenge: Balancing innovation with security

Vxceed’s customers include government agencies responsible for transporting high-profile individuals, such as judiciary members and senior officials. These agencies require highly secure systems that adhere to standards like the Information Security Registered Assessors Program (IRAP), used by the Australian government to assess security posture.

Government agencies and large corporations that handle secure ground transportation face a unique challenge: providing seamless, efficient, and secure operations while adhering to strict regulatory requirements. Vxceed Technologies, a software-as-a-service (SaaS) provider specializing in ground transportation and resource planning, recognized an opportunity to enhance its LimoConnect solution with generative AI. Vxceed initially explored various AI solutions but faced a critical hurdle: verifying that customer data remained within their dedicated private environments. Existing AI offerings often processed data externally, posing security risks that their clients could not accept.

Vxceed needed AI capabilities that could function within a highly controlled environment, helping to ensure complete data privacy while enhancing operational efficiency.

This challenge led Vxceed to use Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API.

LimoConnect Q solution overview and implementation highlights

To address the challenges of secure, efficient, and intelligent ground transportation management, Vxceed developed LimoConnect Q, an AI-powered solution. LimoConnect Q’s architecture uses Amazon Bedrock, Amazon API Gateway, Amazon DynamoDB, and AWS Lambda to create a secure, scalable AI-powered transportation management system. The solution implements a multi-agent architecture, shown in the following figure, where each component operates within the customer’s private AWS environment, maintaining data security, scalability, and intuitive user interactions.

Figure 1 – Vxceed’s LimoConnect Q architecture

Let’s dive further into each component in this architecture:

Conversational trip booking with intelligent orchestration using Amazon Bedrock Agents

Beyond document queries, LimoConnect Q revolutionizes trip booking by replacing traditional forms and emails with a conversational AI-driven process. Users can state their trip requirements in natural language. Key features include:

  • Natural language: Processes natural language booking requests based on travel context and preferences, for example:
    • Schedule airport pickup for dignitaries at 9 AM tomorrow to the conference center.
    • Book airport to my office transfer next Monday at 10 AM.
  • Automated data retrieval and processing: LimoConnect Q integrates with multiple data sources to:
    • Validate pickup and drop-off locations using geolocation services
    • Automate address geocoding and external API lookups to verify accurate bookings
    • Verify vehicle and driver eligibility through Amazon Bedrock Agents
    • Retrieve relevant trip details from past bookings and preferences
  • Seamless booking execution: After the request is processed, LimoConnect Q automatically:
    • Confirms the trip
    • Provides personalized recommendations based on booking history
    • Sends real-time booking updates and notifies relevant personnel (for example, drivers and dispatch teams)

This conversational approach minimizes manual processing, reduces booking errors, and enhances user convenience—especially for busy professionals who need a fast, frictionless way to arrange transportation.
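
A minimal sketch of this conversational flow is shown below, sending a natural language booking request to an Amazon Bedrock agent with boto3. The agent ID, alias ID, and session ID are placeholders, and the snippet is illustrative rather than Vxceed’s production code.

```python
# Minimal sketch (assumed IDs) of sending a natural language booking request
# to an Amazon Bedrock agent and reading back the streamed completion.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",            # hypothetical agent ID
    agentAliasId="AGENT_ALIAS_PLACEHOLDER",    # hypothetical agent alias ID
    sessionId="user-123-session-1",            # hypothetical session ID
    inputText="Schedule airport pickup for dignitaries at 9 AM tomorrow to the conference center.",
)

# The completion is returned as an event stream of text chunks.
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)
```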

Secure RAG for policy and document querying using Amazon Bedrock Knowledge Bases

One of the most critical functionalities of LimoConnect Q is the ability to query policy documents, procedural manuals, and operational guidelines in natural language. Traditionally, accessing such information required manual searches or expert assistance, creating inefficiencies—especially when expert staff aren’t available.

Vxceed addressed these challenges by implementing a Retrieval Augmented Generation (RAG) framework. This system generates responses that align with policies, incorporate relevant facts, and consider context. The solution delivers the ability to:

  • Query documents in natural language: Instead of searching manually, users can ask questions like What is the protocol for VIP pickup at the airport?
  • Restrict AI-generated responses based on RAG: Use RAG to make sure that answers are pulled only from approved, up-to-date documents, maintaining security and compliance.
  • Keep sensitive data within the customer’s environment: LimoConnect Q maintains data privacy and compliance by keeping queries within the customer’s private AWS environment, providing end-to-end security.

This capability significantly improves operational efficiency, allowing users to get instant, reliable answers instead of relying on manual lookups or expert availability.
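
The following sketch shows one way to issue such a grounded query against an Amazon Bedrock knowledge base with boto3. The knowledge base ID is a placeholder, and the model ARN assumes Anthropic’s Claude 3.5 Sonnet, which Vxceed uses elsewhere in the solution; the snippet is illustrative rather than Vxceed’s production code.

```python
# Minimal sketch (assumed knowledge base ID) of a RAG query that grounds the
# answer only in approved documents indexed in an Amazon Bedrock knowledge base.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is the protocol for VIP pickup at the airport?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",  # hypothetical knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
        },
    },
)
print(response["output"]["text"])
```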

Multi-agent AI architecture for secure orchestration

Vxceed built a multi-agent AI system on Lambda to manage LimoConnect Q’s transportation workflows. The architecture comprises agents that handle dispatch, routing, and scheduling tasks while maintaining security and scalability.

  • Intent recognition agent: Determines whether a user request pertains to document retrieval, trip booking, or another function.
  • Document retrieval agent: Handles policy queries using RAG-based retrieval.
  • Trip booking agent: Processes user inputs, extracting key information such as pickup and drop-off locations, time, vehicle type, passenger count, and special requests. It verifies that booking information is provided, including name, contact details, and trip preferences. The agent validates addresses using geolocation APIs for accuracy before proceeding. The agent then checks vehicle and driver availability by querying the fleet management database, retrieving real-time data on approved resources. It also interacts with a user preference database, using vector-based search to suggest personalized options.
  • Flight information validation agent: Verifies flight schedules.
  • Trip duplication agent: Checks for previously booked trips with similar details to help avoid duplicate bookings.
  • Return trip agent: Analyzes past trips and preferences to recommend suitable return options, considering real-time vehicle availability and driver schedules.
  • Data validation agent: Verifies security policy compliance.
  • External API agent: Integrates with third-party services such as geolocation services, scheduling interfaces, and transportation databases, providing real-time data updates for optimized trip coordination.
  • Booking retrieval agent: Helps users retrieve existing bookings or cancel them, querying the backend database for current and past trips.

After validation, LimoConnect Q uses Lambda functions and Amazon Bedrock integrated APIs to process bookings, update databases, and manage notifications to drivers and dispatch teams. The modular architecture enables Vxceed to seamlessly add new features like driver certification tracking and compliance automation.

Built with security at its core, LimoConnect Q uses Lambda for efficient handling of query spikes while implementing robust memory isolation mechanisms. Each user session maintains temporary memory for contextual conversations without permanent storage, and strict access controls ensure session-specific data isolation, preventing cross-contamination of sensitive information. This architecture adheres to the stringent security requirements of government and enterprise customers while maintaining operational efficiency.
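
The following is an illustrative-only Lambda handler showing how an intent recognition step could route a request to either the document retrieval or the trip booking flow. The function names and the keyword-based stub are assumptions; in LimoConnect Q, this classification is performed by an LLM-based intent recognition agent.

```python
# Illustrative-only Lambda handler sketching intent-based routing between the
# document retrieval and trip booking flows. Names and logic are assumptions,
# not Vxceed's implementation.
import json


def route_intent(user_text: str) -> str:
    # Keyword stub standing in for the LLM-based intent recognition agent.
    if any(word in user_text.lower() for word in ("book", "pickup", "transfer")):
        return "trip_booking"
    return "document_retrieval"


def lambda_handler(event, context):
    user_text = json.loads(event["body"])["text"]
    intent = route_intent(user_text)
    # Downstream agents (booking, retrieval, validation) would be invoked here.
    return {
        "statusCode": 200,
        "body": json.dumps({"intent": intent}),
    }
```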

Using LimoConnect Q, customers have saved an average of 15 minutes per query, increased first-call resolution rates by 80 percent, and cut onboarding and training time by 50 percent.

Guardrails

LimoConnect Q uses Amazon Bedrock Guardrails to maintain professional, focused interactions. The system uses denied topics and word filters to help prevent unrelated discussions and unprofessional language, making sure that conversations remain centered on transportation needs. These guardrails constrain the system’s responses to travel-specific intents, maintaining consistent professionalism across user interactions. By implementing these controls, Vxceed makes sure that this AI solution delivers reliable, business-appropriate responses that align with their customers’ high standards for secure transportation services.
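
As a rough sketch of how denied topics and word filters can be configured, the following uses the Amazon Bedrock create_guardrail API with hypothetical names, topic definitions, and blocked-response messaging; it is not Vxceed’s actual guardrail configuration.

```python
# Minimal sketch (assumed names, definitions, and messaging) of configuring
# denied topics and word filters with Amazon Bedrock Guardrails.
import boto3

bedrock = boto3.client("bedrock")

bedrock.create_guardrail(
    name="limoconnectq-demo-guardrail",  # hypothetical guardrail name
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "off-topic-chat",
                "definition": "Any discussion unrelated to ground transportation bookings or policies.",
                "type": "DENY",
            }
        ]
    },
    wordPolicyConfig={
        "managedWordListsConfig": [{"type": "PROFANITY"}]
    },
    blockedInputMessaging="I can only help with transportation-related requests.",
    blockedOutputsMessaging="I can only help with transportation-related requests.",
)
```

Constraining responses to travel-specific intents in this way keeps conversations centered on transportation needs while filtering unprofessional language.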

AI-powered tools for ground transportation optimization

LimoConnect Q also incorporates custom AI tools to enhance accuracy and automation across various transportation tasks:

  • Address geocoding and validation: AI-powered location services verify pickup and drop-off addresses, reducing errors and maintaining accurate scheduling.
  • Automated trip matching: The system analyzes historical booking data and user preferences to recommend the most suitable vehicle options.
  • Role-based access control: AI-driven security protocols enforce policies on vehicle assignments based on user roles and clearance levels.

These enhancements streamline operations, reduce manual intervention, and provide a frictionless user experience for secure transportation providers, government agencies and large enterprises.

Why Vxceed chose Amazon Bedrock

Vxceed selected Amazon Bedrock over other AI solutions because of four key advantages:

  • Enterprise-grade security and privacy: Amazon Bedrock provides private, encrypted AI environments that keep data within the customer’s virtual private cloud (VPC), maintaining compliance with strict security requirements.
  • Seamless AWS integration: LimoConnect Q runs on Vxceed’s existing AWS infrastructure, minimizing migration effort and allowing end-to-end control over data and operations.
  • Access to multiple AI models: Amazon Bedrock supports various FMs, allowing Vxceed to experiment and optimize performance across different use cases. Vxceed uses Anthropic’s Claude 3.5 Sonnet for its ability to handle sophisticated conversational interactions and complex language processing tasks.
  • Robust AI development tools: Vxceed accelerated development by using Amazon Bedrock Knowledge Bases, prompt engineering libraries and agent frameworks for efficient AI orchestration.

Business impact and future outlook

The introduction of LimoConnect Q has already demonstrated significant operational improvements, enhancing both efficiency and user experience for Vxceed’s customers, including secure transportation providers, government agencies, and enterprise clients.

  • Faster information retrieval: AI-driven document querying reduces lookup times by 15 minutes per query, ensuring quick access to critical policies.
  • Streamlined trip booking: 97% of bookings now happen digitally, removing manual workflows and enabling faster confirmations.
  • Enhanced security and compliance: AI processing remains within a private AWS environment, adhering to strict government security standards such as IRAP.

Beyond government customers, the success of LimoConnect Q powered by Amazon Bedrock has drawn strong interest from private sector transportation providers, including large fleet operators managing up to 7,000 trips per month. The ability to automate booking workflows, improve compliance tracking, and provide secure AI-driven assistance has positioned Vxceed as a leader in AI-powered ground transportation solutions.

Summary

AWS partnered with Vxceed to support their AI strategy, resulting in the development of LimoConnect Q, an innovative ground transportation management solution. Using AWS services including Amazon Bedrock and Lambda, Vxceed successfully built a secure, AI-powered solution that streamlines trip booking and document processing. Looking ahead, Vxceed plans to further refine LimoConnect Q by:

  • Optimizing AI inference costs to improve scalability and cost-effectiveness.
  • Enhancing AI guardrails to help prevent hallucinations and improve response reliability.
  • Developing advanced automation features, such as driver certification tracking and compliance auditing.

With this collaboration, Vxceed is poised to revolutionize ground transportation management, delivering secure, efficient, and AI-powered solutions for government agencies, enterprises, and private transportation providers alike.

If you are interested in implementing a similar AI-powered solution, start by understanding how to implement asynchronous AI agents using Amazon Bedrock. See Creating asynchronous AI agents with Amazon Bedrock to learn about the implementation patterns for multi-agent systems and develop secure, AI-powered solutions for your organization.


About the Authors

Deepika Kumar is a Solution Architect at AWS. She has over 13 years of experience in the technology industry and has helped enterprises and SaaS organizations build and securely deploy their workloads on the cloud. She is passionate about using generative AI in a responsible manner, whether that is driving product innovation, boosting productivity, or enhancing customer experiences.

Cyril Ovely, CTO and co-founder of Vxceed Software Solutions, leads the company’s SaaS-based logistics solutions for CPG brands. With 33 years of experience, including 22 years at Vxceed, he previously worked in analytical and process control instrumentation. An engineer by training, Cyril architects Vxceed’s SaaS offerings and drives innovation from his base in Auckland, New Zealand.

Santosh Shenoy is a software architect at Vxceed Software Solutions. He has a strong focus on system design and cloud-native development. He specializes in building scalable enterprise applications using modern technologies, microservices, and AWS services, including Amazon Bedrock for AI-driven solutions.

Read More

Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers

Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers

AI Revolution podcast | Episode 5 – outline illustration of Carey Goldberg, Peter Lee, and Dr. Isaac (Zak) Kohane

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, Lee reunites with his coauthors Carey Goldberg and Dr. Zak Kohane to review the predictions they made and reflect on what has and hasn’t materialized based on discussions with the series’ early guests: frontline clinicians, patient/consumer advocates, technology developers, and policy and ethics thinkers. Together, the coauthors explore how generative AI is being used on the ground today—from clinical note-taking to empathetic patient communication—and discuss the ongoing tensions around safety, equity, and institutional adoption. The conversation also surfaces deeper questions about values embedded in AI systems and the future role of human clinicians.


Learn more

Compared with What? Measuring AI against the Health Care We Have (Kohane)
Publication | October 2024

Medical Artificial Intelligence and Human Values (Kohane)
Publication | May 2024

Managing Patient Use of Generative Health AI (Goldberg)
Publication | December 2024

Patient Portal — When Patients Take AI into Their Own Hands (Goldberg)
Publication | April 2024

To Do No Harm — and the Most Good — with AI in Health Care (Goldberg)
Publication | February 2024

This time, the hype about AI in medicine is warranted (Goldberg)
Opinion article | April 2023

The AI Revolution in Medicine: GPT-4 and Beyond
Book | Peter Lee, Carey Goldberg, Isaac Kohane | April 2023

Transcript

[MUSIC]     

[BOOK PASSAGE]  

PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.” 

[END OF BOOK PASSAGE]  

[THEME MUSIC]     

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.     

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?      

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.


[THEME MUSIC FADES]  

The passage I read at the top is from the book’s prologue.   

When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.  

So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.  

So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.  

We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.  

[TRANSITION MUSIC] 

Carey, Zak, I’m really looking forward to this. 

CAREY GOLDBERG: It’s nice to see you, Peter.  

LEE: [LAUGHS] It’s great to see you, too. 

GOLDBERG: We missed you. 

ZAK KOHANE: The dynamic gang is back. [LAUGHTER] 

LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other. [LAUGHTER] 

In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.  

I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism in—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you? 

KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology—ARPA, Microsoft, and others—had made in the past. And yet, you were telling me a full science fiction compliant story [LAUGHTER] that something that we thought was 30 years away was happening now.  

LEE: Mm-hmm. 

KOHANE: And it was very hard for me to put together. And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible? 

LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have [LAUGHS] been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.  

After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”  

And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style [LAUGHTER] that no one would be able to read.  

So it was your idea to recruit Carey, I think, right? 

KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to reach would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing. 

LEE: So yeah. So, Carey, we contact you. How did that go? 

GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then [LAUGHTER] we’ll actually get somewhere.”  
 
So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. [LAUGHTER] And it’s fun to think about it now. 

LEE: Yeah. I think ultimately, you know, recruiting Carey, you were [LAUGHS] so important because you basically went through every single page of this book and made sure … I remember, in fact, it’s affected my writing since because you were coaching us that every page has to be a page turner. There has to be something on every page that motivates people to want to turn the page and get to the next one. 

KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.  

GOLDBERG: [LAUGHTER] No way! Really?  

KOHANE: Yes way. Yes way. Yes way. 

GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak. [LAUGHS] 

LEE: All right. So I think we have to get down to work [LAUGHS] now.  

Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems. 

And, you know, it was fun to talk to them. And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.   

What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI. 

KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier. 

But there were also some nonobvious results, which is on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there was any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.  

And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.  

LEE: Yeah. 

KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …  

LEE: Yeah. 

KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect. 

And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.  

LEE: Carey, what about you? 

GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids. 

So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions. 

LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI.

[TRANSITION MUSIC] 

So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year. 

LEE: [LAUGHS] So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now? 

GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. And in fact, you even wrote that it was helping you be a better person, right. [LAUGHS] So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.  

And I gather now we’re even entering a period when people are complaining of sycophantic models, [LAUGHS] where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be. 

LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.  

KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight. [LAUGHTER] I was bringing references. It was impressive. But in fact, it made me trust it more. 

LEE: Yeah. 

KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.   

And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s more of a, potentially, more of a signaled virtue than an actual virtue. 

GOLDBERG: Nicely put. 

LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different. 

And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?  

One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book? 

GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient. 

So actually, Peter, I have a few questions for you … 

LEE: Yeah. Mm-hmm. 

GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare? 

LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.  

So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes. 

But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that? 

KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR [electronic health record] smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?  

And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. [LAUGHTER] And he had just proposed to me a few weeks before an expensive procedure. 

And so the question is what, effectively, is motivating these models?  

LEE: Yeah, yeah.  

KOHANE: And so with several colleagues, I published a paper, basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, you are a pediatric endocrinologist. 

Should this patient receive growth hormone? And it did a very good job explaining why the patient should receive growth hormone.  

GOLDBERG: Should. Should receive it.  

KOHANE: Should. And then we asked, in a separate session, you are working for the insurance company. Should this patient receive growth hormone? And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving kids who are not growth hormone deficient, growth hormone gives only a couple of inches over many, many years, has all sorts of other issues. But here’s the point, we had 180-degree change in decision-making because of the prompt. And for that patient, tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making.  

LEE: Hmm. Yeah. 

KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is aptly central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose? 

LEE: Yeah, yeah. That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember? 

GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it. 

LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, it never, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI.  

You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do. 

GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent as far as I can tell is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were a physician or a specialist and get answers, which you can’t do yet as far as I can tell. 

LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray. 

LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer [LAUGHS] for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey. 

GOLDBERG: It’s very. [LAUGHS] Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps? 

LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.  

Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way. 

LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests? 

KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not being able to bill for the use of the AI, but this technology is so compelling to the doctors that despite everything, it’s overtaking the traditional dictation-typing routine. 

LEE: Yeah. 

GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.  

And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information. 

LEE: Yeah. 

KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, the doctors were being incredibly meticulous. [LAUGHTER] So I think … this is a Hawthorne effect, because you know you’re being monitored. And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse. 

LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong.

I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.  

And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it. 

And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that AI scribes, an AI scribe would not have done that. 

GOLDBERG: Well, depending what values you programmed into it, right, Zak? [LAUGHS] 

KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …  

LEE: Right. 

KOHANE: what’s going to happen? And so I think this issue of values … you’re right. Right now, they’re actually much more impartial. But there are going to be business plans just around aligning these things towards healthcare. In many ways, this is why I think we wrote the book so that there should be a public discussion. And what kind of AI do we want to have? Whose values do we want it to represent? 

GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that? 

Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you? 

LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams? [LAUGHS]  

GOLDBERG: Right, right, right, yeah. 

LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think. 

Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.  

Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on? 

KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of the Epic, even though it’s better in that respect than many EHRs, is horrible user interface. 

The number of clicks that you have to go to get to something. And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else, it does it so you don’t have to do the clicks. Because it’s something really soul sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?  

I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey? 

GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, [LAUGHTER] and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.  

And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone. 

LEE: Yeah. One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR [electronic health record] system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that? 

KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other [LAUGHTER] in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that.  

And that really happened. And I think that, yes, doctors do think, and I do think that we are not yet sufficiently using the artificial intelligence currently in the ambient dictation mode as much more of an independent agent saying, did you think about that?

I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists.  

LEE: Yeah. Mm-hmm. I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.  

LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.  

And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder. 

KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.  

And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter? 

LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.  

There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing pass: you go through a big database, you figure it all out, and you make a very nicely structured database that the AI can then consult later.

And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine.   
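For readers who want a concrete picture of the “structured RAG” idea Peter describes above, here is a minimal, hypothetical sketch (not from the book or the episode): a one-time pre-processing pass turns free-text notes into a small structured store, which a model can later consult instead of heuristically retrieving raw text. The note format, field names, and the `consult` helper are all illustrative assumptions.

```python
# Hypothetical sketch of "structured RAG": pre-process free-text notes into a
# structured store once, then consult that store at question time instead of
# heuristically retrieving raw text. Note format and categories are made up.
from collections import defaultdict

RAW_NOTES = [
    "2024-03-02 | medication: metformin 500 mg twice daily",
    "2024-05-10 | lab: HbA1c 7.2%",
    "2024-06-01 | medication: lisinopril 10 mg daily",
]

def build_structured_store(notes):
    """Pre-processing pass: parse each note into (date, category, detail)."""
    store = defaultdict(list)
    for note in notes:
        date, body = [part.strip() for part in note.split("|", 1)]
        category, detail = [part.strip() for part in body.split(":", 1)]
        store[category].append({"date": date, "detail": detail})
    return store

def consult(store, category):
    """Question time: look up a category directly rather than re-reading all notes."""
    return store.get(category, [])

if __name__ == "__main__":
    store = build_structured_store(RAW_NOTES)
    # The retrieved, already-structured entries would become context for the model.
    print(consult(store, "medication"))
```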

[TRANSITION MUSIC] 

Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.  

Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer. 

LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment—I actually took it as a complaint—that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this. 

GOLDBERG: [LAUGHTER] Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good.  

And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others.  

But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways, like the fact that any discharge instructions can be instantly, beautifully translated into a patient’s native language and so on. But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for.

LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog. So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about? 

GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening. 

LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens. 

KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it. 

GOLDBERG: Yes, that’s true. 

KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago.  

We actually mentioned the problem. I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get one for my own faculty—residents at MGH [Massachusetts General Hospital] can’t get a primary care doctor. And so … 

LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible. 

KOHANE: Yeah, and so therefore … and I wrote an article about this in the NEJM [New England Journal of Medicine] that medicine is in such dire trouble that we have incredible technology, incredible cures, but where the rubber hits the road, which is at primary care, we don’t have very much.  

And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?”  
 
This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that. 

LEE: Yeah. 

GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health. 

KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a $5,000- to $10,000-a-year-minimum investment. That’s inequity. 

LEE: When we wrote our book, you know, the idea was that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better and maybe it would be necessary to do that. But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask. 

KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG [electroencephalogram]. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. [LAUGHTER] Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine. 

GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.” [LAUGHS] 

So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models, I would think, are able to do that because they are broad. So what are we heading towards, do you think? What should the next book be … The end of doctors? [LAUGHS] 

KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions.  

And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in the way Hassabis says—God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those da Vinci machines, that stuff can be, I think, automated.  

And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors. 

But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions.  

LEE: Yeah. Yeah, I tend to think that, you know, there are going to be always a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it? There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust. 

And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy: 

LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do you think, Carey? 

GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer. 

LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places? 

KOHANE: So again, all these large healthcare systems are in tough shape. MGB [Mass General Brigham] would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.  

And so we’ve created what used to be a very successful beast, the health center. But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin.  

LEE: Yeah. 

KOHANE: And so I think we’re all waiting for one of the two things to happen, either a new kind of healthcare delivery system being generated or ultimately one of these systems learns how to disrupt itself.  

LEE: Yeah. All right. I think we have to move on to Episode 4. And, you know, when it came to the question of regulation, I think this is … my read is when we were writing our book, this is the part that we struggled with the most.  

GOLDBERG: We punted. [LAUGHS] We totally punted to the AI. 

LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her. 

LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear. 

GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it? 

LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are [a] human being and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world.  

And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important. 

I don’t know. Am I wrong about that? 

KOHANE: So the issue is, compared with what? Because we know for a fact that doctors who do a lot of a procedure—who do this procedure, like high-risk deliveries, all the time—have better outcomes than ones who only do a few high-risk ones. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the … 

GOLDBERG: … the attending. [LAUGHTER] 

KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. [LAUGHTER] And so it might be the best outcome if we actually, just like for human beings, say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, so just let it go. We just don’t have, practically speaking, the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be [in some sense] a lot easier. I don’t know. 

LEE: Yeah.  

GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that thank goodness Cambridge put out, the city of Cambridge, put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here.  

If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would.  

LEE: Yeah. 

GOLDBERG: Zak. 

KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation.  

And so I’m mindful of a conference that both of you attended, the RAISE conference [Responsible AI for Social and Ethical Healthcare]. The Europeans at that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI, because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.  

And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.  

GOLDBERG: It’s true. 

KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”  

LEE: Yeah.  

KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet. 

GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast. That something that makes sense today won’t in six months. So, what do you do about that? 

LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.  

But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.  

GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope. 

LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.  

And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school? 

And the answer is “doctor.” 

KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI.  

So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring.  

We haven’t put out the paper yet, but in computer science—well, in programming—we know very well the value of logging for understanding how our computer systems work.  

And there was a guy by the name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually. 

LEE: Yes. 

KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, [we] can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors.  
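As a rough illustration of the MedLog idea Zak sketches here, the snippet below shows what one hypothetical, append-only log record might capture—the context, the exact model version, the inputs, and the recommendation behind a decision—so it can be audited later, black-box style. The schema, field names, and file name are assumptions for illustration, not a proposed standard.

```python
# Hypothetical MedLog sketch: an append-only record of each AI-involved decision,
# capturing context, exact model version, inputs, and output for later audit.
# The schema and file name are illustrative assumptions only.
import json
import time
from pathlib import Path

LOG_PATH = Path("medlog.jsonl")

def log_decision(model_version, context, inputs, recommendation):
    record = {
        "timestamp": time.time(),          # when the decision was made
        "model_version": model_version,    # the exact version of the AI
        "context": context,                # e.g., clinical setting, task
        "inputs": inputs,                  # the data the model actually saw
        "recommendation": recommendation,  # what the model suggested
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")  # one JSON record per line
    return record

if __name__ == "__main__":
    log_decision(
        model_version="example-model-2025-01",
        context={"setting": "inpatient", "task": "medication dosing"},
        inputs={"drug": "exampledrug", "proposed_dose_mg": 500},
        recommendation={"dose_mg": 400, "rationale": "weight-based adjustment"},
    )
```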

GOLDBERG: Sounds great. 

LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford.

LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. I think, for me, I expected to see a lot more of that than we have. 

KOHANE: Yeah, it goes back to our conversation about Epic, or a competitor of Epic, doing that. I think we’re going to see that: having oversight over all medical orders, all orders in the system, with critique, real-time critique, where we’re aware of alert fatigue—so we don’t want too many false positives—while at the same time knowing which critical errors could immediately affect lives. I think that is going to become a product, driven by quality measures. 
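To make the real-time order critique Zak describes a bit more concrete, here is a hedged, toy sketch: every order is checked against a reference dose range, but only deviations above a severity threshold raise an alert, which is one crude way to trade off alert fatigue against catching critical errors. The drug name, range, and threshold are invented for illustration.

```python
# Toy sketch of real-time order critique with an alert-fatigue threshold:
# check each order against a reference range, but only surface alerts whose
# severity exceeds a cutoff. Drug, range, and threshold are made up.
REFERENCE_RANGES_MG = {"exampledrug": (250, 750)}  # illustrative safe dose range
ALERT_THRESHOLD = 0.20  # only alert if the dose is >20% outside the range

def critique_order(drug, dose_mg):
    low, high = REFERENCE_RANGES_MG.get(drug, (None, None))
    if low is None:
        return None  # no reference data; stay silent rather than guess
    if low <= dose_mg <= high:
        return None  # within range; no alert, to limit alert fatigue
    nearest = low if dose_mg < low else high
    severity = abs(dose_mg - nearest) / nearest
    if severity > ALERT_THRESHOLD:
        return f"ALERT: {drug} {dose_mg} mg is {severity:.0%} outside the reference range"
    return None  # minor deviation; suppressed to avoid noise

if __name__ == "__main__":
    print(critique_order("exampledrug", 950))  # large deviation -> alert
    print(critique_order("exampledrug", 760))  # small deviation -> suppressed
```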

GOLDBERG: And I think word will spread among the general public that—kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives is, well, who’s with them? Right?  

LEE: Yeah. Yup. 

GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical …  

KOHANE: By the way, that country is called the United States. 

GOLDBERG: Yes, that’s true. [LAUGHS] It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not? 

LEE: Yeah. Yeah. Last topic, just for this Episode 4. Roxana, of course, I think really made a name for herself in the AI era by writing, actually just prior to ChatGPT, some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts? 

KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible … 

LEE: Yeah, yeah.  

KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely. 

I think that we’re in a much better situation. 

GOLDBERG: Agree. 

LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak.  

KOHANE: There is. 

LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—she and her collaborators—showing AI’s ability to detect, to spot, biased decision-making by others. Are we going to see more of that? 

KOHANE: Oh, yeah, I was very pleased when, in NEJM AI [New England Journal of Medicine Artificial Intelligence], we published a piece with Marzyeh Ghassemi, and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect.  

And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune and de-bias in a variety of ways, they can do a lot better and, in fact, I think are much more … a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era. 

GOLDBERG: And thinking about Roxana’s dermatological work, and how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right. 

LEE: Yeah. Yeah, yeah. 

GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward. 

LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. [LAUGHS] I don’t think any of us could really have predicted really the extent of change in the world.  

[TRANSITION MUSIC] 

So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did. 

[THEME MUSIC] 

And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more.  

We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast or wherever you listen to your favorite podcasts.   

Until next time.  

[MUSIC FADES]

