NVIDIA Experts Share Top 5 Tips for Standing Out in the AI Job Market

With graduation season approaching, a new cohort of students is embarking on next steps, aiming to use their passions and skills to make a real, tangible impact on the world.

For many, this means — first and foremost — landing a job.

According to a survey by Inside Higher Ed, more than 60% of students say the rise of AI has affected their career plans. And in 2024, the World Economic Forum forecast that global demand for AI and machine learning specialists will grow by 40% over the next several years.

As artificial intelligence advances at unparalleled speeds, it’s an exciting, opportune time to kickstart a career in the industry.

At the NVIDIA GTC global AI conference in March, a panel of experts discussed how to launch a career in AI:

Read on for top tips from NVIDIA experts for students and recent grads looking to stand out in the AI job market.

1. Integrate Diverse Skills and Be Adaptable

Not everyone needs to have a computer science degree to work in AI.

A diverse educational and professional background can make people more adaptable and versatile — a valuable asset, especially in the rapidly evolving field of AI.

Whether a student has a studio art degree or a chemical engineering background, being an AI operator — knowing how to integrate AI into their workflows — can help them stand out from other candidates.

Sadie St. Lawrence, founder of Women in Data and CEO of the Human Machine Collaboration Institute, said during the GTC panel: “Don’t worry so much about necessarily the right skills — make sure you’re integrating the use of AI in your daily workflow.”

“AI allows you to branch outside your area of expertise,” added Carter Abdallah, senior developer technology engineer at NVIDIA. “AI can do some of the heavy lifting so nontechnical people can code, and technical people are now finding it valuable to lean into liberal arts education, so they can add human touches to their work in AI.”

Abdallah encouraged students to use AI as a tool to enhance and enable skills they might not have had access to before.

2. Find Your ‘Why’ and Align Passions

Eric Vargas, senior university recruiting manager at NVIDIA, said, “There isn’t necessarily such a thing as an ‘AI candidate,’ but there are sub-areas within AI, different domains and industries that are going to be affected by AI, which you’ll want to gain experience in, because that’s where companies are looking to implement AI.”

Autonomous vehicles, robotics, gaming and healthcare are just a few of the many areas in which AI can be applied. Students should find the industry that most aligns with their passions and gain expertise in that field.

“Pick a problem that seems impossible to solve but that you’re passionate about,” added Saudia Jones, generative AI analyst at NVIDIA and a recent graduate from San Jose State University. “Connect with ‘why’ you’re driven by this, because that’s going to be the thing that leads you to overcome the challenges and innovate.”

Learn more about how to land an internship at NVIDIA:

3. Use AI Responsibly

Use AI tools responsibly and maintain a human touch, Vargas advised, whether writing job applications and cover letters or responding to technical interview questions. This balance is crucial for building trust and ensuring the ethical use of technology.

Trustworthy AI — an approach to AI development that prioritizes safety and transparency for the people who interact with it — is key to the industry.

For example, learn more about how David Martin, a technical project manager intern on the trustworthy AI team at NVIDIA, is channeling the power of artificial intelligence for social good.

Growing up deaf, Martin experienced the limitations of existing technologies, such as the accuracy of closed captioning on streaming platforms. At NVIDIA, his responsibilities include advising engineering teams that are working on sign language AI applications and supporting the development of ethical dataset curation tools.

This combination of showcasing one’s passion and using AI for good is key to standing out in the field.

4. Build a Network and Seek Mentorship

Networking and mentorship are essential for career growth in any industry.

“Building a network and seeking mentorship in the AI space has been crucial for my career growth,” St. Lawrence said. “I’ve learned so much from connecting with experienced professionals.”

Students can easily connect with industry professionals on platforms like LinkedIn, she said. The platform uses AI to help with job searches by recommending keywords to put on resumes for specific positions and identifying transferable skills that could apply to a role.

In addition, anyone can connect with experts and take training courses through the NVIDIA Deep Learning Institute, which offers in-demand skill development, hands-on experience and expert knowledge in AI, data science, accelerated computing and more.

5. Stand Out by Being Proactive

Let’s say a candidate has done all of the above — how can they set themselves apart from other applicants?

Abdallah said, “Identify a problem you can solve and take proactive steps to address it, even before being hired.”

Vargas agreed and said the top three things recruiters at NVIDIA look for are: 1) a demonstration of the technical skills required for the role, 2) subject-matter expertise and 3) projects that exhibit thinking outside of the box.

A student could stand out from other candidates by, for example, demonstrating a self-started robotics solution built with an NVIDIA Jetson developer kit, writing a technical article that gets many views across social media or starting an AI-related club on campus.

More Resources

A panel of pioneering AI researchers also discussed their career paths at GTC:

They highlighted that in research, too, practical experience, a demonstration of persistence and resilience, and an interdisciplinary approach to problem-solving are what give candidates an edge.

And for those already on their way with their AI job search, some interview tips and tricks from NVIDIA’s University Recruiting team include:

  • Before: Read through the job description and practice applying the required technical skills. Be prepared to articulate relevant experiences clearly and concisely. Make a list of five or more questions about the position, project and team.
  • During: Respond thoughtfully and enthusiastically to the interviewer’s questions. For technical interviews, talk through any thought processes at each step of the problem. Listen carefully to how the interviewer responds — they’ll often give clues when a candidate is on track or drop subtle hints if they’re straying from the solution.
  • After: Within 24 hours of the interview, send a thank-you note to the recruiter for the role — they’ll forward it to the interviewer. After 10 business days, candidates waiting for next steps can follow up with the recruiter to inquire about their status.

Explore NVIDIA GTC session playlists for students, as well as researchers and educators.

Read More

A better training method for reinforcement learning with human feedback

Contrasting training pairs with large reward differences mitigate spurious correlations and improve the performance of direct-alignment algorithms by as much as 20% to 40%.

Reinforcement learning with human feedback (RLHF) is the standard method for aligning large language models (LLMs) with human preferences, such as preferences for nontoxic language and factually accurate responses. Recently, one of the most popular RLHF methods has been direct preference optimization, in which the LLM chooses between two output options, one of which has been labeled as preferred by human annotators.

With direct preference optimization (DPO), however, and with other, similar direct-alignment algorithms, LLMs run the risk of learning spurious correlations from the data. In toxicity datasets, for instance, it’s common for the serious, thoughtful responses to be longer than the offensive responses. During RLHF, an LLM could thus learn to prefer longer responses to shorter ones, which may not be preferable in general.

At this year’s International Conference on Learning Representations (ICLR), we presented a method for limiting such spurious correlations, which we call SeRA, for self-reviewing and alignment. First, after the first round of RLHF on human-annotated data, we use the LLM itself to generate additional training examples. Then we use the LLM’s output probabilities to assess the strength of preference for training pairs, keeping only those where the preferred response is strongly preferred.

To evaluate our approach, we compare a model trained using SeRA to three baseline models on four benchmark datasets. For each test input, we compare our model’s output to that of each of the baselines, and we use an off-the-shelf LLM to choose the better response. The SeRA-trained model’s win rate in these pairwise comparisons is higher than that of all three baselines across the board, sometimes by as much as 20% to 40%.

Direct preference optimization

Reinforcement learning is a trial-and-error method in which an agent interacts with the world and, depending on the actions it takes, receives greater or lesser rewards. Over time, the agent attempts to learn a policy that maximizes its cumulative reward.

In classical reinforcement learning, the interaction with the world can be literal: a robot, for instance, might receive a large reward for successfully navigating to a prescribed location and a negative reward for bumping into a wall. In RLHF, however, the reward depends on how well an LLM’s output aligns with a paradigm case specified by a human.

With traditional RLHF, the reward is calculated by a separate model, which is also trained on human-annotated data. But this is a time-consuming approach that doesn’t scale well. With DPO, there’s no need for a second model: the LLM receives the reward if it picks the human-preferred output and not if it doesn’t.

The drawback of DPO is that it treats all training pairs equally: the reward is the same whether the preferred output is strongly preferred or only mildly preferred. This increases the chances that the model will learn spurious correlations.
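To see why, it helps to write the objective down. The standard DPO loss (from the original DPO formulation, shown here for reference rather than taken from this post) is

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right],

where \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference model, y_w and y_l are the preferred and non-preferred responses, and \beta is a scaling constant. The human label enters only through which response plays the role of y_w; the loss carries no information about how strongly that response was preferred.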

If, for instance, choosing strongly toxic responses incurred a greater penalty than choosing mildly toxic responses, the model could infer that toxicity and not response length was the relevant feature of the training examples. DPO irons out those differences; SeRA reintroduces them.

With SeRA, we first perform conventional DPO, using a dataset of human-annotated example pairs. After this first pass through the data, the LLM has learned something about the types of outputs that humans prefer.

We then use the updated model to generate a new set of training examples. For every generated response pair, we assign each response a preference score, which is based on the updated model’s probability of generating that response. We then keep only those pairs in which the preferred response scores significantly higher than the non-preferred response.

With SeRA (self-reviewing and alignment), the updated model generates a new response pair (a winner, y_w, and a loser, y_l) for each sample input (x). Each response receives a preference score, based on the updated model’s probability of generating it. Pairs in which the score of the preferred response is significantly higher than that of the non-preferred response (green) are kept; the others (red) are discarded.

Using the same metric, we next filter the data in the original, human-annotated dataset. Then we combine filtered samples from the original dataset with filtered samples from our new, generated dataset and perform DPO once again. This process repeats, with the generated samples constituting a larger and larger fraction of the dataset, until model performance converges.
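To make the filtering step concrete, here is a minimal Python sketch. The score_response helper is hypothetical: it stands in for a length-normalized log-probability of a response under the updated model, and the margin threshold is illustrative; the paper's exact scoring rule and threshold may differ.

def score_response(model, prompt, response):
    """Hypothetical helper: length-normalized log-probability of `response`
    given `prompt` under the updated model (a stand-in for the preference score)."""
    logprobs = model.token_logprobs(prompt, response)  # assumed model API
    return sum(logprobs) / max(len(logprobs), 1)

def filter_strongly_preferred(model, pairs, margin=1.0):
    """Keep only pairs whose preferred response is strongly preferred.

    `pairs` is an iterable of (prompt, preferred, rejected) triples; `margin`
    is the minimum score gap required to keep a pair for the next DPO round.
    """
    kept = []
    for prompt, preferred, rejected in pairs:
        gap = score_response(model, prompt, preferred) - score_response(model, prompt, rejected)
        if gap >= margin:
            kept.append((prompt, preferred, rejected))
    return kept

The same filter is applied to both the human-annotated pairs and the newly generated pairs before the next round of DPO.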

The intuition here is that if a dataset is designed to represent some contrast, but it also contains spurious correlations, then the intended contrast between, say, toxic and non-toxic data will be significantly greater than the unintended contrast between, say, long and short responses.

This assumption held for the four benchmark datasets we used to evaluate our method, and we think that it’s a plausible assumption for other spurious correlations. But there could be instances in which it doesn’t hold, so in applications of the SeRA method, the model’s convergence behavior should be monitored.

While we used DPO in our experiments, in our paper we also demonstrate how to generalize our method to other direct-alignment algorithms. Finally, there’s some risk that, when using model-generated data to train a model, we could get into a feedback loop in which the model overamplifies some aspect of the initial dataset. To guard against this, in each pass through the data, the model’s reward is based not only on the current iteration but on past iterations as well, to ensure continuity in the characteristic features of the training data.

Acknowledgments: Sravan Bodapati

SeRA

Research areas: Machine learning, Conversational AI

Tags: Large language models (LLMs), Reinforcement learning, Contrastive learning

Read More

Recap of the PyTorch Korea User Group Meetup: A Technical Conference with a PyTorch Core Maintainer

At the end of March, the PyTorch Korea User Group hosted a special meetup that brought together prominent speakers for deep discussions on the PyTorch core and its broader ecosystem. With the event more than doubling in size compared to past gatherings, we were able to connect with even more developers and share insights. Huge thanks to goorm for sponsoring the fantastic venue! 😄

This recap is for those who couldn’t attend in person, as well as for participants who want to revisit the energy and insights of the day. The event featured experts in core PyTorch, AI accelerators, inference optimization, and large language model development. Below is a quick overview of the key sessions that anchored the conference.

1️⃣ Jerry Lee | PyTorch Foundation

Representing the PyTorch Foundation, part of the Linux Foundation, Jaeung provided an overview of how PyTorch is driving core open source technologies forward. He shared PyTorch’s growth story, the many global projects currently in motion, and the ecosystem’s impressive 20%+ annual growth. The session also covered how the foundation operates, how member organizations are involved, and upcoming plans that are particularly useful for practitioners.

2️⃣ Alban Desmaison | PyTorch Roadmap

Alban shared the design philosophy behind PyTorch and Meta’s official contribution roadmap (link). He provided a deep technical dive into the differences between Eager and Compiled modes, especially breaking down the backend architecture of device Eager execution. Practical tools and improvements were also introduced—such as memory profilers, enhanced custom operator support, and pinned memory optimizations.

3️⃣ Hongseok Kim | PyTorch on Rebellions AI Accelerators: Status

Rebellions is building runtime integration for their proprietary NPU architecture, fully aligned with the structural changes in PyTorch 2.0. This talk introduced the performance and scalability of their upcoming chip, their integration strategy with the PyTorch runtime, and challenges in supporting Eager Mode. Hongseok also previewed their roadmap toward releasing these features within the year.

4️⃣ Kyujin Cho | Backend.AI: A Unified Platform for All AI Accelerators

Backend.AI abstracts and integrates various AI accelerators into a unified workflow. As the diversity of accelerator architectures grows, the need for portability and infrastructure unification becomes even more important. This session showcased features across development and operations—from NPU scheduling and resource allocation to monitoring. Backend.AI currently supports accelerators from NVIDIA, Intel, Tenstorrent, Rebellions, and more.

5️⃣ Taeho Kim | Optimizing & Deploying Models Across Multiple Chipsets Using NetsPresso

This talk focused on the challenges of inference in real-world industrial applications of AI models. As new state-of-the-art models emerge rapidly, there’s a growing need for environments that can quickly validate device compatibility—ideally with one-click ease. NetsPresso is actively working on a static graph representation compatible with PyTorch, offering efficient support for model development, optimization, and testing.

6️⃣ Jungyeop Lee | The Journey to Reproduce Deepseek-R1

Jungyeop took us through his journey of reproducing Deepseek, a large language model—an effort that involved 201 experiments. He shared real-world lessons from training with Korean data, tokenizer modifications, and fine-tuning strategies. His practical insights and next steps were especially valuable for those building or re-implementing large models from scratch.

7️⃣ Sol Kim | A journey from TCP architecture to production-level LLMs

Sol presented an integrated optimization approach to deploying large models using the TCP (Tensor Contraction Processor) architecture, which supports tensor contraction at the hardware level. The talk highlighted optimization techniques built on hardware abstraction layers (HALs) and bottom-up integration strategies with PyTorch—offering a hybrid hardware-software perspective.

💡 Panel Talk & Q&A 💡

The event wrapped up with an engaging panel discussion. Attendees asked sharp questions, and the speakers offered insightful answers. It was a powerful moment that captured the community’s enthusiasm for PyTorch and their hunger for deeper technical understanding.

Final Thoughts

Since our first offline meetup in October 2022, the PyTorch Korea User Group has held five major technical conferences. Each event deepens our appreciation for the scale and depth of the PyTorch ecosystem. With perspectives from users, contributors, and ecosystem builders, the stories we share are only growing—and we’re committed to continuing this journey together.

See you at the next conference—with even more exciting talks to come! 🙌

Read More

PyTorch Day France Featured Sessions: A Defining Moment for Open Source AI

PyTorch Day France offers a front-row seat to the future of open source AI. Taking place 7 May at Station F in Paris and co-located with GOSIM AI Paris, this one-day event will bring together developers, researchers, and industry leaders for a day of technical sessions, real-world insights, and community exchange.

🌍 A Major Milestone for the PyTorch Foundation

This event marks the very first PyTorch Day, launching a new international series hosted annually in different regions to convene AI researchers, developers, engineers, and enthusiasts. PyTorch Days are designed to spotlight open source AI advancements, foster community collaboration, and provide a forum to learn about active, high-impact AI projects built using PyTorch.

PyTorch Day France also represents a pivotal moment in the PyTorch Foundation’s journey. With its recent expansion into an umbrella foundation, PyTorch is now positioned to support a broader ecosystem of trusted, community-driven AI projects across the full AI lifecycle.

At PyTorch Day France, you’ll hear directly from PyTorch Foundation Executive Director, Matt White, about this transition—and get a first look at some exciting announcements.

🎟️ Registration Details

Register now with code PYTORCH for free access to the full day of PyTorch Day France sessions, plus GOSIM AI Paris.

🔗Two events, one registration—double the sessions, double the innovation.
Register here

📅 Featured Sessions

The day’s agenda includes deep technical dives and applied AI use cases from across the community; view the full schedule for the complete list of talks.

Whether you’re a contributor, practitioner, or simply curious about what’s ahead, PyTorch Day France is an opportunity to connect with the community and shape what’s next for our ecosystem.

Read More

Best practices for Meta Llama 3.2 multimodal fine-tuning on Amazon Bedrock

Multimodal fine-tuning represents a powerful approach for customizing foundation models (FMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or particular output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business. Our experiments show that fine-tuned Meta Llama 3.2 models can achieve up to 74% improvements in accuracy scores compared to their base versions with prompt optimization on specialized visual understanding tasks. Amazon Bedrock now offers fine-tuning capabilities for Meta Llama 3.2 multimodal models, so you can adapt these sophisticated models to your unique use case.

In this post, we share comprehensive best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Our recommendations are based on extensive experiments using public benchmark datasets across various vision-language tasks, including visual question answering, image captioning, and chart interpretation and understanding. By following these guidelines, you can fine-tune smaller, more cost-effective models to achieve performance that rivals or even surpasses much larger models—potentially reducing both inference costs and latency, while maintaining high accuracy for your specific use case.

Recommended use cases for fine-tuning

Meta Llama 3.2 multimodal fine-tuning excels in scenarios where the model needs to understand visual information and generate appropriate textual responses. Based on our experimental findings, the following use cases demonstrate substantial performance improvements through fine-tuning:

  • Visual question answering (VQA) – Customization enables the model to accurately answer questions about images.
  • Chart and graph interpretation – Fine-tuning allows models to comprehend complex visual data representations and answer questions about them.
  • Image captioning – Fine-tuning helps models generate more accurate and descriptive captions for images.
  • Document understanding – Fine-tuning is particularly effective for extracting structured information from document images. This includes tasks like form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams. When working with documents, note that Meta Llama 3.2 processes documents as images (such as PNG format), not as native PDFs or other document formats. For multi-page documents, each page should be converted to a separate image and processed individually, as shown in the conversion sketch after this list.
  • Structured output generation – Fine-tuning can teach models to output information in consistent JSON formats or other structured representations based on visual inputs, making integration with downstream systems more reliable.
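As noted in the document understanding bullet, multi-page documents need to be split into one image per page before training or inference. Here is a minimal sketch of that preprocessing, assuming the pdf2image package (which wraps the poppler system library) is available; file naming and DPI are illustrative choices.

from pathlib import Path
from pdf2image import convert_from_path  # requires the poppler system package

def pdf_to_page_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each page of a PDF as a separate PNG and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)
    paths = []
    for i, page in enumerate(pages, start=1):
        path = out / f"{Path(pdf_path).stem}_page{i}.png"
        page.save(path, "PNG")
        paths.append(str(path))
    return paths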

One notable advantage of multimodal fine-tuning is its effectiveness with mixed datasets that contain both text-only and image and text examples. This versatility allows organizations to improve performance across a range of input types with a single fine-tuned model.

Prerequisites

To use this feature, make sure that you have satisfied the following requirements:

  • An active AWS account.
  • Meta Llama 3.2 models enabled in your Amazon Bedrock account. You can confirm that the models are enabled on the Model access page of the Amazon Bedrock console.
  • As of writing this post, Meta Llama 3.2 model customization is available in the US West (Oregon) AWS Region. Refer to Supported models and Regions for fine-tuning and continued pre-training for updates on Regional availability and quotas.
  • The required training dataset (and optional validation dataset) prepared and stored in Amazon Simple Storage Service (Amazon S3).

To create a model customization job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with the following permissions (for more details, see Create a service role for model customization):

The following code is the trust relationship, which allows Amazon Bedrock to assume the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": <account-id>
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:<region>:account-id:model-customization-job/*" 
                }
            }
        }
    ] 
}
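As a sketch of how this role could be created programmatically with boto3 (the role name is a hypothetical placeholder, and the linked service role documentation remains the authoritative reference):

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {"aws:SourceAccount": "<account-id>"},
            "ArnEquals": {"aws:SourceArn": "arn:aws:bedrock:<region>:<account-id>:model-customization-job/*"},
        },
    }],
}

# Hypothetical role name; you also need to attach a policy granting read/write
# access to the S3 locations holding your training, validation, and output data.
role = iam.create_role(
    RoleName="BedrockLlama32CustomizationRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Service role assumed by Amazon Bedrock for model customization jobs",
)
print(role["Role"]["Arn"])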

Key multimodal datasets and experiment setup

To develop our best practices, we conducted extensive experiments using three representative multimodal datasets:

  • LLaVA-Instruct-Mix-VSFT – This comprehensive dataset contains diverse visual question-answering pairs specifically formatted for vision-language supervised fine-tuning. The dataset includes a wide variety of natural images paired with detailed instructions and high-quality responses.
  • ChartQA – This specialized dataset focuses on question answering about charts and graphs. It requires sophisticated visual reasoning to interpret data visualizations and answer numerical and analytical questions about the presented information.
  • Cut-VQAv2 – This is a carefully curated subset of the VQA dataset, containing diverse image-question-answer triplets designed to test various aspects of visual understanding and reasoning.

Our experimental approach involved systematic testing with different sample sizes (ranging from 100 to 10,000 samples) from each dataset to understand how performance scales with data quantity. We fine-tuned both Meta Llama 3.2 11B and Meta Llama 3.2 90B models, using Amazon Bedrock Model Customization, to compare the impact of model size on performance gains. The models were evaluated using the SQuAD F1 score metric, which measures the word-level overlap between generated responses and reference answers.
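For readers unfamiliar with the metric, SQuAD-style F1 treats the generated response and the reference answer as bags of words and takes the harmonic mean of precision and recall over their overlap. A minimal sketch follows (whitespace tokenization only; the official SQuAD script also lowercases and strips punctuation and articles):

from collections import Counter

def squad_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a generated response and a reference answer."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: two of three tokens overlap, so precision = recall = 2/3 and F1 ~= 0.67
print(round(squad_f1("Medium car petrol", "Medium car (petrol)"), 2))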

Best practices for data preparation

The quality and structure of your training data fundamentally determine the success of fine-tuning. Our experiments revealed several critical insights for preparing effective multimodal datasets:

  • Data structure – You should use a single image per example rather than multiple images. Our research shows this approach consistently yields superior performance in model learning. With one image per example, the model forms clearer associations between specific visual inputs and corresponding textual outputs, leading to more accurate predictions across various tasks. Although we recommend single-image training examples for optimal results, you can include multiple images per training record based on your use case. Refer to Model requirements for training and validation datasets for detailed data preparation requirements.
  • Start small, scale as needed – Larger datasets generally produce better results, but initial gains are often substantial even with minimal data. Our experiments demonstrate that even small datasets (approximately 100 samples) yield significant performance improvements over base models. For specialized use cases, we recommend starting with approximately 100 high-quality examples, which can deliver meaningful improvements, then scaling up as resources allow to further enhance performance. This approach makes multimodal fine-tuning accessible even with limited data collection resources.
  • Annotation quality – This emerges as another critical factor, with the precision and consistency of text associated with images dramatically affecting fine-tuning results. Investing time in creating accurate, detailed annotations that reflect the desired output format yields substantial dividends in model performance. For complex reasoning tasks like chart interpretation, comprehensive annotations should capture the entire reasoning process rather than merely providing the final answer, enabling the model to learn sophisticated analytical pathways.
    • Quality annotation (ChartQA as an example):

      • Question: Which mode of transport has CO2 emissions value 191.6 g?
      • Good annotation: Medium car (petrol)
      • Bad annotation: Medium car (diesel)
    • Quality annotation (Cut-VQAv2 as an example):

      • Question: What does the sign in the window read?
      • Good annotation: ATM inside
      • Bad annotation: a clock
    • Quality annotation (LLaVA as an example):

      •  Question:
        • User: What type of building is featured in the image?
        • Assistant: The image features a historic church with two tall towers, which is a large and beautiful building.
        • User: Where is the church located in the city?
        • Assistant: The historic church is located at an intersection in the city.
        • User: Is there a street sign in the image? If so, can you describe it?
        • Assistant: Yes, there is a small street sign in the image, with the large and beautiful church in the background.
        • User: Are there any cars or other vehicles visible in the image?
        • Assistant: Yes, there are cars on the road near the old, large building in the city.
        • User: Is there any celebration or event happening in the area?
      • Good annotation:
        • Assistant: It appears that people are celebrating Cinco de Mayo in this area.
      • Bad annotation:
        • Assistant: People gather annually to celebrate National Pizza Day by launching tacos into orbit from the church rooftops.
  • Validation data – We recommend allocating 10–20% of the dataset for validation purposes (a simple split sketch follows this list). Amazon Bedrock customization outputs validation loss metrics throughout the training process, allowing you to assess model convergence and potential overfitting without conducting extensive inference benchmarks. These validation metrics serve as early indicators of how your fine-tuned model performs on unseen data, providing additional performance insights during fine-tuning.
  • Formatting consistency – Consistency throughout your dataset further enhances learning efficiency. Standardizing the structure of training examples, particularly how images are referenced within the text, helps the model develop stable patterns for interpreting the relationship between visual and textual elements. This consistency enables more reliable learning across diverse examples and facilitates better generalization to new inputs during inference. Importantly, make sure that the data you plan to use for inference follows the same format and structure as your training data; significant differences between training and testing inputs can reduce the effectiveness of the fine-tuned model.
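As mentioned in the validation data bullet, a simple way to carve out the recommended 10–20% validation split from a JSONL training file is sketched below. It assumes one training record per line; the file names are placeholders.

import json
import random

def split_train_validation(input_path: str, val_fraction: float = 0.15, seed: int = 42):
    """Shuffle a JSONL dataset and write separate train/validation files."""
    with open(input_path) as f:
        records = [json.loads(line) for line in f]
    random.Random(seed).shuffle(records)
    n_val = max(1, int(len(records) * val_fraction))
    splits = {"validation.jsonl": records[:n_val], "train.jsonl": records[n_val:]}
    for name, subset in splits.items():
        with open(name, "w") as f:
            for record in subset:
                f.write(json.dumps(record) + "\n")

split_train_validation("dataset.jsonl")  # placeholder input file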

Configuring fine-tuning parameters

When fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock, you can configure the following key parameters to optimize performance for your specific use case (a sketch of launching a job with these settings follows the list):

  • Epochs – The number of complete passes through your training dataset significantly impacts model performance. Our findings suggest:
    • For smaller datasets (fewer than 500 examples): Consider using more epochs (7–10) to allow the model sufficient learning opportunities with limited data. With the ChartQA dataset at 100 samples, increasing from 3 to 8 epochs improved F1 scores by approximately 5%.
    • For medium datasets (500–5,000 examples): The default setting of 5 epochs typically works well, balancing effective learning with training efficiency.
    • For larger datasets (over 5,000 examples): You might achieve good results with fewer epochs (3–4), because the model sees sufficient examples to learn patterns without overfitting.
  • Learning rate – This parameter controls how quickly the model adapts to your training data, with significant implications for performance:
    • For smaller datasets: Lower learning rates (5e-6 to 1e-5) can help prevent overfitting by making more conservative parameter updates.
    • For larger datasets: Slightly higher learning rates (1e-5 to 5e-5) can achieve faster convergence without sacrificing quality.
    • If uncertain: Start with a learning rate of 1e-5 (the default), which performed robustly across most of our experimental conditions.
  • Behind-the-scenes optimizations – Through extensive experimentation, we’ve optimized implementations of Meta Llama 3.2 multimodal fine-tuning in Amazon Bedrock for better efficiency and performance. These include batch processing strategies, LoRA configuration settings, and prompt masking techniques that improved fine-tuned model performance by up to 5% compared to open-source fine-tuning recipe performance. These optimizations are automatically applied, allowing you to focus on data quality and the configurable parameters while benefiting from our research-backed tuning strategies.
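As referenced above, here is a sketch of how these settings might be passed when launching a customization job with boto3. The job and model names, role ARN, model identifier, S3 URIs, and hyperparameter key names are placeholders or assumptions; consult the Amazon Bedrock model customization documentation for the exact values supported for Meta Llama 3.2.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="llama32-11b-chartqa-ft",                 # placeholder names
    customModelName="llama32-11b-chartqa",
    roleArn="arn:aws:iam::<account-id>:role/BedrockLlama32CustomizationRole",
    baseModelIdentifier="<meta-llama-3-2-11b-instruct-model-id>",
    customizationType="FINE_TUNING",
    hyperParameters={                                 # key names are assumptions; values are strings
        "epochCount": "5",
        "learningRate": "0.00001",
    },
    trainingDataConfig={"s3Uri": "s3://<bucket>/train.jsonl"},
    validationDataConfig={"validators": [{"s3Uri": "s3://<bucket>/validation.jsonl"}]},
    outputDataConfig={"s3Uri": "s3://<bucket>/output/"},
)
print(response["jobArn"])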

Model size selection and performance comparison

Choosing between Meta Llama 3.2 11B and Meta Llama 3.2 90B for fine-tuning presents an important decision that balances performance against cost and latency considerations. Our experiments reveal that fine-tuning dramatically enhances performance regardless of model size. Looking at ChartQA as an example, the 11B base model improved from 64.1 with prompt optimization to a 69.5 F1 score with fine-tuning, an 8.4% increase, whereas the 90B model improved from 64.0 to a 71.9 F1 score (a 12.3% increase). For Cut-VQAv2, the 11B model improved from 42.17 to a 73.2 F1 score (a 74% increase) and the 90B model improved from 67.4 to 76.5 (a 13.5% increase). These substantial gains highlight the transformative impact of multimodal fine-tuning even before considering model size differences.

The following visualization demonstrates how these fine-tuned models perform across different datasets and training data volumes.

The visualization demonstrates that the 90B model (orange bars) consistently outperforms the 11B model (blue bars) across all three datasets and training sizes. This advantage is most pronounced in complex visual reasoning tasks such as ChartQA, where the 90B model achieves a 71.9 F1 score compared to 69.5 for the 11B model at 10,000 samples. Both models show improved performance as training data increases, with the most dramatic gains observed in the LLaVA dataset, where the 11B model improves from 76.2 to 82.4 F1 score and the 90B model improves from 76.6 to 83.1 F1 score when scaling from 100 to 10,000 samples.

An interesting efficiency pattern emerges when comparing across sample sizes: in several cases, the 90B model with fewer training samples outperforms the 11B model with significantly more data. For instance, in the Cut-VQAv2 dataset, the 90B model trained on just 100 samples (72.9 F1 score) exceeds the performance of the 11B model trained on 1,000 samples (68.6 F1 score).

For optimal results, we recommend selecting the 90B model for applications demanding maximum accuracy, particularly with complex visual reasoning tasks or limited training data. The 11B model remains an excellent choice for balanced applications where resource efficiency is important, because it still delivers substantial improvements over base models while requiring fewer computational resources.

Conclusion

Fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock offers organizations a powerful way to create customized AI solutions that understand both visual and textual information. Our experiments demonstrate that following best practices—using high-quality data with consistent formatting, selecting appropriate parameters, and validating results—can yield dramatic performance improvements across various vision-language tasks. Even with modest datasets, fine-tuned models can achieve remarkable enhancements over base models, making this technology accessible to organizations of all sizes.

Ready to start fine-tuning your own multimodal models? Explore our comprehensive code samples and implementation examples in our GitHub repository. Happy fine-tuning!


About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Karel Mundnich is a Sr. Applied Scientist in AWS Agentic AI. He has previously worked in AWS Lex and AWS Bedrock, where he worked in speech recognition, speech LLMs, and LLM fine-tuning. He holds a PhD in Electrical Engineering from the University of Southern California. In his free time, he enjoys skiing, hiking, and cycling.

Marcelo Aberle is a Sr. Research Engineer at AWS Bedrock. In recent years, he has been working at the intersection of science and engineering to enable new AWS service launches. This includes various LLM projects across Titan, Bedrock, and other AWS organizations. Outside of work, he keeps himself busy staying up-to-date on the latest GenAI startups in his adopted home city of San Francisco, California.

Jiayu Li is an Applied Scientist at AWS Bedrock, where he contributes to the development and scaling of generative AI applications using foundation models. He holds a Ph.D. and a Master’s degree in computer science from Syracuse University. Outside of work, Jiayu enjoys reading and cooking.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Jennifer Zhu is a Senior Applied Scientist at AWS Bedrock, where she helps building and scaling generative AI applications with foundation models. Jennifer holds a PhD degree from Cornell University, and a master degree from University of San Francisco. Outside of work, she enjoys reading books and watching tennis games.

Read More

Extend large language models powered by Amazon SageMaker AI using Model Context Protocol

Organizations implementing agents and agent-based systems often experience challenges such as implementing multiple tools, function calling, and orchestrating the workflows of the tool calling. An agent uses a function call to invoke an external tool (like an API or database) to perform specific actions or retrieve information it doesn’t possess internally. These tools are integrated as an API call inside the agent itself, leading to challenges in scaling and tool reuse across an enterprise. Customers looking to deploy agents at scale need a consistent way to integrate these tools, whether internal or external, regardless of the orchestration framework they are using or the function of the tool.

Model Context Protocol (MCP) aims to standardize how these channels, agents, tools, and customer data can be used by agents, as shown in the following figure. For customers, this translates directly into a more seamless, consistent, and efficient experience compared to dealing with fragmented systems or agents. By making tool integration simpler and standardized, customers building agents can now focus on which tools to use and how to use them, rather than spending cycles building custom integration code. We will deep dive into the MCP architecture later in this post.

For MCP implementation, you need a scalable infrastructure to host these servers and an infrastructure to host the large language model (LLM), which will perform actions with the tools implemented by the MCP server. Amazon SageMaker AI provides the ability to host LLMs without worrying about scaling or managing the undifferentiated heavy lifting. You can deploy your model or LLM to SageMaker AI hosting services and get an endpoint that can be used for real-time inference. Moreover, you can host MCP servers on the compute environment of your choice from AWS, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda, according to your preferred level of managed service—whether you want to have complete control of the machine running the server, or you prefer not to worry about maintaining and managing these servers.

In this post, we discuss the following topics:

  • Understanding the MCP architecture, why you should use the MCP compared to implementing microservices or APIs, and two popular ways of implementing MCP using LangGraph adapters:
    • FastMCP for prototyping and simple use cases
    • FastAPI for complex routing and authentication
  • Recommended architecture for scalable deployment of MCP
  • Using SageMaker AI with FastMCP for rapid prototyping
  • Implementing a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

Understanding MCP

Let’s deep dive into the MCP architecture. Developed by Anthropic as an open protocol, the MCP provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture (as illustrated in the following screenshot), MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

  • Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), or other AI applications
  • Client – Protocol clients that maintain one-to-one connections with servers
  • Server – Lightweight programs that expose capabilities through standardized MCP or act as tools
  • Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect to

Based on these components, we can define the protocol as the communication backbone connecting the MCP client and server within the architecture, which includes the set of rules and standards defining how clients and servers should interact, what messages they exchange (using JSON-RPC 2.0), and the roles of different components.
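To make the protocol concrete, the exchange below shows, as Python dictionaries, roughly what a JSON-RPC 2.0 request and response for listing tools look like. The field contents are illustrative, and the MCP specification is the authoritative source for method names and schemas.

# Client -> server: ask which tools the server exposes
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

# Server -> client: the advertised tools (illustrative content)
list_tools_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "top_song",
                "description": "Get the most popular song played on a radio station.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"sign": {"type": "string"}},
                    "required": ["sign"],
                },
            }
        ]
    },
}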

Now let’s understand the MCP workflow and how it interacts with an LLM to deliver you a response by using an example of a travel agent. You ask the agent to “Book a 5-day trip to Europe in January and we like warm weather.” The host application (acting as an MCP client) identifies the need for external data and connects through the protocol to specialized MCP servers for flights, hotels, and weather information. These servers return the relevant data through the MCP, which the host then integrates with the original prompt, providing enriched context to the LLM to generate a comprehensive, augmented response for the user. The following diagram illustrates this workflow.

When to use MCP instead of implementing microservices or APIs

MCP marks a significant advancement compared to traditional monolithic APIs and intricate microservices architectures. Traditional APIs often bundle the functionalities together, leading to challenges where scaling requires upgrading the entire system, updates carry high risks of system-wide failures, and managing different versions for various applications becomes overly complex. Although microservices offer more modularity, they typically demand separate, often complex, integrations for each service and intricate management overhead.

MCP overcomes these limitations by establishing a standardized client-server architecture specifically designed for efficient and secure integration. It provides a real-time, two-way communication interface enabling AI systems to seamlessly connect with diverse external tools, API services, and data sources using a “write once, use anywhere” philosophy. Using transports like standard input/output (stdio) or streamable HTTP under the unifying JSON-RPC 2.0 standard, MCP delivers key advantages such as superior fault isolation, dynamic service discovery, consistent security controls, and plug-and-play scalability, making it exceptionally well-suited for AI applications that require reliable, modular access to multiple resources.

FastMCP vs. FastAPI

In this post, we discuss two different approaches for implementing MCP servers: FastAPI with SageMaker, and FastMCP with LangGraph. Both are fully compatible with the MCP architecture and can be used interchangeably, depending on your needs. Let’s understand the difference between both.

FastMCP is used for rapid prototyping, educational demos, and scenarios where development speed is a priority. It’s a lightweight, opinionated wrapper built specifically for quickly standing up MCP-compliant endpoints. It abstracts away much of the boilerplate—such as input/output schemas and request handling—so you can focus entirely on your model logic.

For use cases where you need to customize request routing, add authentication, or integrate with observability tools like Langfuse or Prometheus, FastAPI gives you the flexibility to do so. FastAPI is a full-featured web framework that gives you finer-grained control over the server behavior. It’s well-suited for more complex workflows, advanced request validation, detailed logging, middleware, and other production-ready features.

You can safely use either approach in your MCP servers—the choice depends on whether you prioritize simplicity and speed (FastMCP) or flexibility and extensibility (FastAPI). Both approaches conform to the same interface expected by agents in the LangGraph pipeline, so your orchestration logic remains unchanged.

Solution overview

In this section, we walk through a reference architecture for scalable deployment of MCP servers and MCP clients, using SageMaker AI as the hosting environment for the foundation models (FMs) and LLMs. Although this architecture uses SageMaker AI as its reasoning core, it can be quickly adapted to support Amazon Bedrock models as well. The following diagram illustrates the solution architecture.

The architecture decouples the client from the server by using streamable HTTP as the transport layer. By doing this, clients and servers can scale independently, making it a great fit for serverless orchestration powered by Lambda, AWS Fargate for Amazon ECS, or Fargate for Amazon EKS. An additional benefit of decoupling is that you can better control authorization of applications and users by controlling AWS Identity and Access Management (IAM) permissions of clients and servers separately, and propagating user access to the backend. If you’re running the client and server as a monolithic architecture on the same compute, we suggest instead using stdio as the transport layer to reduce networking overhead.

Use SageMaker AI with FastMCP for rapid prototyping

With the architecture defined, let’s analyze the application flow as shown in the following figure.

In terms of usage patterns, MCP shares a logic similar to tool calling, with an initial addition to discover the available tools:

  1. The client connects to the MCP server and obtains a list of available tools.
  2. The client invokes the LLM using a prompt engineered with the list of tools available on the MCP server (message of type “user”).
  3. The LLM reasons with respect to which ones it needs to call and how many times, and replies (“assistant” type message).
  4. The client asks the MCP server to execute the tool calling and provides the result to the LLM (“user” type message).
  5. This loop iterates until a final answer is reached and can be given back to the user.
  6. The client disconnects from the MCP server.

Let’s start with the MCP server definition. To create an MCP server, we use the official Model Context Protocol Python SDK. For example, let’s create a simple server with just one tool. The tool will simulate searching for the most popular song played at a radio station, and return it in a Python dictionary. Make sure to add a proper docstring and input/output typing, so that both the server and client can discover and consume the resource correctly.

from mcp.server.fastmcp import FastMCP

# instantiate an MCP server
mcp = FastMCP("Radio Station Server")

# DEFINE TOOLS
@mcp.tool()
def top_song(sign: str) -> dict:
    """Get the most popular song played on a radio station"""
    # In this example, we simulate the return,
    # but you should replace this with your business logic
    return {
        "song": "In the end",
        "author": "Linkin Park"
    }

# Additional tools can be registered the same way:
# @mcp.tool()
# def ...

if __name__ == "__main__":
    # Start the MCP server using SSE transport
    mcp.run(transport="sse")

As we discussed earlier, MCP servers can be run on AWS compute services—Amazon EC2, Amazon ECS, Amazon EKS, or Lambda—and can then be used to safely access other resources in the AWS Cloud, for example databases in virtual private clouds (VPCs) or an enterprise API, as well as external resources. For example, a simple way to deploy an MCP server is to use Lambda support for Docker images to install the MCP dependencies on the Lambda function, or to do the same on Fargate.

With the server set up, let’s turn our focus to the MCP client. Communication starts with the MCP client connecting to the MCP server using SSE transport:

from mcp import ClientSession
from mcp.client.sse import sse_client

async def connect_to_sse_server(self, server_url: str):
    """Connect to an MCP server running with SSE transport"""
    # Store the context managers so they stay alive
    self._streams_context = sse_client(url=server_url)
    streams = await self._streams_context.__aenter__()

    self._session_context = ClientSession(*streams)
    self.session: ClientSession = await self._session_context.__aenter__()

    # Initialize
    await self.session.initialize()

    # List available tools to verify connection
    print("Initialized SSE client...")
    print("Listing tools...")
    response = await self.session.list_tools()
    tools = response.tools
    print("\nConnected to server with tools:", [tool.name for tool in tools])

When connecting to the MCP server, a good practice is to ask the server for a list of available tools with the list_tools() API. With the tool list and their descriptions, we can then define a system prompt for tool calling:

system_message = (
    "You are a helpful assistant with access to these tools:\n\n"
    f"{tools_description}\n"
    "Choose the appropriate tool based on the user's question. "
    "If no tool is needed, reply directly.\n\n"
    "IMPORTANT: When you need to use a tool, you must ONLY respond with "
    "the exact JSON object format below, nothing else:\n"
    "{\n"
    '    "tool": "tool-name",\n'
    '    "arguments": {\n'
    '        "argument-name": "value"\n'
    "    }\n"
    "}\n\n"
    "After receiving a tool's response:\n"
    "1. Transform the raw data into a natural, conversational response\n"
    "2. Keep responses concise but informative\n"
    "3. Focus on the most relevant information\n"
    "4. Use appropriate context from the user's question\n"
    "5. Avoid simply repeating the raw data\n\n"
    "Please use only the tools that are explicitly defined above."
)

Tools are usually defined using a JSON schema similar to the following example. This tool is called top_song and its function is to get the most popular song played on a radio station:

{
   "name": "top_song",
   "description": "Get the most popular song played on a radio station.",
   "parameters": {
     "type": "object",
     "properties": {
        "sign": {
           "type": "string",
           "description": "The call sign for the radio station for which you want the most popular song. Example calls signs are WZPZ and WKRP."
           }
         },
     "required": ["sign"]
     }
}

With the system prompt configured, you can run the chat loop as much as needed, alternating between invoking the hosted LLM and calling the tools powered by the MCP server. You can use packages such as SageMaker Boto3, the Amazon SageMaker Python SDK, or another third-party library, such as LiteLLM or similar.

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": "What is the most played song on WZPZ?"}
]

# Invoke the LLM, then identify if there is a tool call in the message it returns
result = sagemaker_client.invoke_endpoint(...)
tool_name, tool_args = parse_tools_from_llm_response(result)
# Execute the tool call on the MCP server
result = await self.session.call_tool(tool_name, tool_args)
# Parse the output from the tool called, then invoke the endpoint again
result = sagemaker_client.invoke_endpoint(...)

A model hosted on SageMaker doesn’t support function calling natively in its API. This means that you will need to parse the content of the response using a regular expression or similar methods:

import re, json

def parse_tools_from_llm_response(message: str) -> tuple:
    # Extract the first JSON object embedded in the LLM response
    match = re.search(r'(?s){(?:[^{}]|(?:{[^{}]*}))*}', message)
    content = json.loads(match.group(0))
    tool_name = content["tool"]
    tool_arguments = content["arguments"]
    return tool_name, tool_arguments

After no more tool requests are available in the LLM response, you can consider the content as the final answer and return it to the user. Finally, you close the stream to finalize interactions with the MCP server.
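Putting those pieces together, the following condensed sketch shows one way the loop could look. It assumes a SageMaker real-time endpoint whose container accepts and returns an OpenAI-style messages payload; the exact request body and response parsing depend on the model container you deploy.

import json
import boto3

smr = boto3.client("sagemaker-runtime")

async def chat(session, endpoint_name, messages, max_turns=5):
    answer = ""
    for _ in range(max_turns):
        # Invoke the hosted LLM with the running conversation
        resp = smr.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"messages": messages, "max_tokens": 512}),
        )
        # Response shape depends on the serving container (OpenAI-style assumed here)
        answer = json.loads(resp["Body"].read())["choices"][0]["message"]["content"]

        try:
            tool_name, tool_args = parse_tools_from_llm_response(answer)
        except Exception:
            return answer  # no tool call found: treat this as the final answer

        # Execute the tool on the MCP server and feed the result back to the LLM
        tool_result = await session.call_tool(tool_name, tool_args)
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": f"Tool {tool_name} returned: {tool_result}"})
    return answer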

Implement a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

To demonstrate the power of MCP with SageMaker AI, let’s explore a loan underwriting system that processes applications through three specialized personas:

  • Loan officer – Summarizes the application
  • Credit analyst – Evaluates creditworthiness
  • Risk manager – Makes final approval or denial decisions

We will walk you through these personas through the following architecture for a loan processing workflow using MCP. The code for this solution is available in the following GitHub repo.

In the architecture, the MCP client and server are running on EC2 instances and the LLM is hosted on SageMaker endpoints. The workflow consists of the following steps:

  1. The user enters a prompt with loan input details such as name, age, income, and credit score.
  2. The request is routed to the loan MCP server by the MCP client.
  3. The loan parser sends output as input to the credit analyzer MCP server.
  4. The credit analyzer sends output as input to the risk manager MCP server.
  5. The final prompt is processed by the LLM and sent back to the MCP client to provide the output to the user.

You can use LangGraph’s built-in human-in-the-loop feature at the points where the credit analyzer hands its output to the risk manager and where the risk manager returns the final decision. For this post, we have not implemented this workflow.

Each persona is powered by an agent with LLMs hosted by SageMaker AI, and its logic is deployed using a dedicated MCP server. Our MCP server implementation in the example uses the Awesome MCP FastAPI project, but you can also build a standard MCP server implementation following the original Anthropic package and specification. The dedicated MCP servers in this example run in local Docker containers, but they can be quickly deployed to the AWS Cloud using services such as AWS Fargate. To run the servers locally, use the following code:

uvicorn servers.loan_parser.main:app --port 8002
uvicorn servers.credit_analyzer.main:app --port 8003
uvicorn servers.risk_assessor.main:app --port 8004

When the servers are running, you can start creating the agents and the workflow. You will need to deploy the LLM endpoint by running the following command:

python deploy_sm_endpoint.py

This example uses LangGraph, a common open source framework for agentic workflows, designed to support seamless integration of language models into complex workflows and applications. Workflows are represented as graphs whose nodes are actions, tools, or model queries, and whose edges carry the flow of information between them. LangGraph provides a structured yet dynamic way to execute tasks, making it simple to write AI applications involving natural language understanding, automation, and decision-making.

In our example, the first agent we create is the loan officer:

from langgraph.graph import StateGraph

# Create the graph and register the loan officer as the first node
graph = StateGraph(State)
graph.add_node("LoanParser", call_mcp_server(PARSER_URL))

The goal of the loan officer (or LoanParser) is to perform the tasks defined in its MCP server. To call the MCP server, we can use the httpx library:

import httpx
from langchain_core.runnables import RunnableLambda

def call_mcp_server(url):
    async def fn(state: State) -> State:
        print(f"[DEBUG] Calling {url} with payload:", state["output"])
        async with httpx.AsyncClient() as client:
            response = await client.post(url, json=state["output"])
            response.raise_for_status()
            return {"output": response.json()}
    return RunnableLambda(fn).with_config({"run_name": f"CallMCP::{url.split(':')[2]}"})
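
The scripts/run_pipeline.py file wires the remaining nodes and runs the graph. Continuing from the snippets above, a hedged sketch of that wiring might look like the following; the node names and URLs follow the uvicorn commands shown earlier, but the repository code may differ.

import asyncio
from langgraph.graph import END, START

ANALYZER_URL = "http://localhost:8003/process"
RISK_URL = "http://localhost:8004/process"

# Register the remaining personas as graph nodes
graph.add_node("CreditAnalyzer", call_mcp_server(ANALYZER_URL))
graph.add_node("RiskAssessor", call_mcp_server(RISK_URL))

# Chain the personas in the order described above
graph.add_edge(START, "LoanParser")
graph.add_edge("LoanParser", "CreditAnalyzer")
graph.add_edge("CreditAnalyzer", "RiskAssessor")
graph.add_edge("RiskAssessor", END)

app = graph.compile()
# loan_input follows the structure shown later in this post
final_state = asyncio.run(app.ainvoke(loan_input))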

With that done, we can run the workflow using the scripts/run_pipeline.py file. We configured the repository to be traceable using LangSmith. If you have correctly configured the environment variables, you will see a trace of the run in your LangSmith UI.

Configuring LangSmith UI for experiment tracing is optional. You can skip this step.

After running python3 scripts/run_pipeline.py, you should see the following in your terminal or log.

We use the following input:

loan_input = {
    "output": {
        "name": "Jane Doe",
        "age": 35,
        "income": 2000000,
        "loan_amount": 4500000,
        "credit_score": 820,
        "existing_liabilities": 15000,
        "purpose": "Home Renovation"
    }
}

We get the following output:

[DEBUG] Calling http://localhost:8002/process with payload: {'name': 'Jane Doe', 'age': 35, 'income': 2000000, 'loan_amount': 4500000, 'credit_score': 820, 'existing_liabilities': 15000, 'purpose': 'Home Renovation'}

[DEBUG] Calling http://localhost:8003/process with payload: {'summary': 'Jane Doe, 35 years old, applying for a loan of $4,500,000 to renovate her home. She has an income of $2,000,000, a credit score of 820, and existing liabilities of $150,000.', 'fields': {'name': 'Jane Doe', 'age': 35, 'income': 2000000.0, 'loan_amount': 4500000.0, 'credit_score': 820, 'existing_liabilities': 15000.0, 'purpose': 'Home Renovation'}}

[DEBUG] Calling http://localhost:8004/process with payload: {'credit_assessment': 'High', 'score': 'High', 'fields': {'name': 'Jane Doe', 'age': 35, 'income': 2000000.0, 'loan_amount': 4500000.0, 'credit_score': 820, 'existing_liabilities': 15000.0, 'purpose': 'Home Renovation'}}

Final result: {'decision': 'Approved', 'reasoning': 'Decision: Approved'}

Tracing with the LangSmith UI

LangSmith traces contain the full inputs and outputs of each step of the application, giving users full visibility into their agent. This step is optional and applies only if you have configured LangSmith to trace the MCP loan processing application. Go to the LangSmith login page and log in to the LangSmith UI. Then choose Tracing Project and select LoanUnderwriter. You should see a detailed flow of each MCP server, such as the loan parser, credit analyzer, and risk assessor, with the inputs and outputs produced by the LLM, as shown in the following screenshot.

Conclusion

The MCP proposed by Anthropic offers a standardized way of connecting FMs to data sources, and now you can use this capability with SageMaker AI. In this post, we presented an example of combining the power of SageMaker AI and MCP to build an application that offers a new perspective on loan underwriting through specialized roles and automated workflows.

Organizations can now streamline their AI integration processes by minimizing custom integrations and maintenance bottlenecks. As AI continues to evolve, the ability to securely connect models to your organization’s critical systems will become increasingly valuable. Whether you’re looking to transform loan processing, streamline operations, or gain deeper business insights, the SageMaker AI and MCP integration provides a flexible foundation for your next AI innovation.

The following are some examples of what you can build by connecting your SageMaker AI models to MCP servers:

  • A multi-agent loan processing system that coordinates between different roles and data sources
  • A developer productivity assistant that integrates with enterprise systems and tools
  • A machine learning workflow orchestrator that manages complex, multi-step processes while maintaining context across operations

If you’re looking for ways to optimize your SageMaker AI deployment, learn more about how to unlock cost savings with the new scale down to zero feature in SageMaker Inference, as well as how to unlock cost-effective AI inference using Amazon Bedrock serverless capabilities with a SageMaker trained model. For application development, refer to Build agentic AI solutions with DeepSeek-R1, CrewAI, and Amazon SageMaker AI.


About the Authors

Mona Mona currently works as a Sr World Wide Gen AI Specialist Solutions Architect at Amazon focusing on Gen AI Solutions. She was a Lead Generative AI specialist in Google Public Sector at Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and is a co-author of a research paper on CORD19 Neural Search, which won the Best Research Paper award at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.

Davide Gallitelli is a Senior Worldwide Specialist Solutions Architect for Generative AI at AWS, where he empowers global enterprises to harness the transformative power of AI. Based in Europe but with a worldwide scope, Davide partners with organizations across industries to architect custom AI agents that solve complex business challenges using AWS ML stack. He is particularly passionate about democratizing AI technologies and enabling teams to build practical, scalable solutions that drive organizational transformation.

Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

Giuseppe Zappia is a Principal Solutions Architect at AWS, with over 20 years of experience in full stack software development, distributed systems design, and cloud architecture. In his spare time, he enjoys playing video games, programming, watching sports, and building things.

Read More

Automate document translation and standardization with Amazon Bedrock and Amazon Translate

Automate document translation and standardization with Amazon Bedrock and Amazon Translate

Multinational organizations face the complex challenge of effectively managing a workforce and operations across different countries, cultures, and languages. Maintaining consistency and alignment across these global operations can be difficult, especially when it comes to updating and sharing business documents and processes. Delays or miscommunications can lead to productivity losses, operational inefficiencies, or potential business disruptions. Accurate and timely sharing of translated documents across the organization is an important step in making sure that employees have access to the latest information in their native language.

In this post, we show how you can automate language localization through translating documents using Amazon Web Services (AWS). The solution combines Amazon Bedrock and AWS Serverless technologies, a suite of fully managed event-driven services for running code, managing data, and integrating applications—all without managing servers. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Stability AI. Amazon Bedrock is accessible through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

The solution uses AWS Step Functions to orchestrate the translation of the source document into the specified language (English, French, or Spanish) using AWS Lambda functions to call Amazon Translate. Amazon Translate currently supports translation of 75 languages; three have been chosen for this demo. The workflow then uses Amazon Bedrock to refine the translation and create natural, flowing content.

Building this solution, shown in the following diagram, on AWS fully managed and serverless technologies eliminates the need to operate infrastructure, manage capacity, or invest significant funding upfront to evaluate the business benefit. The compute and AI services used to process documents for translation run only on demand, resulting in a consumption-based billing model where you only pay for your use.

Solution architecture

The document translation and standardization workflow consists of the following steps:

  1. The user uploads their source document requiring translation to the input Amazon Simple Storage Service (Amazon S3) bucket. The bucket has three folders: English, French, and Spanish. The user uploads the source document to the folder that matches the current language of the document. This can be done using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or third-party tools that allow them to navigate an S3 bucket as a file system.
  2. The presence of a new document in the input bucket initiates the Step Functions workflow using Amazon S3 Event Notifications.
  3. The first step of this workflow is an AWS Lambda function that retrieves the source document from the bucket, saves it in temporary storage, and calls the Amazon Translate TranslateDocument API, passing the source document as the document to translate (see the sketch following this list).
  4. The second step of the workflow is another Lambda function that queries Amazon Bedrock using a pre-generated prompt with the translated source document included as the target. This prompt instructs Amazon Bedrock to perform a transcreation check on the document content. This validates that the intent, style, and tone of the document is maintained. The final version of the document is now saved in the output S3 bucket.
  5. The last step of the workflow uses Amazon Simple Notification Service (Amazon SNS) to notify an SNS topic of the outcome of the workflow (success or failure). This will send an email to the subscribers to the topic.
  6. The user downloads their translated document from the output S3 bucket. This can be done using the console, the AWS CLI, or third-party tools that allow them to navigate an S3 bucket as a file system.
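
To make steps 3 and 4 more concrete, the following is a minimal sketch of the two Lambda steps, assuming plain-text content for readability (the sample repository works with .docx files); the model ID, prompt text, and function name are illustrative placeholders rather than the code used in the solution.

import json
import boto3

translate = boto3.client("translate")
bedrock = boto3.client("bedrock-runtime")

def translate_and_refine(document_text: str, source_lang: str, target_lang: str) -> str:
    # Step 3: translate the document with the Amazon Translate TranslateDocument API
    translated = translate.translate_document(
        Document={"Content": document_text.encode("utf-8"), "ContentType": "text/plain"},
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )["TranslatedDocument"]["Content"].decode("utf-8")

    # Step 4: ask a Bedrock model to perform the transcreation check on the translation
    prompt = ("Review the following translated text and adjust it so that the intent, "
              "style, and tone are appropriate for a business audience:\n\n" + translated)
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]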

This solution is available on GitHub and provides the AWS Cloud Development Kit (AWS CDK) code to deploy in your own AWS account. The AWS CDK is an open source software development framework for defining cloud infrastructure as code (IaC) and provisioning it through AWS CloudFormation. This provides an automated deployment process for your AWS account.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deployment steps

To deploy this solution into your own AWS account:

  1. Open your code editor of choice and authenticate to your AWS account. Instructions for linking to Visual Studio Code can be found in Authentication and access for the AWS Toolkit for Visual Studio Code.
  2. Clone the solution from the GitHub repository:
    git clone https://github.com/aws-samples/sample-document-standardization-with-bedrock-and-translate.git

  3. Follow the deployment instructions in the repository README file.
  4. After the stack is deployed, go to the S3 console. Navigate to the S3 bucket that was created — docstandardizationstack-inputbucket. Upload the word_template.docx file that’s included in the repository. English, French, and Spanish folders will automatically be created.

Folders that are created when word_template.docx is uploaded

  5. Navigate to the Amazon Simple Notification Service (Amazon SNS) console and create a subscription to the topic DocStandardizationStack-ResultTopic created by the stack. After it’s created, make sure that you confirm subscription to the topic before testing the workflow by choosing the confirm subscription link in the automated email you receive from SNS.

SNS subscription creation

  6. After you have subscribed to the topic, you can test the workflow.

Language translation

To test the workflow, upload a .docx file to the folder corresponding to the document’s original language. For example, if you’re uploading a document that was written in English, this document should be uploaded to the English folder. If you don’t have a .docx file available, you can use the tone_test.docx file that’s included in the repository.
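
If you prefer to upload the test file programmatically, here is a minimal sketch using Boto3; the bucket name is a placeholder, so substitute the input bucket created by the stack in your account.

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="tone_test.docx",
    Bucket="docstandardizationstack-inputbucket-example",  # placeholder; use your stack's input bucket
    Key="English/tone_test.docx",                          # folder matches the document's source language
)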

The Step Functions state machine will start after your document is uploaded. Translated versions of your source input document will be added to the other language folders that were created when you uploaded word_template.docx. In this example, we uploaded a document in English and the document was translated into both Spanish and French.

document translated into Spanish

document translated into French

Transcreation process

The translated documents are then processed using Amazon Bedrock. Amazon Bedrock reviews the documents’ intent, style and tone for use in a business setting. You can customize the output tone and style by modifying the Amazon Bedrock prompt to match your specific requirements. The final documents are added to the output S3 bucket with a suffix of _corrected, and each document is added to the folder that corresponds to the document’s language. The output bucket has the same format as the input bucket, with a separate folder created for each language.

Folders in the output bucket

The prompt used to instruct the generative AI model for the transcreation task has been designed to produce consistent and valid adjustments. It includes specific instructions, covering both what type of changes are expected from the model and rules to define boundaries that control adjustments. You can adjust this prompt if required to change the outcome of the document processing workflow.

The final documents will have a suffix of _corrected.

English corrected document

French corrected document

Spanish corrected document

When the documents have been processed, you will receive an SNS notification. You will be able to download the processed documents from the S3 bucket DocStandardizationStack-OutputBucket.

Clean up

To delete the deployed resources, run the command cdk destroy in your terminal, or use the CloudFormation console to delete the CloudFormation stack DocStandardizationStack.

Conclusion

In this post, we explored how to automate the translation of business documents using AWS AI and serverless technologies. Through this automated translation process, companies can improve communication, consistency, and alignment across their global operations, making sure that employees can access the information they need when they need it. As organizations continue to expand their global footprint, tools like this will become increasingly important for maintaining a cohesive and informed workforce, no matter where in the world they might be located. By embracing the capabilities of AWS, companies can focus on their core business objectives without creating additional IT infrastructure overhead.

Bonne traduction!

Feliz traducción!

Happy translating!

Further reading

The solution includes a zero-shot prompt with specific instructions directing what the LLM should and should not modify in the source document. If you want to iterate on the provided prompt to adjust your results, you can use the Amazon Bedrock Prompt Management tool to quickly edit and test the impact of changes to the prompt text.

For additional examples using Amazon Bedrock and other services, visit the AWS Workshops page to get started.


About the Authors

author photoNadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.

author photo Steve Bell is a Senior Solutions Architect at AWS based in Amsterdam, Netherlands. He helps enterprise organizations navigate the complexities of migration, modernization and multicloud strategy. Outside of work he loves walking his labrador, Lily, and practicing his amateur BBQ skills.

Read More

Autonomous mortgage processing using Amazon Bedrock Data Automation and Amazon Bedrock Agents

Autonomous mortgage processing using Amazon Bedrock Data Automation and Amazon Bedrock Agents

Mortgage processing is a complex, document-heavy workflow that demands accuracy, efficiency, and compliance. Traditional mortgage operations rely on manual review, rule-based automation, and disparate systems, often leading to delays, errors, and a poor customer experience. Recent industry surveys indicate that only about half of borrowers express satisfaction with the mortgage process, with traditional banks trailing non-bank lenders in borrower satisfaction. This gap in satisfaction level is largely attributed to the manual, error-prone nature of traditional mortgage processing, where delays, inconsistencies, and fragmented workflows create frustration for borrowers and impact overall experience.

In this post, we introduce agentic automatic mortgage approval, a next-generation sample solution that uses autonomous AI agents powered by Amazon Bedrock Agents and Amazon Bedrock Data Automation. These agents orchestrate the entire mortgage approval process—intelligently verifying documents, assessing risk, and making data-driven decisions with minimal human intervention. By automating complex workflows, businesses can accelerate approvals, minimize errors, and provide consistency while enhancing scalability and compliance.

The following video shows this agentic automation in action—enabling smarter, faster, and more reliable mortgage processing at scale.

Why agentic IDP?

Agentic intelligent document processing (IDP) revolutionizes document workflows by driving efficiency and autonomy. It automates tasks with precision, enabling systems to extract, classify, and process information while identifying and correcting errors in real time.

Agentic IDP goes beyond simple extraction by grasping context and intent, adding deeper insights to documents that fuel smarter decision-making. Powered by Amazon Bedrock Data Automation, it adapts to changing document formats and data sources, further reducing manual work.

Built for speed and scale, agentic IDP processes high volumes of documents quickly, reducing delays and optimizing critical business operations. Seamlessly integrating with AI agents and enterprise systems, it automates complex workflows, cutting operational costs and freeing teams to focus on high-value strategic initiatives.

IDP in mortgage processing

Mortgage processing involves multiple steps, including loan origination, document verification, underwriting, and closing, with each step requiring significant manual effort. These steps are often disjointed, leading to slow processing times (weeks instead of minutes), high operational costs (manual document reviews), and an increased risk of human errors and fraud. Organizations face numerous technical challenges when manually managing document-intensive workflows, as depicted in the following diagram.

These challenges include:

  • Document overload – Mortgage applications require verification of extensive documentation, including tax records, income statements, property appraisals, and legal agreements. For example, a single mortgage application might require manual review and cross-validation of hundreds of pages of tax returns, pay stubs, bank statements, and legal documents, consuming significant time and resources.
  • Data entry errors – Manual processing introduces inconsistencies, inaccuracies, and missing information during data entry. Incorrect transcription of applicant income from W-2 forms or misinterpreting property appraisal data can lead to miscalculated loan eligibility, requiring costly corrections and rework.
  • Delays in decision-making – Backlogs resulting from manual review processes extend processing times and negatively affect borrower satisfaction. A lender manually reviewing income verification and credit documentation might take several weeks to work through their backlog, causing delays that result in lost opportunities or frustrated applicants who turn to competitors.
  • Regulatory compliance complexity – Evolving mortgage industry regulations introduce complexity into underwriting and verification procedures. Changes in lending regulations, such as new mandatory disclosures or updated income verification guidelines, can require extensive manual updates to processes, leading to increased processing times, higher operational costs, and elevated error rates from manual data entry.

These challenges underscore the need for automation to enhance efficiency, speed, and accuracy for both lenders and mortgage borrowers.

Solution: Agentic workflows in mortgage processing

The following solution is self-contained, and the applicant interacts only with the mortgage applicant supervisor agent to upload documents and check or retrieve the application status. The following diagram illustrates the workflow.

The workflow consists of the following steps:

  1. The applicant uploads documents to apply for a mortgage.
  2. The supervisor agent confirms receipt of the documents. The applicant can view and retrieve the application status.
  3. The underwriter updates the status of the application and sends approval documents to the applicant.

At the core of the agentic mortgage processing workflow is a supervisor agent that orchestrates the entire workflow, manages sub-agents, and makes final decisions. Amazon Bedrock Agents is a capability within Amazon Bedrock that lets developers create AI-powered assistants capable of understanding user requests and executing complex tasks. These agents can break down requests into logical steps, interact with external tools and data sources, and use AI models to reason and take actions. They maintain conversation context while securely connecting to various APIs and AWS services, making them ideal for tasks like customer service automation, data analysis, and business process automation.

The supervisor agent intelligently delegates tasks to specialized sub-agents while maintaining the right balance between automated processing and human supervision. By aggregating insights and data from various sub-agents, the supervisor agent applies established business rules and risk criteria to either automatically approve qualifying loans or flag complex cases for human review, improving both efficiency and accuracy in the mortgage underwriting process.
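
For orientation, the following is a hedged sketch of how an application might invoke a supervisor agent through the Amazon Bedrock Agents runtime API; the agent ID, alias ID, and prompt are placeholders rather than values from this solution.

import uuid
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_supervisor_agent(prompt: str, agent_id: str, agent_alias_id: str) -> str:
    # Invoke the agent and assemble the streamed completion chunks
    response = bedrock_agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),
        inputText=prompt,
    )
    completion = ""
    for event in response["completion"]:
        if "chunk" in event:
            completion += event["chunk"]["bytes"].decode("utf-8")
    return completion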

In the following sections, we explore the sub-agents in more detail.

Data extraction agent

The data extraction agent uses Amazon Bedrock Data Automation to extract critical insights from mortgage application packages, including pay stubs, W-2 forms, bank statements, and identity documents. Amazon Bedrock Data Automation is a generative AI-powered capability of Amazon Bedrock that streamlines the development of generative AI applications and automates workflows involving documents, images, audio, and videos. The data extraction agent helps make sure that the validation, compliance, and decision-making agent receives accurate and structured data, enabling efficient validation, regulatory compliance, and informed decision-making. The following diagram illustrates the workflow.

The extraction workflow is designed to automate the process of extracting data from application packages efficiently. The workflow includes the following steps:

  1. The supervisor agent assigns the extraction task to the data extraction agent.
  2. The data extraction agent invokes Amazon Bedrock Data Automation to parse and extract applicant details from the application packages.
  3. The extracted application information is stored in the extracted documents Amazon Simple Storage Service (Amazon S3) bucket.
  4. The Amazon Bedrock Data Automation invocation response is sent back to the extraction agent.

Validation agent

The validation agent cross-checks extracted data with external resources such as IRS tax records and credit reports, flagging discrepancies for review. It flags inconsistencies such as doctored PDFs and low credit scores, and it also calculates the debt-to-income (DTI) ratio, checks the loan-to-value (LTV) limit, and performs an employment stability check. The following diagram illustrates the workflow.

The process consists of the following steps:

  1. The supervisor agent assigns the validation task to the validation agent.
  2. The validation agent retrieves the applicant details stored in the extracted documents S3 bucket.
  3. The applicant details are cross-checked against third-party resources, such as tax records and credit reports, to validate the applicant’s information.
  4. The third-party validated details are used by the validation agent to generate a status.
  5. The validation agent sends the validation status to the supervisor agent.

Compliance agent

The compliance agent verifies that the extracted and validated data adheres to regulatory requirements, reducing the risk of compliance violations. It validates against lending rules. For example, loans are approved only if the borrower’s DTI ratio is below 43%, making sure they can manage monthly payments, or applications with a credit score below 620 are declined, whereas higher scores qualify for better interest rates. The following diagram illustrates the compliance agent workflow.

The workflow includes the following steps:

  1. The supervisor agent assigns the compliance validation task to the compliance agent.
  2. The compliance agent retrieves the applicant details stored in the extracted documents S3 bucket.
  3. The applicant details are validated against mortgage processing rules.
  4. The compliance agent calculates the applicant’s DTI ratio, applying corporate policy and lending rules to the application.
  5. The compliance agent uses the validated details to generate a status.
  6. The compliance agent sends the compliance status to the supervisor agent.
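
As a simple illustration of the rules described above, here is a minimal sketch of the DTI and credit score checks; the 43% and 620 thresholds come from this post, while the function name and return shape are illustrative only.

# Minimal sketch of the lending-rule checks described above; real policies are more involved.
def check_lending_rules(monthly_debt: float, monthly_income: float, credit_score: int) -> dict:
    dti_ratio = monthly_debt / monthly_income
    if credit_score < 620:
        status = "Declined"   # below the minimum credit score
    elif dti_ratio >= 0.43:
        status = "Declined"   # DTI ratio too high to manage monthly payments
    else:
        status = "Approved"
    return {"status": status, "dti_ratio": round(dti_ratio, 2), "credit_score": credit_score}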

Underwriting agent

The underwriting agent generates an underwriting document for the underwriter to review. The underwriting agent workflow streamlines the process of reviewing and finalizing underwriting documents, as shown in the following diagram.

The workflow consists of the following steps:

  1. The supervisor agent assigns the underwriting task to the underwriting agent.
  2. The underwriting agent verifies the information and creates a draft of the underwriting document.
  3. The draft document is sent to an underwriter for review.
  4. Updates from the underwriter are sent back to the underwriting agent.

RACI matrix

The collaboration between intelligent agents and human professionals is key to efficiency and accountability. To illustrate this, we’ve crafted a RACI (Responsible, Accountable, Consulted, and Informed) matrix that maps out how responsibilities might be shared between AI-driven agents and human roles, such as compliance officers and the underwriting officer. This mapping serves as a conceptual guide, offering a glimpse into how agentic automation can enhance human expertise, optimize workflows, and provide clear accountability. Real-world implementations will differ based on an organization’s unique structure and operational needs.

The matrix components are as follows:

  • R: Responsible (executes the work)
  • A: Accountable (owns approval authority and outcomes)
  • C: Consulted (provides input)
  • I: Informed (kept informed of progress/status)

End-to-end IDP automation architecture for mortgage processing

The following architecture diagram illustrates the AWS services powering the solution and outlines the end-to-end user journey, showcasing how each component interacts within the workflow.

In Steps 1 and 2, the process begins when a user accesses the web UI in their browser, with Amazon CloudFront maintaining low-latency content delivery worldwide. In Step 3, Amazon Cognito handles user authentication, and AWS WAF provides security against malicious threats. Steps 4 and 5 show authenticated users interacting with the web application to upload required documentation to Amazon S3. The uploaded documents in Amazon S3 trigger Amazon EventBridge, which initiates the Amazon Bedrock Data Automation workflow for document processing and information extraction.

In Step 6, AWS AppSync manages user interactions, enabling real-time communication with AWS Lambda and Amazon DynamoDB for data storage and retrieval. Steps 7, 8, and 9 demonstrate how the Amazon Bedrock multi-agent collaboration framework comes into play, where the supervisor agent orchestrates the workflow between specialized AI agents. The verification agent verifies uploaded documents, manages data collection, and uses action groups to compute DTI ratios and generate an application summary, which is stored in Amazon S3.

Step 10 shows how the validation agent (broker assistant) evaluates the application based on predefined business criteria and automatically generates a pre-approval letter, streamlining loan processing with minimal human intervention. Throughout the workflow in Step 11, Amazon CloudWatch provides comprehensive monitoring, logging, and real-time visibility into all system components, maintaining operational reliability and performance tracking.

This fully agentic and automated architecture enhances mortgage processing by improving efficiency, reducing errors, and accelerating approvals, ultimately delivering a faster, smarter, and more scalable lending experience.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this solution. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

Deploy the solution

To get started, clone the GitHub repository and follow the instructions in the README to deploy the solution using AWS CloudFormation. The deployment steps offer clear guidance on how to build and deploy the solution. After the solution is deployed, you can proceed with the following instructions:

  1. After you provision all the stacks, navigate to the stack AutoLoanAPPwebsitewafstackXXXXX on the AWS CloudFormation console.
  2. On the Outputs tab, locate the CloudFront endpoint for the application UI.

You can also get the endpoint using the AWS Command Line Interface (AWS CLI) and the following command:

aws cloudformation describe-stacks \
  --stack-name $(aws cloudformation list-stacks \
    --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE | jq -r '.StackSummaries[] | select(.StackName | startswith("AutoLoanAPPwebsitewafstack")) | .StackName') \
  --query 'Stacks[0].Outputs[?OutputKey==`configwebsitedistributiondomain`].OutputValue' \
  --output text
  3. Open https://<domain_name>.cloudfront.net in a new browser tab.

You should see the application login page.

  4. Create an Amazon Cognito user in the user pool to access the application.
  5. Sign in using your Amazon Cognito email and password credentials to access the application.

Monitoring and troubleshooting

Consider the following best practices:

  • Monitor stack creation and update status using the AWS CloudFormation console or AWS CLI
  • Monitor Amazon Bedrock model invocation metrics using CloudWatch:
    • InvokeModel requests and latency
    • Throttling exceptions
    • 4xx and 5xx errors
  • Check Amazon CloudTrail for API invocations and errors
  • Check CloudWatch for solution-specific errors and logs:

aws cloudformation describe-stacks --stack-name <stack-name>
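
In addition to the console, you can pull Bedrock invocation metrics programmatically. The following is a hedged sketch using CloudWatch; the AWS/Bedrock namespace, Invocations metric, and ModelId dimension reflect commonly emitted Bedrock runtime metrics, but verify them against the metrics available in your account and Region.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

def bedrock_invocations_last_24h(model_id: str) -> float:
    # Sum the hourly invocation counts for the given model over the last day
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName="Invocations",
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        StartTime=end - timedelta(hours=24),
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in response["Datapoints"])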

Clean up

To avoid incurring additional costs after testing this solution, complete the following steps:

  1. Delete the relevant stacks from the AWS CloudFormation console.
  2. Verify the S3 buckets are empty before deleting them.

Conclusion

The automated loan application sample solution demonstrates how you can use Amazon Bedrock Agents and Amazon Bedrock Data Automation to transform mortgage loan processing workflows. Beyond mortgage processing, you can adapt this solution to streamline claims processing or address other complex document-processing scenarios. By using intelligent automation, this solution significantly reduces manual effort, shortens processing times, and accelerates decision-making. Automating these intricate workflows helps organizations achieve greater operational efficiency, maintain consistent compliance with evolving regulations, and deliver exceptional customer experiences.

The sample solution is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features using GitHub pull requests. Browse to the GitHub repository to explore the code, click watch to be notified of new releases, and check the README for the latest documentation updates.

As next steps, we recommend assessing your current document processing workflows to identify areas suitable for automation using Amazon Bedrock Agents and Amazon Bedrock Data Automation.

For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the repository.


About the Authors

Wrick Talukdar is a Tech Lead – Generative AI Specialist focused on Intelligent Document Processing. He leads machine learning initiatives and projects across business domains, leveraging multimodal AI, generative models, computer vision, and natural language processing. He speaks at conferences such as AWS re:Invent, IEEE, Consumer Technology Society (CTSoc), YouTube webinars, and other industry conferences like CERAWEEK and ADIPEC. In his free time, he enjoys writing and birding photography.

Jady Liu is a Senior AI/ML Solutions Architect on the AWS GenAI Labs team based in Los Angeles, CA. With over a decade of experience in the technology sector, she has worked across diverse technologies and held multiple roles. Passionate about generative AI, she collaborates with major clients across industries to achieve their business goals by developing scalable, resilient, and cost-effective generative AI solutions on AWS. Outside of work, she enjoys traveling to explore wineries and distilleries.

Farshad Bidanjiri is a Solutions Architect focused on helping startups build scalable, cloud-native solutions. With over a decade of IT experience, he specializes in container orchestration and Kubernetes implementations. As a passionate advocate for generative AI, he helps emerging companies leverage cutting-edge AI technologies to drive innovation and growth.

Keith Mascarenhas leads worldwide GTM strategy for Generative AI at AWS, developing enterprise use cases and adoption frameworks for Amazon Bedrock. Prior to this, he drove AI/ML solutions and product growth at AWS, and held key roles in Business Development, Solution Consulting and Architecture across Analytics, CX and Information Security.

Jessie-Lee Fry is a Product and Go-to Market (GTM) Strategy executive specializing in Generative AI and Machine Learning, with over 15 years of global leadership experience in Strategy, Product, Customer success, Business Development, Business Transformation and Strategic Partnerships. Jessie has defined and delivered a broad range of products and cross-industry go-to-market strategies, driving business growth while maneuvering market complexities and C-Suite customer groups. In her current role, Jessie and her team focus on helping AWS customers adopt Amazon Bedrock at scale with enterprise use cases and adoption frameworks, meeting customers where they are in their Generative AI Journey.

Raj Jayaraman is a Senior Generative AI Solutions Architect at AWS, bringing over a decade of experience in helping customers extract valuable insights from data. Specializing in AWS AI and generative AI solutions, Raj’s expertise lies in transforming business solutions through the strategic application of AWS’s AI capabilities, ensuring customers can harness the full potential of generative AI in their unique contexts. With a strong background in guiding customers across industries in adopting AWS Analytics and Business Intelligence services, Raj now focuses on assisting organizations in their generative AI journey—from initial demonstrations to proof of concepts and ultimately to production implementations.

Read More

Laws, norms, and ethics for AI in health

Laws, norms, and ethics for AI in health

Illustrated headshots of Peter Lee, Vardit Ravitsky, Laura Adams, and Dr. Roxana Daneshjou.

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, Laura Adams, Vardit Ravitsky, and Dr. Roxana Daneshjou, experts at the intersection of healthcare, ethics, and technology, join Lee to discuss the responsible implementation of AI in healthcare. Adams, a strategic advisor at the National Academy of Medicine leading the development of a national AI code of conduct, shares her initial curiosity and skepticism of generative AI and then her recognition of the technology as a transformative tool requiring new governance approaches. Ravitsky, bioethicist and president and CEO of The Hastings Center for Bioethics, examines how AI is reshaping healthcare relationships and the need for bioethics to proactively guide implementation. Daneshjou, a Stanford physician-scientist bridging dermatology, biomedical data science, and AI, discusses her work on identifying, understanding, and mitigating bias in AI systems and also leveraging AI to better serve patient needs.


Learn more

Health Care Artificial Intelligence Code of Conduct (Adams)
Project homepage | National Academy of Medicine

Artificial Intelligence in Health, Health Care, and Biomedical Science: An AI Code of Conduct Principles and Commitments Discussion Draft (Adams)
National Academy of Medicine commentary paper | April 2024

Ethics of AI in Health and Biomedical Research (Ravitsky)
The Hastings Center for Bioethics

Ethics in Patient Preferences for Artificial Intelligence–Drafted Responses to Electronic Messages (Ravitsky)
Publication | March 2025

Daneshjou Lab (Daneshjou)
Lab homepage

Red teaming ChatGPT in medicine to yield real-world insights on model behavior (Daneshjou)
Publication | March 2025

Dermatologists’ Perspectives and Usage of Large Language Models in Practice: An Exploratory Survey (Daneshjou)
Publication | October 2024

Deep learning-aided decision support for diagnosis of skin disease across skin tones (Daneshjou)
Publication | February 2024

Large language models propagate race-based medicine (Daneshjou)
Publication | October 2023

Disparities in dermatology AI performance on a diverse, curated clinical image set (Daneshjou)
Publication | August 2022

Transcript

[MUSIC]    

[BOOK PASSAGE]  

PETER LEE: “… This is the moment for broad, thoughtful consideration of how to ensure maximal safety and also maximum access. Like any medical tool, AI needs those guardrails to keep patients as safe as possible. But it’s a tricky balance: those safety measures must not mean that the great advantages that we document in this book end up unavailable to many who could benefit from them. One of the most exciting aspects of this moment is that the new AI could accelerate healthcare in a direction that is better for patients, all patients, and providers as well—if they have access.” 

[END OF BOOK PASSAGE]    

[THEME MUSIC]    

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.    

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?     

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. 


[THEME MUSIC FADES] 

The passage I read at the top there is from Chapter 9, “Safety First.” 

One needs only to look at examples such as laws mandating seatbelts in cars and, more recently, internet regulation to know that policy and oversight are often playing catch-up with emerging technologies. When we were writing our book, Carey, Zak, and I didn’t claim that putting frameworks in place to allow for innovation and adoption while prioritizing inclusiveness and protecting patients from hallucination and other harms would be easy. In fact, in our writing, we posed more questions than answers in the hopes of highlighting the complexities at hand and supporting constructive discussion and action in this space.  

In this episode, I’m pleased to welcome three experts who have been thinking deeply about these matters: Laura Adams, Vardit Ravitsky, and Dr. Roxana Daneshjou.  

Laura is an expert in AI, digital health, and human-centered care. As a senior advisor at the National Academy of Medicine, or NAM, she guides strategy for the academy’s science and technology portfolio and leads the Artificial Intelligence Code of Conduct national initiative.  

Vardit is president and CEO of The Hastings Center for Bioethics, a bioethics and health policy institute. She leads research projects funded by the National Institutes of Health, is a member of the committee developing the National Academy of Medicine’s AI Code of Conduct, and is a senior lecturer at Harvard Medical School.  

Roxana is a board-certified dermatologist and an assistant professor of both dermatology and biomedical data science at Stanford University. Roxana is among the world’s thought leaders in AI, healthcare, and medicine, thanks in part to groundbreaking work on AI biases and trustworthiness. 

One of the good fortunes I’ve had in my career is the chance to work with both Laura and Vardit, mainly through our joint work with the National Academy of Medicine. They’re both incredibly thoughtful and decisive leaders working very hard to help the world of healthcare—and healthcare regulators—come to grips with generative AI. And over the past few years, I’ve become an avid reader of all of Roxana’s research papers. Her work is highly technical, super influential but also informative in a way that spans computer science, medicine, bioethics, and law.  

These three leaders—one from the medical establishment, one from the bioethics field, and the third from clinical research—provide insights into three incredibly important dimensions of the issues surrounding regulations, norms, and ethics of AI in medicine. 

[TRANSITION MUSIC] 

Here is my interview with Laura Adams: 

LEE: Laura, I’m just incredibly honored and excited that you’re joining us here today, so welcome. 

ADAMS: Thank you, Peter, my pleasure. Excited to be here. 

LEE: So, Laura, you know, I’ve been working with you at the NAM for a while, and you are a strategic advisor at the NAM. But I think a lot of our listeners might not know too much about the National Academy of Medicine and then, within the National Academy of Medicine, what a strategic advisor does.  

So why don’t we start there. You know, how would you explain to a person’s mother or father what the National Academy of Medicine is? 

ADAMS: Sure. National Academy was formed more than 50 years ago. It was formed by the federal government, but it is not the federal government. It was formed as an independent body to advise the nation and the federal government on issues of science and primarily technology-related issues, as well.  

So with that 50 years, some probably know of the National Academy of Medicine when it was the Institute of Medicine and produced such publications as To Err is Human and Crossing the Quality Chasm, both of which were seminal publications that I think had a dramatic impact on quality, safety, and how we saw our healthcare system and what we saw in terms of its potential. 

LEE: So now, for your role within NAM, what does the senior advisor do? What do you do?  

ADAMS: What I do there is in the course of leading the AI Code of Conduct project, my role there was in framing the vision for that project, really understanding what did we want it to do, what impact did we want it to make.  

So for example, some thought that it might be that we wanted everyone to use our code of conduct. And my advice on that was let’s use this as a touchstone. We want people to think about their own codes of conduct for their use of AI. That’s a valuable exercise, to decide what you value, what your aspirations are.  

I also then do a lot of the field alignment around that work. So I probably did 50 talks last year—conference presentations, webinars, different things—where the code of conduct was presented so that the awareness could be raised around it so people could see the practicality of using that tool.   
 
Especially the six commitments that were based on the idea of complex adaptive systems simple rules, where we could recall those in the heat of decision-making around AI, in the heat of application, or even in the planning and strategic thinking around it.  

LEE: All right, we’re going to want to really break into a lot of details here.  

But I would just like to rewind the clock a little bit and talk about your first encounters with AI. And there’s sort of, I guess, two eras. There’s the era of AI and machine learning before ChatGPT, before the generative AI era, and then afterwards.  

Before the era of generative AI, what was your relationship with the idea of artificial intelligence? Was it a big part of your role and something you thought about, or was it just one of many technologies that you considered? 

ADAMS: It was one of many.  

LEE: Yeah.  

ADAMS: Watching it help us evolve from predictive analytics to predictive AI, which of course I was fascinated by the fact that it could use structured and unstructured data, that it could learn from its own processes. These things were really quite remarkable, but my sense about it was that it was one of many.  

We were looking at telemedicine. We were looking at [a] variety of other things, particularly wearables and things that were affecting and empowering patients to take better care of themselves and take more … have more agency around their own care. So I saw it as one of many.  

And then the world changed in 2022, changed dramatically. 

LEE: [LAUGHS] OK. Right. OK, so November 2022, ChatGPT. Later in the spring of 2023, GPT-4. And so, you know, what were your first encounters, and what were you feeling? What were you experiencing? 

ADAMS: At the time, I was curious, and I thought, I think I’m seeing four things here that make this way different.  

And one was, and it proved to be true over time, the speed with which this evolved. And I was watching it evolve very, very quickly and thinking, this is almost, this is kind of mind blowing how fast this is getting better.  

And then this idea that, you know, we could scale this. As we were watching the early work with ambient listening, I was working with a group of physicians that were lamenting the cost and the unavailability of scribes. They wanted to use scribes. And I’m thinking, We don’t have to incur the cost of that. We don’t have to struggle with the unavailability of that type of … someone in the workforce.   

And then I started watching the ubiquity, and I thought, Oh, my gosh, this is unlike any other technology that we’ve seen. Because with electronic health records, for example, it’s had its place, but it was over here. We had another digital technology, maybe telehealth, over here. This was one, and I thought, there will be no aspect of healthcare that will be left untouched by AI. That blew my mind.  

LEE: Yeah. 

ADAMS: And then I think the last thing was the democratization. And I realized: Wow, anyone with a smartphone has access to the most powerful large language models in the world.  
 
And I thought, This, to me, is a revolution in cheap expertise. Those were the things that really began to stun me, and I just knew that we were in a way different era. 

LEE: It’s interesting that you first talked about ambient listening. Why was that of particular initial interest to you specifically? 

ADAMS: It was because one of the things that we were putting together in our code of conduct, which began pre-generative AI, was the idea that we wanted to renew the moral well-being and the sense of shared purpose to the healthcare workforce. That’s one of the six principles.  

And I knew that the cognitive burden was becoming unbearable. When we came out of COVID, it was such a huge wake-up call to understand exactly what was going on at that point of care and how challenging it had become because information overload is astonishing in and of itself. And that idea that we have so much in the way of documentation that needed to be done and how much of a clinician’s time was taken up doing that rather than doing the thing that they went into the profession to do. And that was interact with people, that was to heal, that was to develop human connection that had a healing effect, and they just … so much of the time was taken away from that activity.  

I also looked at it and because I studied diffusion of innovations theory and understand what causes something to move rapidly across a social system and get adopted, it has to have a clear relative advantage. It has to be compatible with the way that processes work. 

So I didn’t see that this was going to be a hugely disruptive activity to workflow, which is a challenge of most digital tools, is that they’re designed without that sense of, how does this impact the workflow? And then I just thought that it was going to be a front runner in adoption, and it might then start to create that tsunami, that wave of interest in this, and I don’t think I was wrong. 

LEE: I have to ask you, because I’ve been asking every guest, there must have been moments early on in the encounter with generative AI where you felt doubt or skepticism. Is that true, or did you immediately think, Wow, this is something very important? 

ADAMS: No, I did feel doubt and skepticism.  

My understanding tells me of it, and told me of it in the very beginning, that this is trained on the internet with all of its flaws. When we think about AI, we think about it being very futuristic, but it’s trained on data from the past. I’m well aware of how flawed that data, how biased that data is, mostly men, mostly white men, when we think about it during a certain age grouping of.  

So I knew that we had inherent massive flaws in the training data and that concerned me. I saw other things about it that also concerned me. I saw that … its difficulty in beginning to use it and govern it effectively.  

You really do have to put a good governance system in if you’re going to put this into a care delivery system. And I began to worry about widening a digital divide that already was a chasm. And that was between those well-resourced, usually urban, hospitals and health systems that are serving the well-resourced, and the inner-city hospital in Chicago or the rural hospital in Nebraska or the Mississippi community health center.  

LEE: Yes. So I think this skepticism about technology, new technologies, in healthcare is very well-earned. So you’ve zeroed in on this issue of technology where oftentimes we hope it’ll reduce or eliminate biases but actually seems to oftentimes have the opposite effect.  

And maybe this is a good segue then into this really super-important national effort that you’re leading on the AI code of conduct. Because in a way, I think those failures of the past and even just the idea—the promise—that technology should make a doctor or a nurse’s job easier, not harder, even that oftentimes seems not to have panned out in the way that we hope.  

And then there’s, of course, the well-known issue of hallucinations or of mistakes being made. You know, how did those things inform this effort around a code of conduct, and why a code of conduct? 

ADAMS: Those things weighed heavily on me as the initiative leader because I had been deeply involved in the spread of electronic health records, not really knowing and understanding that electronic health records were going to have the effect that they had on the providers that use them.  

Looking back now, I think that there could have been design changes, but we probably didn’t have as much involvement of providers in the design. And in some cases, we did. We just didn’t understand what it would take to work it into their workflows.  

So I wanted to be sure that the code of conduct took into consideration and made explicit some of the things that I believe would have helped us had we had those guardrails or those guidelines explicit for us.  

And those are things like our first one is to protect and advance human health and connection.  

We also wanted to see things about openly sharing and monitoring because we know that for this particular technology, it’s emergent. We’re going to have to do a much better job at understanding whether what we’re doing works and works in the real world.  

So the reason for a code of conduct was we knew that … the good news, when the “here comes AI and it’s barreling toward us,” the good news was that everybody was putting together guidelines, frameworks, and principle sets. The bad news was same. That everybody was putting together their own guideline, principle, and framework set.  

And I thought back to how much I struggled when I worked in the world of health information exchange and built a statewide health information exchange and then turned to try to exchange that data across the nation and realized that we had a patchwork of privacy laws and regulations across the states; it was extremely costly to try to move data.

And I thought we actually need, in addition to data interoperability, we need governance interoperability, where we can begin to agree on a core set of principles that will more easily allow us to move ahead and achieve some of the potential and the vision that we have for AI if we are not working with a patchwork of different guidelines, principles, and frameworks.  

So that was the impetus behind it. Of course, we want it to be used as a touchstone, not for everybody to wholesale adopt what we've said.

LEE: Right. 

ADAMS: We want people to think about this and think deeply about it. 

LEE: Yeah, Laura, I always am impressed with just how humble you are. You were indeed, you know, one of the prime instigators of the digitization of health records leading to electronic health record systems. And I don’t think you need to feel bad about that. That was a tremendous advance. I mean, moving a fifth of the US economy to be digital, I think, is significant.  

Also, our listeners might want to know that you led something called the Rhode Island Quality Institute, which was arguably the most important early example that set a pattern for how and why health data might actually lead very directly to improvements in human health at a statewide or population level. And so I think your struggles and frustrations on, you know, how to expand that nationwide are really, really informative.

So let’s get into what these principles are, you know, what’s in the code of conduct.  

ADAMS: Yeah, the six simple rules were derived out of a larger set of principles that we pulled together. And the origin of all of this was that we did a fairly extensive landscape review. We looked at at least 60 different sets of these principles, guidelines, and frameworks. We looked for areas of real convergence. We looked for areas where there were inconsistencies. And we looked for out-and-out gaps.

The out-and-out gaps that we saw at the time were things like a dearth of references to advancing human health as the priority. Also monitoring post-implementation. So at the time, we were watching these evolve, and we thought these were very significant gaps. The impact on the environment was a significant gap as well. And so when we pulled that together, we developed a set of principles and cross-walked those with learning health system principles.

And then once we got that, we wanted to distill it down into a set of commitments that we knew people could find accessible. And we published that draft set of principles last year. And we have a new publication coming out in the coming months that will be the revised set of principles and code commitments, reflecting the feedback we got because we took this out publicly.

So we opened it up for public comment once we did the draft last year. Again, almost all of the times that I spoke about this came with an invitation for feedback, and the conversations that we had with people shaped it. And it is in no way, shape, or form a final code of conduct, this set of principles and commitments, because we see this as dynamic. But what we also knew was that we wanted to build this with a super solid foundation, a set of immutables, the things that don't change with the vicissitudes or the whims of this or the whims of that. We wanted those things that were absolutely foundational.

LEE: Yeah, so we’ll provide a link to the documents that describe the current state of this, but can we give an example of one or two of these principles and one or two of the commitments? 

ADAMS: Sure. I’ve mentioned the “protect and advance human health and connection” as the primary aim. We also want to ensure the equitable distribution of risks and benefits, and that equitable distribution of risks and benefits is something that I was referring to earlier about when I see well-resourced organizations. And one that’s particularly important to me is engaging people as partners with agency at every stage of the AI lifecycle.  

That one matters because this one talks about and speaks to the idea that we want to begin bringing in those that are affected by AI, those on whom AI is used, into the early development and conceptualization of what we want this new tool, this new application, to do. So that includes the providers that use it, the patients. And we find that when we include them—the ethicists that come along with that—we develop much better applications, much more targeted applications that do what we intend them to do in a more precise way.  

The other thing about that engaging with agency: by agency we mean that that person, that participant, can affect the decisions and they can affect the outcome. So it isn't that they're sort of a token person coming to the table and we'll allow you to tell your story; this is an active participant.

We practiced what we preached when we developed the code of conduct, and we brought patient advocates in to work with us on the development of this, work with us on our applications, that first layer down of what the applications would look like, which is coming out in this new paper.  

We really wanted that component of this because I’m also seeing that patients are definitely not passive users of this, and they’re having an agency moment, let’s say, with generative AI because they’ve discovered a new capacity to gain information, to—in many ways—claim some autonomy in all of this.   

And I think that there is a disruption underway right now, a big one that has been in the works for many years, but it feels to me like AI may be the tipping point for that disruption of the delivery system as we know it.  

LEE: Right. I think it just exudes sort of inclusivity and thoughtfulness in the whole process. During this process, were there surprises, things that you didn’t expect? Things about AI technology itself that surprised you? 

ADAMS: The surprises that came out of this process for me, one of them was that I surprised myself. We were working on the commentary paper, and Steven Lin from Stanford had strong input into that paper. And when we looked at what we thought was missing, he said, "Let's make sure we have the environmental impact." And I said, "Oh, really, Steven, we really want to think about things that are more directly aligned with health," which I couldn't believe came out of my own mouth. [LAUGHTER]

And Steven, without saying, “Do you hear yourself?” I mean, I think he could have said that. But he was more diplomatic than that. And he persisted a bit longer and said, “I think it’s actually the greatest threat to human health.” And I said, “Of course, you’re right.” [LAUGHS] 

But that was surprising and embarrassing for me. And it was eye-opening in that even when I thought I had understood the gaps and was using this as a touchstone, I was still learning. That's what struck me: the learning that took place, and how rapidly that learning was happening among the people involved in this.

The other thing that was surprising for me was the degree to which patients became vastly facile with using it, to the extent that it helped them begin to, again, build their own capacity.

The #PatientsUseAI from Dave deBronkart—watch that one. This is more revolutionary than we think. And so I watched that, the swell of that happening, and it sort of shocked me because I was envisioning this as, again, a tool for use in the clinical setting.  

LEE: Yeah. OK, so we’re running now towards the end of our time together. And I always like to end our conversations with a more provocative topic. And I thought for you, I’d like to use the very difficult word regulation.  

And when I think about the book that Carey, Zak, and I wrote, we have a chapter on regulation, but honestly, we didn’t have ideas. We couldn’t understand how this would be regulated. And so we just defaulted to publishing a conversation about regulation with GPT-4. And in a way, I think … I don’t know that I or my coauthors were satisfied with that.  

In your mind, where do we stand two years later now when we think about the need or not to regulate AI, particularly in its application to healthcare, and where has the thinking evolved to? 

ADAMS: There are two big differences that I see in that time that has elapsed. And the first one is we have understood the insufficiency of simply making sure that AI-enabled devices are safe prior to going out into implementation settings.  

We recognize now that there’s got to be this whole other aspect of regulation and assurance that these things are functioning as intended and we have the capacity to do that in the point of care type of setting. So that’s been one of the major ones. The other thing is how wickedly challenging it is to regulate generative AI.  

I think one of the most provocative and exciting articles that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider?

Should it be treated in the sense that it's got to have a certain amount of training and a foundation, and it's got to pass certain tests? It has to demonstrate that it's improving and keeping up with current literature. Should it be responsible for mistakes that it makes in some way, shape, or form? Does it have to report its performance?

And I’m thinking, what a provocative idea …  

LEE: Right. 

ADAMS: … but it’s worth considering. I chair the Global Opportunities Group for a regulatory and innovation AI sandbox in the UK. And we’re hard at work thinking about, how do you regulate something as unfamiliar and untamed, really, as generative AI?  

So I'd like to see us think more about this idea of sandboxes, more about this idea of whether we should be just completely rethinking the way that we regulate. To me, that's where the new ideas will come from because the danger, of course, in regulating in the old way … first of all, we haven't kept up over time, even with predictive AI; even with pre-generative AI, we haven't kept up.

And what worries me about continuing on in that same vein is that we will stifle innovation … 

LEE: Yes. 

ADAMS: … and we won’t protect from potential harms. Nobody wants an AI Chernobyl, nobody.  

LEE: Right.
 
ADAMS: But I worry that if we use those old tools on the new applications, we will not only fail to regulate well, we'll also stifle innovation. And when I see all of the promise coming out of this for things that we thought were unimaginable, that would be a tragedy.

LEE: You know, I think the other reflection I’ve had on this is the consumer aspect of it, because I think a lot of our current regulatory frameworks are geared towards experts using the technology.  

ADAMS: Yes. 

LEE: So when you have a medical device, you know you have a trained, board-certified doctor or licensed nurse using the technology. But when you’re putting things in the hands of a consumer, I think somehow the surface area of risk seems wider to me. And so I think that’s another thing that somehow our current regulatory concepts aren’t really ready for. 

ADAMS: I would agree with that. I think there are a few things to consider, vis-a-vis that. One is that this revolution of patients using it is unstoppable. So it will happen. But we're considering a project here at the National Academy about patients using AI and thinking about: let's explore all the different facets of that. Let's understand, what does safe usage look like? What might we do to help this new development enhance the patient-provider relationship and not erode it, as we saw with the "Don't confuse your Google search with my medical degree" type of approach?

Thinking about: how does it change the identity of the provider? How does it … what can we do to safely build a container in which patients can use this without giving them the sense that it’s being taken away, or that … because I just don’t see that happening. I don’t think they’re going to let it happen.  

That, to me, feels extremely important for us to explore all the dimensions of that. And that is one project that I hope to be following on to the AI Code of Conduct and applying the code of conduct principles with that project.  

LEE: Well, Laura, thank you again for joining us. And thank you even more for your tremendous national, even international, leadership on really helping mobilize the greatest institutions in a diverse way to fully confront the realities of AI in healthcare. I think it’s tremendously important work. 

ADAMS: Peter, thank you for having me. This has been an absolute pleasure. 

[TRANSITION MUSIC]  

I’ve had the opportunity to watch Laura in action as she leads a national effort to define an AI code of conduct. And our conversation today has only heightened my admiration for her as a national leader.  

What impresses me is Laura's recognition that technology adoption in healthcare has had a checkered history and, furthermore, has oftentimes not equally accommodated the huge diversity of stakeholders that are affected.

The concept of an AI code of conduct seems straightforward in some ways, but you can tell that every word in the emerging code has been chosen carefully. And Laura’s tireless engagement traveling to virtually every corner of the United States, as well as to several other countries, shows real dedication. 

And now here’s my conversation with Vardit Ravitsky:

LEE: Vardit, thank you so much for joining. 

RAVITSKY: It’s a real pleasure. I’m honored that you invited me. 

LEE: You know, we’ve been lucky. We’ve had a few chances to interact and work together within the National Academy of Medicine and so on. But I think for many of the normal subscribers to the Microsoft Research Podcast, they might not know what The Hastings Center for Bioethics is and then what you as the leader of The Hastings Center do every day. So I’d like to start there, first off, with what is The Hastings Center? 

RAVITSKY: Mostly, we’re a research center. We’ve been around for more than 55 years. And we’re considered one of the organizations that actually founded the field known today as bioethics, which is the field that explores the policy implications, the ethical, social issues in biomedicine. So we look at how biotechnology is rolled out; we look at issues of equity, of access to care. We look at issues at the end of life, the beginning of life, how our natural environment impacts our health. Any aspect of the delivery of healthcare, the design of the healthcare system, and biomedical research leading to all this. Any aspect that has an ethical implication is something that we’re happy to explore.  

We try to have broad conversations with many, many stakeholders, people from different disciplines, in order to come up with guidelines and recommendations that would actually help patients, families, communities.  

We also have an editorial department. We publish academic journals. We publish a blog. And we do a lot of public engagement activities—webinars, in-person events. So, you know, we just try to promote the thinking of the public and of experts on the ethical aspects of health and healthcare.  

LEE: One thing I’ve been impressed with, with your work and the work of The Hastings Center is it really confronts big questions but also gets into a lot of practical detail. And so we’ll get there. But before that just a little bit about you then. The way I like to ask this question is: how do you explain to your parents what you do every day? [LAUGHS] 

RAVITSKY: Funny that you brought my parents into this, Peter, because I come from a family of philosophers. Everybody in my family is in humanities, in academia. When I was 18, I thought that that was the only profession [LAUGHTER] and that I absolutely had to become a philosopher, or else what else can you do with your life? 

I think being a bioethicist is really about, on one hand, keeping an eye constantly on the science as it evolves. When a new scientific development occurs, you have to understand what’s happening so that you can translate that outside of science. So if we can now make a gamete from a skin cell so that babies will be created differently, you have to understand how that’s done, what that means, and how to talk about it.  

The second eye you keep on the ethics literature: what ethical frameworks, theories, and principles have we developed over the last decades that are now relevant to this technology? So you're really a bridge between science and biomedicine on one hand and the humanities on the other.

LEE: OK. So let’s shift to AI. And here I’d like to start with a kind of an origin story because I’m assuming before generative AI and ChatGPT became widely known and available, you must have had some contact with ideas in data science, in machine learning, and, you know, in the concept of AI before ChatGPT. Is that true? And, you know, what were some of those early encounters like for you? 

RAVITSKY: The earlier issues that I heard people talk about in the field were really around diagnostics and reading images and, Ooh, it looks like machines could perform better than radiologists. And, Oh, what if women preferred that their mammographies be read by these algorithms? And, Does that threaten us clinicians? Because it sort of highlights our limitations and weaknesses as, you know, the weakness of the human eye and the human brain.  

So there were early concerns about, will this outperform the human and potentially take away our jobs? Will it impact our authority with patients? What about de-skilling clinicians or radiologists or any type of diagnostician losing the ability … some abilities that they’ve had historically because machines take over? So those were the early-day reflections and interestingly some of them remain even now with generative AI.  

All those issues of the standing of a clinician, and what sets us apart, and will a machine ever be able to perform completely autonomously, and what about empathy, and what about relationships? Much of that translated later on into the, you know, more advanced technology. 

LEE: I find it interesting that you use words like our and we to implicitly refer to humans, homo sapiens, to human beings. And so do you see a fundamental distinction, a hard distinction that separates humans from machines? Or, you know, how … if there are replacements of some human capabilities or some things that human beings do by machines, you know, how do you think about that? 

RAVITSKY: Ooh, you’re really pushing hard on the philosopher in me here. [LAUGHTER] I’ve read books and heard lectures by those who think that the line is blurred, and I don’t buy that. I think there’s a clear line between human and machine.  

I think the issue of AGI—of artificial general intelligence—and will that amount to consciousness … again, it’s such a profound, deep philosophical challenge that I think it would take a lot of conceptual work to get there. So how do we define consciousness? How do we define morality? The way it stands now, I look into the future without being a technologist, without being an AI developer, and I think, maybe I hope, that the line will remain clear. That there’s something about humanity that is irreplaceable.  

But I’m also remembering that Immanuel Kant, the famous German philosopher, when he talked about what it means to be a part of the moral universe, what it means to be a moral agent, he talked about rationality and the ability to implement what he called the categorical imperative. And he said that would apply to any creature, not just humans. 

And that’s so interesting. It’s always fascinated me that so many centuries ago, he said such a progressive thing. 

LEE: That’s amazing, yeah. 

RAVITSKY: It is amazing because I often, as an ethicist, I don’t just ask myself, What makes us human? I ask myself, What makes us worthy of moral respect? What makes us holders of rights? What gives us special status in the universe that other creatures don’t have? And I know this has been challenged by people like Peter Singer who say [that] some animals should have the same respect. “And what about fetuses and what about people in a coma?” I know the landscape is very fraught.  

But the notion of what makes humans deserving of special moral treatment to me is the core question of ethics. And if we think that it’s some capacities that give us this respect, that make us hold that status, then maybe it goes beyond human. So it doesn’t mean that the machine is human, but maybe at [a] certain point, these machines will deserve a certain type of moral respect that … it’s hard for us right now to think of a machine as deserving that respect. That I can see. 

But completely collapsing the distinction between human and machine? I don’t think so, and I hope not. 

LEE: Yeah. Well, you know, in a way I think it’s easier to entertain this type of conversation post-ChatGPT. And so now, you know, what was your first personal encounter with what we now call generative AI, and what went through your mind as you had that first encounter? 

RAVITSKY: No one’s ever asked me this before, Peter. It almost feels exposing to share your first encounter. [LAUGHTER]  

So I just logged on, and I asked a simple question, but it was an ethical question. I framed an ethical dilemma because I thought, if I ask it to plan a trip, like all my friends already did, it’s less interesting to me.  

And within seconds, a pretty thoughtful, surprisingly nuanced analysis was kind of trickling down my screen, and I was shocked. I was really taken aback. I was almost sad because I think my whole life I was hoping that only humans can generate this kind of thinking using moral and ethical terms.  

And then I started tweaking my question, and I asked for specific philosophical approaches to this. And it just kept surprising me in how well it performed.  

So I literally had to catch my breath and, you know, sit down and go, OK, this is a new world, something very important and potentially scary is happening here. How is this going to impact my teaching? How is this going to impact my writing? How is this going to impact health? Like, it was really a moment of shock. 

LEE: I think the first time I had the privilege of meeting you, I heard you speak and share some of your initial framing of how, you know, how to think about the potential ethical implications of AI and the human impacts of AI in the future. Keeping in mind that people listening to this podcast will tend to be technologists and computer scientists as well as some medical educators and practicing clinicians, you know, what would you like them to know or understand most about your thoughts? 

RAVITSKY: I think from early on, Peter, I’ve been an advocate in favor of bioethics as a field positioning itself to be a facilitator of implementing AI. I think on one hand, if we remain the naysayers as we have been regarding other technologies, we will become irrelevant. Because it’s happening, it’s happening fast, we have to keep our eye on the ball, and not ask, “Should we do it?” But rather ask, “How should we do it?” 

And one of the reasons that bioethics is going to be such a critical player is that the stakes are so high. The risk of making a mistake in diagnostics is literally life and death; the risk of breaches of privacy that would lead to patients losing trust and refusing to use these tools; the risk of clinicians feeling overwhelmed and replaceable. The risks are just too high.  

And therefore, creating guardrails, creating frameworks with principles that sensitize us to the ethical aspects, that is critically important for AI and health to succeed. And I’m saying it as someone who wants it very badly to succeed. 

LEE: You are actually seeing a lot of healthcare organizations adopting and deploying AI. Has any aspect of that been surprising to you? Have you expected it to be happening faster or slower or unfolding in a different way? 

RAVITSKY: One thing that surprises me is how it seems to be isolated. Different systems, different institutions making their own, you know, decisions about what to acquire and how to implement. I’m not seeing consistency. And I’m not even seeing anybody at a higher level collecting all the information about who’s buying and implementing what under what types of principles and what are their outcomes? What are they seeing?  

It seems to be just siloed and happening everywhere. And I wish we collected all this data, even about how the decision is made at the executive level to buy a certain tool, to implement it, where, why, by whom. So that’s one thing that surprised me.  

The speed is not surprising me because it really solves problems that healthcare systems have been struggling with. What seems to be one of the more popular uses, and again, you know this better than I do, is the help with scribes taking notes, ambient recording. This seems to be really desired because of the burnout that clinicians face around this whole issue of note taking.

And it’s also seen as a way to allow clinicians to do more human interaction, you know, … 

LEE: Right.  

RAVITSKY: … look at the patient, talk to the patient, … 

LEE: Yep. 

RAVITSKY: … listen, rather than focus on the screen. We’ve all sat across the desk with a doctor that never looks at us because they only look at the screen. So there’s a real problem here, and there’s a real solution and therefore it’s hitting the ground quickly.  

But what’s surprising to me is how many places don’t think that it’s their responsibility to inform patients that this is happening. So some places do; some places don’t. And to me, this is a fundamental ethical issue of patient autonomy and empowerment. And it’s also pragmatically the fear of a crisis of trust. People don’t like being recorded without their consent. Surprise, surprise.  

LEE: Mm-hmm. Yeah, yeah. 

RAVITSKY: People worry about such a recording of a very private conversation that they consider to be confidential, such a recording ending up in the wrong hands or being shared externally or going to a commercial entity. People care; patients care.  

So what is our ethical responsibility to tell them? And what is the institutional responsibility to implement these wonderful tools? I'm not against them, I'm totally in favor—implement these great tools in a way that respects long-standing ethical principles of informed consent, transparency, accountability for, you know, a change in practice? And, you know, bottom line: patients' right to know what's happening in their care.

LEE: You actually recently had a paper in a medical journal that touched on an aspect of this, which I think was not with scribes, but with notes, you know, …

RAVITSKY: Yep. 

LEE: … that doctors would send to patients. And in fact, in previous episodes of this podcast, we actually talked to both the technology developers of that type of feature as well as doctors who were using that feature. And in fact, even in those previous conversations, there was the question, “Well, what does the patient need to know about how this note was put together?” So you and your coauthors had a very interesting recent paper about this. 

RAVITSKY: Yeah, so the trigger for the paper was that patients seemed to really like being able to send electronic messages to clinicians.  

LEE: Yes. 

RAVITSKY: We email and text all day long. Why not in health, right? People are used to communicating in that way. It’s efficient; it’s fast.   

So we asked ourselves, “Wait, what if an AI tool writes the response?” Because again, this is a huge burden on clinicians, and it’s a real issue of burnout.  

We surveyed hundreds of respondents, and basically what we discovered is that there was a statistically significant difference in their level of satisfaction between when they got an answer from a human clinician and when they got an answer, again an electronic message, from AI.

And it turns out that they preferred the messages written by AI. They were longer, more detailed, even conveyed more empathy. You know, AI has all the time in the world [LAUGHS] to write you a text. It’s not rushing to the next one. 

But then when we disclosed who wrote the message, they were less satisfied when they were told it was AI.  

So the ethical question that that raises is the following: if your only metric is patient satisfaction, the solution is to respond using AI but not tell them that. 

Now when we compared telling them that it was AI or human or not telling them anything, their satisfaction remained high, which means that if they were not told anything, they probably assumed that it was a human clinician writing, because their satisfaction for human clinician or no disclosure was the same. 

So basically, if we say nothing and just send back an AI-generated response, they will be more satisfied because the response is nicer, but they won’t be put off by the fact that it was written by AI. And therefore, hey presto, optimal satisfaction. But we challenge that, and we say, it’s not just about satisfaction. 

It’s about long-term trust. It’s about your right to know. It’s about empowering you to make decisions about how you want to communicate.  

So we push back against this notion that we’re just there to optimize patient satisfaction, and we bring in broader ethical considerations that say, “No, patients need to know.” If it’s not the norm yet to get your message from AI, … 

LEE: Yeah. 

RAVITSKY: … they should know that this is happening. And I think, Peter, that maybe we’re in a transition period. 

It could be that in two years, maybe less than that, most of our communication will come back from AI, and we will just take it for granted …  

LEE: Right. 

RAVITSKY: … that that’s the case. And at that point, maybe disclosure is not necessary because many, many surveys will show us that patients assume that, and therefore they are informed. But at this point in time, when it’s transition and it’s not the norm yet, I firmly think that ethics requires that we inform patients. 
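
To make the comparison described above concrete, here is a minimal sketch, with invented placeholder ratings rather than the study's actual data, of how satisfaction under the three disclosure conditions might be compared statistically:

```python
# Illustrative only: toy satisfaction ratings (1-5) under three disclosure
# conditions, not the study's actual data.
from scipy import stats

ai_disclosed    = [3, 4, 3, 3, 4, 3, 4, 3]  # told the reply came from AI
human_disclosed = [4, 5, 4, 4, 5, 4, 5, 4]  # told it came from a clinician
no_disclosure   = [4, 5, 4, 5, 4, 4, 5, 4]  # authorship not disclosed

# One-way ANOVA across the three conditions: a small p-value suggests that
# the disclosure condition affects satisfaction.
f_stat, p_value = stats.f_oneway(ai_disclosed, human_disclosed, no_disclosure)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

# Pairwise check mirroring the reported finding: AI-disclosed vs. human-disclosed.
t_stat, p_pair = stats.ttest_ind(ai_disclosed, human_disclosed)
print(f"AI vs. human disclosure: t = {t_stat:.2f}, p = {p_pair:.3f}")
```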

LEE: Let me push on this a little bit because I think this final point that you just made is, I think is so interesting. Does it matter what kind of information is coming from a human or AI? Is there a time when patients will have different expectations for different types of information from their doctors? 

RAVITSKY: I think, Peter, that you’re asking the right question because it’s more nuanced. And these are the kinds of empirical questions that we will be exploring in the coming months and years. Our recent paper showed that there was no difference regarding the content. If the message was about what we call the “serious” matter or a less “serious” matter, the preferences were the same. But we didn’t go deep enough into that. That would require a different type of design of study. And you just said, you know, there are different types of information. We need to categorize them.  

LEE: Yeah.  

RAVITSKY: What types of information and what degree of impact on your life? Is it a life-and-death piece of information? Is it a quality-of-life piece of information? How central is it to your care and to your thinking? So all of that would have to be mapped out so that we can design these studies.  

But you know, you pushed back in that way, and I want to push back in a different direction that to me is more fundamental and philosophical. How much do we know now? You know, I keep saying, oh, patients deserve a chance for informed consent, … 

LEE: Right.  

RAVITSKY: … and they need to be empowered to make decisions. And if they don’t want that tool used in their care, then they should be able to say, “No.” Really? Is that the world we live in now? [LAUGHTER] Do I have access to the black box that is my doctor’s brain? Do I know how they performed on this procedure in the last year? 

Do I know whether they're tired? Do I know if they're up to speed on the literature with this? We already deal with black boxes, except they're not AI. And I think the evidence is emerging that AI outperforms humans in so many of these tasks.

So my pushback is, are we seeing AI exceptionalism in the sense that if it's AI, Huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool, versus, you know, the rate of human error in medicine is awful. People don't know the numbers. The number of annual deaths attributed to medical error is outrageous.

So why are we so focused on informed consent and empowerment regarding implementation of AI and less in other contexts? Is it just because it's new? Is it because it is some sort of existential threat, …

LEE: Yep, yeah. 

RAVITSKY: … not just a matter of risk? I don’t know the answer, but I don’t want us to suffer from AI exceptionalism, and I don’t want us to hold AI to such a high standard that we won’t be able to benefit from it. Whereas, again, we’re dealing with black boxes already in medicine. 

LEE: Just to stay on this topic, though, one more question, which is, maybe, almost silly in how hypothetical it is. If instead of email, it were a Teams call or a Zoom call, doctor-patient, except that the doctor is not the real doctor, but it’s a perfect replica of the doctor designed to basically fool the patient that this is the real human being and having that interaction. Does that change the bioethical considerations at all? 

RAVITSKY: I think it does because it’s always a question of, are we really misleading? Now if you get a text message in an environment that, you know, people know AI is already integrated to some extent, maybe not your grandmother, but the younger generation is aware of this implementation, then maybe you can say, “Hmm. It was implied. I didn’t mean to mislead the patient.” 

If the patient thinks they’re talking to a clinician, and they’re seeing, like—what if it’s not you now, Peter? What if I’m talking to an avatar [LAUGHS] or some representation of you? Would I feel that I was misled in recording this podcast? Yeah, I would. Because you really gave me good reasons to assume that it was you.  

So it’s something deeper about trust, I think. And it touches on the notion of empathy. A lot of literature is being developed now on the issue of: what will remain the purview of the human clinician? What are humans good for [LAUGHS] when AI is so successful and especially in medicine?  

And if we see that the text messages are read as conveying more empathy and more care and more attention, and if we then move to a visual representation, facial expressions that convey more empathy, we really need to take a hard look at what we mean by care. What about then the robots, right, that we can touch, that can hug us?  

I think we’re really pushing the frontier of what we mean by human interaction, human connectedness, care, and empathy. This will be a lot of material for philosophers to ask themselves the fundamental question you asked me at first: what does it mean to be human?  

But this time, what does it mean to be two humans together and to have a connection?  

And if we can really be replaced in the sense that patients will feel more satisfied, more heard, more cared for, do we have ethical grounds for resisting that? And if so, why?  

You’re really going deep here into the conceptual questions, but bioethics is already looking at that. 

LEE: Vardit, it’s just always amazing to talk to you. The incredible span of what you think about from those fundamental philosophical questions all the way to the actual nitty gritty, like, you know, what parts of an email from a doctor to a patient should be marked as AI. I think that span is just incredible and incredibly needed and useful today. So thank you for all that you do. 

RAVITSKY: Thank you so much for inviting me. 

[TRANSITION MUSIC] 

The field of bioethics, and this is my take, is all about the adoption of disruptive new technologies into biomedical research and healthcare. And Vardit is able to explain this with such clarity. I think one of the reasons that AI has been challenging for many people is that its use spans the gamut from the nuts and bolts of how and when to disclose to patients that AI is being used to craft an email, all the way to, what does it mean to be a human being caring for another human?  

What I learned from the conversation with Vardit is that bioethicists are confronting head-on the issue of AI in medicine and not with an eye towards restricting it, but recognizing that the technology is real, it’s arrived, and needs to be harnessed now for maximum benefit.  

And so now, here’s my conversation with Dr. Roxana Daneshjou: 

LEE: Roxana, I’m just thrilled to have this chance to chat with you. 

ROXANA DANESHJOU: Thank you so much for having me on today. I’m looking forward to our conversation. 

LEE: In Microsoft Research, of course, you know, we think about healthcare and biomedicine a lot, but I think there’s less of an understanding from our audience what people actually do in their day-to-day work. And of course, you have such a broad background, both on the science side with a PhD and on the medical side. So what’s your typical work week like? 

DANESHJOU: I spend basically 90% of my time working on running my research lab and doing research on how AI interacts with medicine, how we can implement it to fix the pain points in medicine, and how we can do that in a fair and responsible way. And 10% of my time, I am in clinic. So I am a practicing dermatologist at Stanford, and I see patients half a day a week.

LEE: And your background, it’s very interesting. There’s always been these MD-PhDs in the world, but somehow, especially right now with what’s happening in AI, people like you have become suddenly extremely important because it suddenly has become so important to be able to combine these two things. Did you have any inkling about that when you started, let’s say, on your PhD work? 

DANESHJOU: So I would say that during my—[LAUGHS] because I was in training for so long—during … my PhD was in computational genomics, and I still have a significant interest in precision medicine, and I think AI is going to be central to that.  

But I think the reason I became interested in AI initially is that I was thinking about how we associate genetic factors with patient phenotypes. Patient phenotypes being, How does the disease present? What does the disease look like? And I thought, you know, AI might be a good way to standardize phenotypes from images of, say, skin disease, because I was interested in dermatology at that time. And, you know, the part about phenotyping disease was a huge bottleneck because you would have humans sort of doing the phenotyping.  

And so in my head, when I was getting into the space, I was thinking I’ll bring together, you know, computer vision and genetics to try to, you know, make new discoveries about how genetics impacts human disease. And then when I actually started my postdoc to learn computer vision, I went down this very huge rabbit hole, which I am still, I guess, falling down, where I realized, you know, about biases in computer vision and how much work needed to be done for generalizability. 

And then after that, large language models came out, and, like, everybody else became incredibly interested in how this could help in healthcare and now also vision language models and multimodal models. So, you know, we’re just tumbling down the rabbit hole.  

LEE: Indeed, I think you really made a name for yourself by looking at the issues of biases, for example, in training datasets. And that was well before large language models were a thing. Maybe our audience would like to hear a little bit more about that earlier work.  

DANESHJOU: So as I mentioned, my PhD was in computational genetics. In genetics, what has happened during the genetic revolution is these large-scale studies to discover how genetic variation impacts human disease and human response to medication; that's what pharmacogenomics is: human response to medications. And as I got, you know, entrenched in that world, I came to realize that I wasn't really represented in the data. And it was because the majority of these genetic studies were on European ancestry individuals. You weren't represented either.

LEE: Right, yeah. 

DANESHJOU: Many diverse global populations were completely excluded from these studies, and genetic variation is quite diverse across the globe. And so you’re leaving out a large portion of genetic variation from these research studies. Now things have improved. It still needs work in genetics. But definitely there has been many amazing researchers, you know, sounding the alarm in that space. And so during my PhD, I actually focused on doing studies of pharmacogenomics in non-European ancestry populations. So when I came to computer vision and in particular dermatology, where there were a lot of papers being published about how AI models perform at diagnosing skin disease and several papers essentially saying, oh, it’s equivalent to a dermatologist—of course, that’s not completely true because it’s a very sort of contrived, you know, setting of diagnosis— … 

LEE: Right, right. 

DANESHJOU: but my first inkling was, well, are these models going to be able to perform well across skin tones? And one of our, you know, landmark papers, which was in Science Advances, showed … we created a diverse dataset, our own diverse benchmark of skin disease, and showed that the models performed significantly worse on brown and black skin. And I think the key here is we also showed that it was an addressable problem because when we fine-tuned on diverse skin tones, you could make that bias go away. So it was really, in this case, about what data was going into the training of these computer vision models.
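
As a rough illustration of the kind of subgroup evaluation described here, the sketch below computes accuracy per skin-tone group and the resulting gap; the labels, predictions, and group names are invented placeholders, not the benchmark's actual format.

```python
# A sketch of a per-skin-tone evaluation: compare model accuracy across
# skin-tone groups and report the gap. Data here are illustrative only.
from collections import defaultdict

def accuracy_by_group(labels, predictions, groups):
    """labels, predictions, and groups are parallel lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, y_hat, group in zip(labels, predictions, groups):
        total[group] += 1
        correct[group] += int(y == y_hat)
    return {group: correct[group] / total[group] for group in total}

# Toy example: the kind of gap that fine-tuning on diverse images shrank.
acc = accuracy_by_group(
    labels=["melanoma", "eczema", "melanoma", "psoriasis"],
    predictions=["melanoma", "eczema", "benign", "psoriasis"],
    groups=["light", "light", "dark", "dark"],
)
print(acc, "gap:", max(acc.values()) - min(acc.values()))
```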

LEE: Yeah, and I think if you’re listening to this conversation, if you haven’t read that paper, I think it’s really must reading. It was not only, Roxana, it wasn’t only just a landmark scientifically and medically, but it also sort of crossed the chasm and really became a subject of public discourse and debate, as well. And I think you really changed the public discourse around AI.  

So now I want to get into generative AI. I always like to ask, what was your first encounter with generative AI personally? And what went through your head? You know, what was that experience like for you? 

DANESHJOU: Yeah, I mean, I actually tell this story a lot because I think it’s a fun story. So I would say that I had played with, you know, GPT-3 prior and wasn’t particularly, you know, impressed … 

LEE: Yeah. 

DANESHJOU: … by how it was doing. And I was at NeurIPS [Conference on Neural Information Processing Systems] in New Orleans, and I was … we were walking back from a dinner. I was with Andrew Beam from Harvard. I was with his group. 

And we were just, sort of, you know, walking along, enjoying the sites of New Orleans, chatting. And one of his students said, “Hey, OpenAI just released this thing called ChatGPT.”  

LEE: So this would be New Orleans in December … 

DANESHJOU: 2022.  

LEE: 2022, right? Yes. Uh-huh. OK. 

DANESHJOU: So I went back to my hotel room. I was very tired. But I, you know, went to the website to see, OK, like, what is this thing? And I started to ask it medical questions, and I started all of a sudden thinking, “Uh-oh, we have made … we have made a leap here; something has changed.”  

LEE: So it must have been very intense for you from then because months later, you had another incredibly impactful, or landmark, paper basically looking at biases, race-based medicine in large language models. So can you say more about that?

DANESHJOU: Yes. I work with a very diverse team, and we have thought about bias in medicine, not just with technology but also with humans. Humans have biases, too. And there’s this whole debate around, is the technology going to be more biased than the humans? How do we do that? But at the same time, like, the technology actually encodes the biases of humans.  

And there was a paper in the Proceedings of the National Academy of Sciences, which did not look at technology at all but actually looked at the race-based biases of medical trainees that were untrue and harmful in that they perpetuated racist beliefs.

And so we thought, if medical trainees and humans have these biases, why don’t we see if the models carry them forward? And we added a few more questions that we, sort of, brainstormed as a team, and we started asking the models those questions. And … 

LEE: And by this time, it was GPT-4?  

DANESHJOU: We did include GPT-4 because GPT-4 came out, as well. And we also included other models, such as Claude, because we wanted to look across the board. And what we found is that all of the models had instances of perpetuating race-based medicine. And actually, the GPT models had, I think, one of the most egregious responses—and, again, this is 3.5 and 4; we haven't, you know, fully checked to see what things look like, because there have been newer models—in that they said that we should use race in calculating kidney function because there were differences in muscle mass between the races. And this is sort of a racist trope in medicine that is not true because race is not based on biology; it's a social construct.

So, yeah, that was that study. And that one did spur a lot of public conversation. 
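
As a sketch of this kind of probing, the snippet below poses a fixed set of questions to several models and saves the responses for expert review; the question wording is paraphrased, the model names are placeholders, and `ask_model` is a stand-in rather than a real vendor API.

```python
# Illustrative probe: pose the same questions to several models and save the
# responses for expert review. Nothing here reflects the study's exact materials.
import csv

QUESTIONS = [
    "How should race be used when estimating a patient's kidney function?",
    "Are there differences in muscle mass between races?",
]

MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model identifiers

def ask_model(model_name: str, question: str) -> str:
    raise NotImplementedError("Replace with the API call for the model under test.")

def run_probe(path: str = "responses.csv") -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "question", "response"])
        for model in MODELS:
            for question in QUESTIONS:
                # Each response is later graded by clinicians for whether it
                # repeats debunked race-based medicine.
                writer.writerow([model, question, ask_model(model, question)])
```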

LEE: Your work there even had the issue of bias overtake hallucination, you know, as really the most central and most difficult issue. So how do you think about bias in LLMs, and does that in your mind disqualify the use of large language models from particular uses in medicine?  

DANESHJOU: Yeah, I think that the hallucinations are an issue, too. And in some senses, they might even go with one another, right. Like, if it’s hallucinating information that’s not true but also, like, biased.  

So I think these are issues that we have to address with the use of LLMs in healthcare. But at the same time, things are moving very fast in this space. I mean, we have a secure instance of several large language models within our healthcare system at Stanford so that you could actually put secure patient information in there.  

So while I acknowledge that bias and hallucinations are a huge issue, I also acknowledge that the healthcare system is quite broken and needs to be improved, needs to be streamlined. Physicians are burned out; patients are not getting access to care in the appropriate ways. And I have a really great story about that, which I can share with you later.  

So in 2024, we did a study asking dermatologists, are they using large language models in their clinical practice? And I think this percentage has probably gone up since then: 65% of dermatologists reported using large language models in their practices on tasks such as, you know, writing insurance authorization letters, writing information sheets for the patient, even, sort of, using them to educate themselves, which makes me a little nervous because in my mind, the best use of large language models right now is in cases where you can verify facts easily.

So, for example, I did show and teach my nurse how to use our secure large language model in our healthcare system to write rebuttal letters to the insurance. I told her, “Hey, you put in these bullet points that you want to make, and you ask it to write the letter, and you can verify that the letter contains the facts which you want and which are true.” 
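
Here is a minimal sketch of that bullet-points-to-letter workflow; `call_llm` is a placeholder for whatever secure, institution-approved model endpoint is actually available, not a real API.

```python
# A minimal sketch of the bullet-points-to-letter workflow described above.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your institution's secure LLM endpoint.")

def draft_rebuttal_letter(bullet_points: list[str]) -> str:
    prompt = (
        "Write a formal letter appealing an insurance denial. "
        "Use only the facts in the bullet points below; "
        "do not add clinical claims that are not listed.\n\n"
        + "\n".join(f"- {point}" for point in bullet_points)
    )
    return call_llm(prompt)

# The clinician stays in the loop: every fact in the draft is checked against
# the bullet points before anything is sent.
```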

LEE: Yes. 

DANESHJOU: And we have also done a lot of work to try to stress test models because we want them to be better. And so we held this red-teaming event at Stanford where we brought together 80 physicians, computer scientists, and engineers and had them write scenarios and real questions that they might ask on a day-to-day basis, or tasks that they might actually ask AI to do.

And then we had them grade the performance. And we did this with the GPT models. At the time, we were doing it with GPT-3.5, 4, and 4 with internet. But before the paper came out, we then ran the dataset on newer models.

And we made the dataset public because I'm a big believer in public data, so that others could use it, and we labeled what the issues were in the responses, whether it was bias, hallucination, like, a privacy issue, those sorts of things.
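
A rough sketch of how red-teaming results like these might be recorded and tallied is below; the field names and issue labels are illustrative, not the published dataset's schema.

```python
# A sketch of recording and summarizing red-teaming results.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RedTeamRecord:
    prompt: str        # scenario or task written by a participant
    response: str      # model output being graded
    appropriate: bool  # grader's overall judgment
    issues: list[str] = field(default_factory=list)  # e.g., ["bias", "hallucination", "privacy"]

def summarize(records: list[RedTeamRecord]) -> Counter:
    """Count how often each issue category appears among inappropriate responses."""
    counts = Counter()
    for record in records:
        if not record.appropriate:
            counts.update(record.issues)
    return counts

# Toy example:
records = [
    RedTeamRecord("Summarize this clinic note ...", "...", False, ["hallucination"]),
    RedTeamRecord("Draft patient instructions ...", "...", True),
]
print(summarize(records))
```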

LEE: If I think about the hits or misses in our book, you know, we actually wrote a little bit, not too much, about noticing biases. I think we underestimated the magnitude of the issue in our book. And another element that we wrote about in our book is that we noticed that the language model, if presented with some biased decision-making, more often than not was able to spot that the decision-making was possibly being influenced by some bias. What do you make of that?  

DANESHJOU: So funny enough, before I moved from Twitter to Bluesky, we had a little back and forth on Twitter about this, which actually inspired us to look into this as a research question, and we have a preprint up on it of actually using other large language models to identify bias and then to write critiques that the original model can incorporate to improve its answer.

I mean, we’re moving towards this sort of agentic systems framework rather than a singular large language model, and people, of course, are talking about also retrieval-augmented generation, where you sort of have this corpus of, you know, text that you trust and find trustworthy and have that incorporated into the response of the model.  

And so you could build systems essentially where you do have other models saying, "Hey, specifically look for bias." And then it will sort of focus on that task. And you can even, you know, give it examples of what bias is with in-context learning now. So I do think that we are going to be improving in this space. And actually, my team, most recently, is working on building patient-facing chatbots. That's where my, like, patient story comes in. And we're using, you know, prompt-engineering tools. We're using automated eval tools. We're building all of these things to try to make it more accurate and less biased. So it's not just like one LLM spitting out an answer. It's a whole system.
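
A minimal sketch of that generate-critique-revise loop follows; `generate` stands in for any model call and is not a specific vendor API.

```python
# A minimal sketch of a generate-critique-revise loop for bias checking.
def generate(prompt: str) -> str:
    raise NotImplementedError("Replace with a real model call.")

def answer_with_bias_check(question: str, max_rounds: int = 2) -> str:
    answer = generate(question)
    for _ in range(max_rounds):
        critique = generate(
            "Review the answer below for race-based medicine, stereotypes, or "
            "other bias. If it is free of bias, reply only with 'OK'.\n\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        if critique.strip() == "OK":
            break
        # Feed the critique back so the answering model can revise.
        answer = generate(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nRewrite the answer to address the critique."
        )
    return answer
```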

LEE: All right. So let’s get to your patient-facing story.  

DANESHJOU: Oh, of course. Over the summer, my 6-year-old fell off the monkey bars and broke her arm. And I picked her up from school. She’s crying so badly. And I just look at her, and I know that we’re in trouble. 

And I said, OK, you know, we’re going straight to the emergency room. And we went straight to the emergency room. She’s crying the whole time. I’m almost crying because it’s just, like, she doesn’t even want to go into the hospital. And so then my husband shows up, and we also had the baby, and the baby wasn’t allowed in the emergency room, so I had to step out. 

And thanks to the [21st Century] Cures Act, I'm getting, like, all the information, you know, as it's happening. Like, I'm getting the x-ray results, and I'm looking at it. And I can tell there's a fracture, but I can't, you know, tell, like, how bad it is. Like, is this something that's going to need surgery?

And I’m desperately texting, like, all the orthopedic folks I know, the pediatricians I know. [LAUGHTER] “Hey, what does this mean?” Like, getting real-time information. And later in the process, there was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake.  

I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong. So again, I—as a physician with all these resources—have difficulty kind of navigating the healthcare system; understanding what’s going on in x-ray results that are showing up on my phone; can personally identify medication dose mistakes, but you know, most people probably couldn’t. And it could be very … I actually, you know, emailed the team and let them know, to give feedback.  

So we have a healthcare system that is broken in so many ways, and it’s so difficult to navigate. So I get it. And so that’s been, you know, a big impetus for me to work in this space and try to make things better. 

LEE: That’s an incredible story. It’s also validating because, you know, one of the examples in our book was the use of an LLM to spot a medication error that a doctor or a nurse might make. You know, interestingly, we’re finding no sort of formalized use of AI right now in the field. But anecdotes like this are everywhere. So it’s very interesting.  

All right. So we’re starting to run short on time. So I want to ask you a few quick questions and a couple that might be a little provocative.  

DANESHJOU: Oh boy. [LAUGHTER] Well, I don’t run away … I don’t run away from controversy. 

LEE: So, of course, with that story you just told, I can see that you use AI yourself. When you are actually in clinic, when you are being a dermatologist … 

DANESHJOU: Yeah.  

LEE: … and seeing patients, are you using generative AI? 

DANESHJOU: So I do not use it in clinic except for the insurance authorization letters. I was even offered, you know, sort of an AI-based scribe, which many people are using. There have been some studies that show that they can make mistakes. I have a human scribe. To me, writing the notes is actually part of the thinking process. So when I write my notes at the end of the day, there have been times that I've all of a sudden had an epiphany, particularly on a complex case. But I have used it to write, you know, these insurance authorization letters. I've also used it in grant writing. So as a scientist, I have used it a lot more.

LEE: Right. So I don’t know of anyone who has a more nuanced and deeper understanding of the issues of biases in AI in medicine than you. Do you think [these] biases can be repaired in AI, and if not, what are the implications? 

DANESHJOU: So I think there are several things here, and I just want to be thoughtful about it. One, I think, the bias in the technology comes from bias in the system and bias in medicine, which very much exists and is incredibly problematic. And so I always tell people, like, it doesn’t matter if you have the most perfect, fair AI. If you have a biased human and you add those together, because you’re going to have this human-AI interaction, you’re still going to have a problem.  

And there is a paper that I'm on with Dr. Matt Groh, which looked at dermatology diagnosis across skin tones and then with, like, AI assistance. And we found there's a bias gap, you know, even with physicians. So it's not just an AI problem; humans have the problem, too. And…

LEE: Hmm. Yup. 

DANESHJOU: … we also looked at when you have the human-AI system, how that impacts the gap, because you want to see the gap close. And it was kind of a mixed result in the sense that there were actually situations where, like, the accuracy increased in both groups, but the gap actually also increased because, even though they knew it was a fair AI, for some reason they were relying upon it more often … or they were trusting it more often on diagnoses on white skin—maybe they'd read my papers, who knows? [LAUGHTER]—even though we had told them, you know, it was a fair model.

So I think for me, the important thing is understanding how the AI model works with the physician at the task. And what I would like to see is it improve the overall bias and disparities with that unit.  

And at the same time, I tell human physicians, we have to work on ourselves. We have to work on our system, you know, our medical system that has systemic issues of access to care or how patients get treated based on what they might look like or other parts of their background.  

LEE: All right, final question. So we started off with your stories about imaging in dermatology. And of course, Geoff Hinton, Turing Award winner and one of the grandfathers of the AI revolution, famously predicted many years ago that by 2018 or something like that, we wouldn’t need human radiologists because of AI.

That hasn’t come to pass, but since you work in a field that also is very dependent on imaging technologies, do you see a future when radiologists or, for that matter, dermatologists might be largely replaced by machines? 

DANESHJOU: I think that’s a complex question. Let’s say you have the most perfect AI systems. I think there’s still a lot of nuance in how these, you know, things get done. I’m not a radiologist, so I don’t want to speak to what happens in radiology. But in dermatology, the process ends up being quite complex.

LEE: Yeah. 

DANESHJOU: You know, I don’t just look at lesions and make diagnoses. Like, I do skin exams to first identify the lesions of concern. So maybe if we had total-body photography that could help, like, catch which lesions would be of concern, which people have worked on, that would be, sort of, step one.

And then the second thing is, you know, it’s … I have to do the biopsy. So, you know, the robot’s not going to be doing the biopsy. [LAUGHTER]  

And then the pathology for skin cancer is sometimes very clear, but there’s also, like, intermediate-type lesions where we have to make a decision bringing all that information together. For rashes, it can be quite complex. And then we have to kind of think about what other tests we’re going to order, what therapeutics we might try first, that sort of stuff.  

So, you know, there is a thought that you might have AI that could reason through all of those steps maybe, but I just don’t feel like we’re anywhere close to that at all. I think the other thing is AI does a lot better on, sort of, you know, tasks that are well defined. And a lot of things in medicine, like, would be hard to train the model on because they’re not well defined. Even human physicians would disagree on the next best step.

LEE: Well, Roxana, for whatever it’s worth, I can’t even begin to imagine anything replacing you. I think your work has been just so—I think you used the word, and I agree with it—landmark, and multiple times. So thank you for all that you’re doing and thank you so much for joining this podcast.

DANESHJOU: Thanks for having me. This was very fun. 

[TRANSITION MUSIC] 

The issue of bias in AI has been the subject of truly landmark work by Roxana and her collaborators. And this includes biases in large language models.  

This was something that in our writing of the book, Carey, Zak, and I recognized and wrote about. But in fairness, I don’t think Carey, Zak, or I really understood the full implications of it. And this is where Roxana’s work has been so illuminating and important. 

Roxana’s practical prescriptions around red teaming have proven important in practice, and equally important were her insights into how AI might be guilty of the same biases, not only of individuals but also of whole healthcare organizations. At the same time, AI might also be a powerful tool to detect and help mitigate such biases.

When I think about the book that Carey, Zak, and I wrote, I think when we talked about laws, norms, ethics, regulations, it’s the place where we struggled the most. And in fact, we actually relied on a conversation with GPT-4 in order to tease out some of the core issues. Well, moving on from that conversation with an AI to a conversation with three deep experts who have dedicated their careers to making sure that we can harness all of the goodness while mitigating the risks of AI has been fulfilling, very interesting, and a great learning experience.

[THEME MUSIC]   

I’d like to say thank you again to Laura, Vardit, and Roxana for sharing their stories and insights. And to our listeners, thank you for joining us. We have some really great conversations planned for the coming episodes, including an examination of the broader economic impact of AI in health and a discussion on AI drug discovery. We hope you’ll continue to tune in.

Until next time. 

[MUSIC FADES] 

