Revolutionizing earth observation with geospatial foundation models on AWS

Emerging transformer-based vision models for geospatial data—also called geospatial foundation models (GeoFMs)—offer a new and powerful technology for mapping the earth’s surface at a continental scale, providing stakeholders with the tooling to detect and monitor surface-level ecosystem conditions such as forest degradation, natural disaster impact, crop yield, and many others.

GeoFMs represent an emerging research field and are a type of pre-trained vision transformer (ViT) specifically adapted to geospatial data sources. GeoFMs offer immediate value without task-specific training: they excel as embedding models for geospatial similarity search and ecosystem change detection. With minimal labeled data, GeoFMs can be fine-tuned for custom tasks such as land surface classification, semantic segmentation, or pixel-level regression. Many leading models are available under very permissive licenses, making them accessible to a wide audience. Examples include SatVision-Base, Prithvi-100M, SatMAE, and Clay (used in this solution).

In this post, we explore how Clay Foundation’s Clay foundation model, available on Hugging Face, can be deployed for large-scale inference and fine-tuning on Amazon SageMaker. For illustrative purposes, we focus on a deforestation use case from the Amazon rainforest, one of the most biodiverse ecosystems in the world. Given the strong evidence that the Amazon forest system could soon be reaching a tipping point, it presents an important domain of study and a high-impact application area for GeoFMs, for example, through early detection of forest degradation. However, the solution presented here generalizes to a wide range of geospatial use cases. It also comes with ready-to-deploy code samples to help you get started quickly with deploying GeoFMs in your own applications on AWS.

Let’s dive in!

Solution overview

At the core of our solution is a GeoFM. Architecturally, GeoFMs build on the ViT architecture first introduced in the seminal 2021 research paper An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. To account for the specific properties of geospatial data (multiple channels ranging from ultraviolet to infrared, varying electromagnetic spectrum coverage, and the spatio-temporal nature of the data), GeoFMs incorporate several architectural innovations, such as variable input size (to capture multiple channels) or the addition of positional embeddings that capture spatio-temporal aspects such as seasonality and location on earth. The pre-training of these models is conducted on unlabeled geospatial data sampled from across the globe using masked autoencoders (MAE) as self-supervised learners. Sampling from global-scale data helps ensure that diverse ecosystems and surface types are represented appropriately in the training set. The result is a set of general-purpose models that can be used for three core use cases:

  • Geospatial similarity search: Quickly map diverse surface types with semantic geospatial search using the embeddings to find similar items (such as deforested areas).
  • Embedding-based change detection: Analyze a time series of geospatial embeddings to identify surface disruptions over time for a specific region.
  • Custom geospatial machine learning: Fine-tune a specialized regression, classification, or segmentation model for geospatial machine learning (ML) tasks. While this requires a certain amount of labeled data, overall data requirements are typically much lower compared to training a dedicated model from the ground up.

The general solution flow is shown in the following diagram. Note that this flow diagram is highly abstracted and omits certain architectural details for reasons of clarity. For a full architecture diagram demonstrating how the flow can be implemented on AWS, see the accompanying GitHub repository. This repository also contains detailed deployment instructions to get you started quickly with applying GeoFMs to your own use cases.

End-to-end AWS GeoFM workflow showing data ingestion, model processing, embeddings generation, and specialized geospatial applications

  1. Retrieve and process satellite imagery for GeoFM inference or training: The first step is to get the raw geospatial data into a format that’s consumable by the GeoFM. This entails breaking down the large raw satellite imagery into equally sized 256×256 pixel chips (the size that the model expects) and normalizing pixel values, among other data preparation steps required by the GeoFM that you choose. This routine can be conducted at scale using an Amazon SageMaker AI processing job.
  2. Retrieve model weights and deploy the GeoFM: Next, retrieve the open weights of the GeoFM from a model registry of your choice (Hugging Face in this example) and deploy the model for inference. The best deployment option ultimately depends on how the model is consumed. If you need to generate embeddings asynchronously, use a SageMaker AI processing job or batch transform job. For real-time inference, consider deploying to a SageMaker AI real-time endpoint, which can be configured to auto-scale with demand, allowing for large-scale inference. In this example, we use a SageMaker AI processing job with a custom Docker image for generating embeddings in batch (a minimal launch sketch follows this list).
  3. Generate geospatial embeddings: The GeoFM is an encoder-only model, meaning that it outputs an embedding vector. During inference, you perform a forward pass of the pre-processed satellite image chip through the GeoFM. This produces the corresponding embedding vector, which can be thought of as a compressed representation of the information contained in the image. This process is analogous to using text embedding models in Retrieval Augmented Generation (RAG) use cases.
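The following is a minimal sketch of how such a batch embedding job could be launched as a SageMaker processing job with a custom image. The image URI, IAM role, and S3 paths are placeholders; the full job definition and its pipeline integration are in the accompanying GitHub repository.

# Hedged sketch: launch batch embedding generation as a SageMaker processing job.
# The image URI, IAM role, and S3 paths below are placeholders.
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/geofm-inference:latest",
    role="<sagemaker-execution-role-arn>",
    instance_count=2,
    instance_type="ml.g5.xlarge",
)

processor.run(
    inputs=[ProcessingInput(source="s3://<bucket>/chips/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://<bucket>/embeddings/")],
)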

The generated geospatial embeddings can be used largely as-is for two key use cases: geospatial similarity search and ecosystem change detection.

  1. Run similarity search on the embeddings to identify semantically similar images: The GeoFM embeddings reside in the same vector space. This allows us to identify similar items by finding vectors that are close to a given query point. A common high-performance search algorithm for this is approximate nearest neighbor (ANN) search. For scalability and search performance, we index the embedding vectors in a vector database.
  2. Analyze a time series of embeddings for break points that indicate change: Instead of looking for similarity between embedding vectors, you can also measure their distance (as sketched after this list). Doing this for a specific region across time lets you pinpoint the specific times when change occurs. This allows you to use embeddings for surface change detection over time, a very common use case in geospatial analytics.
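As a minimal illustration of the distance-based idea (the change detection method used later in this post relies on PCA and harmonic regression), the following sketch scores change between consecutive observations of the same chip using cosine distance:

import numpy as np

def change_scores(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between consecutive embeddings of one chip.
    embeddings has shape (T, D) and is ordered by observation date."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine_sim = np.sum(normed[1:] * normed[:-1], axis=1)
    return 1.0 - cosine_sim  # large values hint at a potential change point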

Optionally, you can also fine-tune a model on top of the GeoFM.

  1. Train a custom head and run inference: To fine-tune a model, you add a custom (and typically lightweight) head on top of the GeoFM and fine-tune it on an (often small) labeled dataset. The GeoFM weights remain frozen and are not retrained. The custom head takes the GeoFM-generated embedding vectors as input and produces classification masks, pixel-level regression results, or simply a class per image, depending on the use case.

We explore the key steps of this workflow in the next sections. For additional details on the implementation—including how to build a high-quality user interface with Solara—see the accompanying GitHub repository.

Geospatial data processing and embedding generation

Our four-stage data processing pipeline transforms raw satellite imagery into analysis-ready vector embeddings that power advanced geospatial analytics. This orchestrated workflow uses Amazon SageMaker AI Pipelines to create a robust, reproducible, and scalable processing architecture. The end-to-end solution can process Earth observation data for a selected region of interest, with built-in flexibility to adapt to different use cases. In this example, we use Sentinel-2 imagery from the Amazon Registry of Open Data for monitoring deforestation in the Brazilian rainforest. However, our pipeline architecture is designed to work seamlessly with other satellite image providers and resolutions (such as NAIP at 1 m/pixel, or Maxar and Planet Labs imagery at sub-meter resolution).

Pipeline architecture overview

AWS GeoFM pipeline visualization showing completed preprocessing, generation, processing and consolidation steps

The SageMaker pipeline consists of the four processing steps shown in the preceding figure. Each step builds on the outputs of the previous step, with intermediate results stored in Amazon Simple Storage Service (Amazon S3).

  1. Pre-process satellite tiles: Divides the satellite imagery into chips. We chose a chip size of 256×256 pixels, as expected by Clay v1. For Sentinel-2 images with 10-meter resolution, this corresponds to an area of 2.56 × 2.56 km.
  2. Generate embeddings: Creates 768-dimensional vector representations for the chips using the Clay v1 model.
  3. Process embeddings: Performs dimensionality reduction and computes similarity metrics (for downstream analyses).
  4. Consolidate and index: Consolidates outputs and loads the embedding vectors into a vector store.

# Pipeline definition (simplified)
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name=f"EmbeddingPipeline-{aoi_name}",
    steps=[preprocess_step, embeddings_step, postprocessing_step, consolidation_step],
    parameters=[bucket_name_param, aoi_name_param, s2_grid_id_param, 
               chip_size_param, batch_size_param, s2_bands_param],
    sagemaker_session=pipeline_session
)

Step 1: Satellite data acquisition and chipping

The pipeline starts by accessing Sentinel-2 multispectral satellite imagery through the AWS Open Data program from S3 buckets. This imagery provides 10-meter resolution across multiple spectral bands including RGB (visible light) and NIR (near-infrared), which are critical for environmental monitoring.

This step filters out chips that have excessive cloud cover and divides large satellite scenes into manageable 256×256 pixel chips, which enables efficient parallel processing and creates uniform inputs for the foundation model. This step also runs on a SageMaker AI Processing job with a custom Docker image optimized for geospatial operations.

# Chip generation – Simplified version
import os
import logging

from joblib import Parallel, delayed, parallel_backend

logger = logging.getLogger(__name__)

# process_single_chip is defined in the accompanying repository; it extracts a single
# chip and returns None if the chip fails the cloud-cover filter.
def generate_chips_from_xarray(ds, chip_size, aoi_name, s2_scene_id):
    """Generates image chips of specified size from xarray dataset
       using Area of Interest (AOI) name and Sentinel-2 scene id"""
    try:
        # Calculate the number of parallel jobs (cap at 16 or use the available CPU count)
        n_jobs = min(os.cpu_count(), 16)
        logger.info(f"Using {n_jobs} parallel jobs for chip generation")
        
        # Generate all x,y coordinates for processing
        coordinates = [(x, y) 
                    for x in range(0, ds.sizes['x'], chip_size)
                    for y in range(0, ds.sizes['y'], chip_size)]
        
        logger.info(f"Generating {len(coordinates)} chips from scene")
        
        # Process chips in parallel
        with parallel_backend('threading', n_jobs=n_jobs):
            results = Parallel(verbose=1)(
                delayed(process_single_chip)(
                    x, y, ds, chip_size, s2_scene_id
                ) for x, y in coordinates
            )
        
        # Filter out None results (e.g., chips with too many cloud pixels)
        valid_results = [r for r in results if r is not None]
        
        logger.info(f"Successfully generated {len(valid_results)} valid chips")
        
        return valid_results
    except Exception as e:
        logger.error(f"Error in chip generation for scene {s2_scene_id}: {str(e)}")
        raise

For each chip, this step generates:

  • NetCDF datacubes (.netcdf) containing the full multispectral information
  • RGB thumbnails (.png) for visualization
  • Rich metadata (.parquet) with geolocation, timestamps, and other attributes

Step 2: Embedding generation using a Clay foundation model

The second step transforms the preprocessed image chips into vector embeddings using the Clay v1 foundation model. This is the most computationally intensive part of the pipeline, using multiple GPU instances (ml.g5.xlarge) to efficiently process the satellite imagery.

For each chip, this step:

  • Accesses the NetCDF datacube from Amazon S3
  • Normalizes the spectral bands according to the Clay v1 model’s input requirements
  • Generates both patch-level and class token (CLS) embeddings
  • Stores the embeddings as NumPy arrays (.npy) alongside the original data in Amazon S3 as an intermediate store

While Clay can use all Sentinel-2 spectral bands, our implementation uses RGB and NIR as input bands to generate a 768-dimensional embedding, which provided excellent results in our examples. You can easily adapt the input bands to your specific use case. These embeddings encapsulate high-level features such as vegetation patterns, urban structures, water bodies, and land use characteristics—without requiring explicit feature engineering.
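The following is a simplified sketch of what this step does for a single chip. The band statistics, the loaded encoder object, and the assumption that the class token is the first output token are placeholders for the Clay-specific code in the repository.

# Simplified, illustrative sketch of per-chip embedding generation.
# band_mean/band_std and the encoder object stand in for the Clay v1 specifics.
import numpy as np
import torch

def normalize_chip(chip: np.ndarray, band_mean: np.ndarray, band_std: np.ndarray) -> torch.Tensor:
    """Standardize a (bands, 256, 256) chip per band and add a batch dimension."""
    normed = (chip - band_mean[:, None, None]) / band_std[:, None, None]
    return torch.from_numpy(normed).float().unsqueeze(0)

def cls_embedding(encoder: torch.nn.Module, batch: torch.Tensor) -> np.ndarray:
    """Run a forward pass and return the class-token (CLS) embedding,
    assuming the encoder returns (batch, tokens, dim) with the CLS token first."""
    encoder.eval()
    with torch.no_grad():
        tokens = encoder(batch)
    return tokens[:, 0, :].squeeze(0).numpy()  # 768-dimensional vector for Clay v1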

Step 3: Embedding processing and analysis

The third step analyzes the embeddings to extract meaningful insights, particularly for time-series analysis. Running on high-memory instances, this step:

  1. Performs dimensionality reduction on the embeddings using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) (to be used later for change detection)
  2. Computes cosine similarity between embeddings over time (an alternative for change detection)
  3. Identifies significant changes in the embeddings that might indicate surface changes
  4. Saves processed embeddings in Parquet format for efficient querying

The output includes processed embedding files that contain both the original high-dimensional vectors and their reduced representations, along with computed similarity metrics.

For change detection applications, this step establishes a baseline for each geographic location and calculates deviations from this baseline over time. These deviations, captured as vector distances, provide a powerful indicator of surface changes like deforestation, urban development, or natural disasters.
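A condensed sketch of the core computations in this step is shown below: PCA for low-dimensional representations and a cosine-based deviation from a per-location baseline. The number of baseline observations is illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

def reduce_and_score(embeddings: np.ndarray, n_baseline: int = 12):
    """embeddings: (T, 768) array for one chip, ordered by date.
    Returns PCA-reduced vectors and the deviation from the baseline period."""
    reduced = PCA(n_components=2).fit_transform(embeddings)
    baseline = embeddings[:n_baseline].mean(axis=0, keepdims=True)
    deviation = 1.0 - cosine_similarity(embeddings, baseline).ravel()
    return reduced, deviation  # deviation grows as the surface changes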

Step 4: Consolidation and vector database integration

The final pipeline step consolidates the processed embeddings into a unified dataset and loads them into vector databases optimized for similarity search. The outputs include consolidated embedding files, GeoJSON grid files for visualization, and configuration files for frontend applications.

The solution supports two vector database options; refer to the accompanying GitHub repository for the specific engines and their deployment instructions.

Both options provide efficient ANN search capabilities, enabling sub-second query performance. The choice between them depends on the scale of deployment, integration requirements, and operational preferences.

With this robust data processing and embedding generation foundation in place, let’s explore the real-world applications enabled by the pipeline, beginning with geospatial similarity search.

Geospatial similarity search

Organizations working with Earth observation data have traditionally struggled with efficiently identifying specific landscape patterns across large geographic regions. Traditional Earth observation analysis requires specialized models trained on labeled datasets for each target feature. This approach forces organizations into a lengthy process of data collection, annotation, and model training before obtaining results.

In contrast, the GeoFM-powered similarity search converts satellite imagery into 768-dimensional vector embeddings that capture the semantic essence of landscape features, eliminating the need for manual feature engineering and computation of specialized indices like NDVI or NDWI.

This capability uses the Clay foundation model’s pre-training on diverse global landscapes to understand complex relationships between features without explicit programming. The result is an intuitive image-to-image search capability where users can select a reference area—such as early-stage deforestation or wildfire damage—and instantly find similar patterns across vast territories in seconds rather than weeks.

Similarity search implementation

Our implementation provides a streamlined workflow for finding similar geographic areas using the embeddings generated by the data processing pipeline. The search process involves:

  1. Reference area selection: Users select a reference chip representing a search term (for example, a deforested patch, urban development, or agricultural field)
  2. Search parameters: Users specify the number of results and a similarity threshold
  3. Vector search execution: The system retrieves the most similar chips using cosine similarity between embeddings (a brute-force sketch of this ranking follows the list)
  4. Result visualization: Matching chips are highlighted on the map
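The following sketch shows the ranking logic behind step 3 as a brute-force cosine similarity search over in-memory embeddings; in the deployed solution, the vector database performs the equivalent lookup with an ANN index.

import numpy as np

def top_k_similar(query_emb: np.ndarray, index_embs: np.ndarray, k: int = 10, threshold: float = 0.8):
    """Return (index, similarity) pairs of the k most similar chips above the threshold."""
    q = query_emb / np.linalg.norm(query_emb)
    m = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]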

Let’s dive deeper on a real-world application, taking our running example of detecting deforestation in the Mato Grosso region of the Brazilian Amazon. Traditional monitoring approaches often detect forest loss too late—after significant damage has already occurred. The Clay-powered similarity search capability offers a new approach by enabling early detection of emerging deforestation patterns before they expand into large-scale clearing operations.

Using a single reference chip showing the initial signs of forest degradation—such as selective logging, small clearings, or new access roads—analysts can instantly identify similar patterns across vast areas of the Amazon rainforest. As demonstrated in the following example images, the system effectively recognizes the subtle signatures of early-stage deforestation based on a single reference image. This capability enables environmental protection agencies and conservation organizations to deploy resources precisely, improving anti-deforestation efforts by addressing threats before they escalate into major forest loss. While a single reference chip led to good results in our examples, alternative approaches exist, such as an average vector strategy, which uses embeddings from multiple reference images to enhance the similarity search results.

Geospatial search interface displaying deforestation patterns with reference image, parameters, and top matches

Ecosystem change detection

Unlike vector-based similarity search, change detection focuses on measuring the distance between embedding vectors over time. The core assumption is that the more distant two embedding vectors are, the more dissimilar the underlying satellite imagery is. Applied to a single region over time, this lets you pinpoint so-called change points: periods when a significant and long-lasting change in surface conditions occurred.

Our solution implements a timeline view of Sentinel-2 satellite observations from 2018 to present. Each observation point corresponds to a unique satellite image, allowing for detailed temporal analysis. Because embedding vectors are high-dimensional, we use the previously computed PCA (and optionally t-SNE) to reduce them to a single dimension for visualization purposes.

Let’s review a compelling example from our analysis of deforestation in the Amazon. The following image is a time series plot of geospatial embeddings (first principal component) for a single 256×256 pixel chip. Cloudy images and major outliers have been removed.

Environmental change analysis graph for Chip ID 39_18 showing progression with satellite imagery snapshots from 2018-2024

Points clustered closely on the y-axis indicate similar ground conditions; sudden and persistent discontinuities in the embedding values signal significant change. Here’s what the analysis shows:

  • Stable forest conditions from 2018 through 2020
  • A significant discontinuity in embedding values during 2021. Closer review of the underlying satellite imagery shows clear evidence of forest clearing and conversion to agricultural fields
  • Further transformation visible in 2024 imagery

Naturally, we need a way to automate the process of change detection so that it can be applied at scale. Given that we do not typically have extensive changepoint training datasets, we need an unsupervised approach that works without labeled data. The intuition behind unsupervised change detection is the following: identify what normal looks like, then highlight large enough deviations from normal and flag them as change points; after a change point has occurred, characterize the new normal and repeat the process.

The following function performs harmonic regression analysis on the embeddings timeseries data, specifically designed to model yearly seasonality patterns. The function fits a harmonic regression with a specified frequency (default 365 days for annual patterns) to the embedding data of a baseline period (the year 2018 in this example). It then generates predictions and calculates error metrics (absolute and percentage deviations). Large deviations from the normal seasonal pattern indicate change and can be automatically flagged using thresholding.

import datetime

import numpy as np
import pandas as pd
import matplotlib.dates as mdates
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

def fit_harmonic_regression(data, outliers, date_col="date", y_col="patch_emb_pca_1", 
                          baseline_start_date=datetime.datetime(2018,1,1), 
                          monitoring_start_date=datetime.datetime(2019,1,1), 
                          deg=3, reg=0.001, avg_days_yr=365):
    """
    Fits a harmonic regression model to time series data using Lasso regularization.
    
    Parameters:
    - data: DataFrame containing the time series data
    - outliers: List of dates to exclude from the fitting process
    - date_col: Name of the column containing dates (default: 'date')
    - y_col: Name of the column containing target values (default: 'patch_emb_pca_1')
    - baseline_start_date: Start date for the baseline period (default: Jan 1, 2018)
    - monitoring_start_date: Start date for the monitoring period (default: Jan 1, 2019)
    - deg: Degree of polynomial features (default: 3)
    - reg: Regularization strength for Lasso (default: 0.001)
    - avg_days_yr: Number of days in a year (default: 365)
    """

    # Convert date column to datetime format and create numerical representation
    data[date_col] = pd.to_datetime(data[date_col])
    data["date_numerical"] = data[date_col].apply(lambda x: mdates.date2num(x))
    t_full = data["date_numerical"]
    y_full = data[y_col]
    
    # Filter data for model fitting:
    # - Remove outliers
    # - Keep only data between baseline_start_date and monitoring_start_date
    t_fitting = data[
        (~data[date_col].isin(outliers)) & 
        (data[date_col] >= baseline_start_date) & 
        (data[date_col] < monitoring_start_date)
    ]["date_numerical"]
    
    y_fitting = data[
        (~data[date_col].isin(outliers)) & 
        (data[date_col] >= baseline_start_date) & 
        (data[date_col] < monitoring_start_date)
    ][y_col]
    
    # Create design matrix for harmonic regression
    # Convert time to angular frequency (2π/period)
    w = 2 * np.pi / avg_days_yr
    
    # Generate polynomial features from sine and cosine transformations
    poly = PolynomialFeatures(deg)
    # Create feature matrix for fitting period
    X_fitting = poly.fit_transform(np.column_stack((
        np.sin(w*t_fitting), 
        np.cos(w*t_fitting)
    )))
    # Create feature matrix for all data points (reuse the transformer fitted above)
    X_full = poly.transform(np.column_stack((
        np.sin(w*t_full), 
        np.cos(w*t_full)
    )))

    # Initialize and fit Lasso regression model
    lasso_model = Lasso(alpha=reg)
    lasso_model.fit(X_fitting, y_fitting)
    
    # Generate predictions for all time points
    y_fit = lasso_model.predict(X_full)
    
    # Calculate error metrics
    absolute_deviation = np.abs(y_full - y_fit)
    percentage_deviation = (absolute_deviation / np.abs(y_fit)) * 100
    
    # Compile results into a DataFrame
    df = pd.DataFrame()
    df["date"] = data[date_col]
    df["date_numerical"] = data["date_numerical"] 
    df[f"{y_col}_true"] = data[y_col]          # Original values
    df[f"{y_col}_pred"] = y_fit                # Predicted values
    df[f"{y_col}_abs_error"] = absolute_deviation    # Absolute error
    df[f"{y_col}_perc_error"] = percentage_deviation # Percentage error
    # Add time-based features
    df["year"] = df["date"].apply(lambda x: x.year)
    df["month"] = df["date"].apply(lambda x: x.month)
    df["year_month"] = df.apply(
        lambda x: "{}_{}".format(str(x.year), str(x.month)), 
        axis=1
    )
    # Return the fitted model, polynomial transformer, and results DataFrame
    return lasso_model, poly, df

When we apply this to the chips across an area of observation and define a threshold on the maximum deviation from the fitted harmonic regression, we can automatically map change intensity, allowing analysts to quickly zoom in on problematic areas.
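As an illustration, the following snippet applies the function above to one chip’s embedding time series and flags the chip if its percentage deviation during the monitoring period exceeds a threshold. The DataFrame data and the threshold value are placeholders.

# Illustrative thresholding on the output of fit_harmonic_regression.
# "data" holds one chip's embedding time series; the threshold is use-case specific.
model, poly, results = fit_harmonic_regression(data, outliers=[], y_col="patch_emb_pca_1")

monitoring = results[results["date"] >= datetime.datetime(2019, 1, 1)]
threshold = 50  # maximum tolerated percentage deviation from the seasonal baseline
chip_changed = (monitoring["patch_emb_pca_1_perc_error"] > threshold).any()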

Change detection interface displaying satellite imagery with high to low intensity changes highlighted

While this method performs well in our analyses, it is also quite rigid in that it requires careful tuning of error thresholds and the definition of a baseline period. More sophisticated approaches are available, ranging from general-purpose time series methods that automate baseline definition and change point detection using recursive methods (for example, Gaussian processes) to specialized algorithms for geospatial change detection (for example, LandTrendr and Continuous Change Detection and Classification (CCDC)).

In sum, our approach to change detection demonstrates the power of geospatial embedding vectors in tracking environmental changes over time, providing valuable insights for land use monitoring, environmental protection, and urban planning applications.

GeoFM fine-tuning for your custom use case

Fine-tuning is a specific implementation of transfer learning, in which a pre-trained foundation model is adapted to specific tasks through targeted additional training on specialized labeled datasets. For GeoFMs, these specific tasks can target agriculture, disaster monitoring, or urban analysis. The model retains its broad spatial understanding while developing expertise for particular regions, ecosystems, or analytical tasks. This approach significantly reduces computational and data requirements compared to building specialized models from scratch, without sacrificing accuracy. Fine-tuning typically involves preserving the pre-trained Clay encoder—which has already learned rich representations of spectral patterns, spatial relationships, and temporal dynamics from massive amounts of satellite imagery—while attaching and training a specialized task-specific head.

For pixel-wise prediction tasks—such as land use segmentation—the specialized head is typically a decoder architecture, whereas for class-level outputs (classification tasks) the head can be as simple as a multilayer perceptron. Training focuses exclusively on this new head; the GeoFM’s encoder remains frozen.

The segmentation framework combines the powerful pre-trained Clay encoder with an efficient convolutional decoder, taking Clay’s rich understanding of satellite imagery and converting it into detailed land use maps. The lightweight decoder features convolutional layers and pixel shuffle upsampling that take the feature representations from Clay’s frozen encoder and gradually transform them back to full-resolution images in which each pixel is classified according to its land use type. By freezing the encoder (which contains 24 transformer layers with 16 attention heads each) and training only the compact decoder, the model achieves a good balance between computational efficiency and segmentation accuracy.
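The following is a minimal PyTorch sketch of such a decoder head, assuming the frozen encoder yields a 16×16 grid of 768-dimensional patch embeddings per 256×256 chip; the patch grid size and the number of land use classes are illustrative assumptions to adjust to the actual model and dataset.

import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Lightweight decoder sketch: maps frozen-encoder patch embeddings to per-pixel class logits."""
    def __init__(self, emb_dim: int = 768, patch_size: int = 16, n_classes: int = 9):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(emb_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, n_classes * patch_size**2, kernel_size=1),
            nn.PixelShuffle(patch_size),  # each patch cell becomes a patch_size x patch_size block of pixels
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, grid, grid, emb_dim) from the frozen encoder
        x = patch_embeddings.permute(0, 3, 1, 2)  # (batch, emb_dim, grid, grid)
        return self.decoder(x)                    # (batch, n_classes, grid*patch_size, grid*patch_size)

Training then optimizes only the head’s parameters (for example, with a cross-entropy loss), while the encoder forward pass runs without gradient updates.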

We applied this segmentation architecture to a labeled land use land cover (LULC) dataset from Impact Observatory, hosted on the Amazon Registry of Open Data. For illustrative purposes, we again focused on our running example from Brazil’s Mato Grosso region. We trained the decoder head for 10 epochs, which took 17 minutes in total, and tracked intersection over union (IoU) and F1 score as segmentation accuracy metrics. After just one training epoch, the model already achieved 85.7% validation IoU. With the full 10 epochs completed, performance increased to an impressive 92.4% IoU and 95.6% F1 score. In the following image, we show ground truth satellite imagery (upper) and the model’s predictions (lower). The visual comparison highlights how accurately this approach can classify different land use categories.

Comparison showing land-use classification prediction against ground-truth data

Conclusion

Novel GeoFMs provide an encouraging new approach to geospatial analytics. Through their extensive pre-training, these models have incorporated a deep implicit understanding of geospatial data and can be used out of the box for high-impact use cases such as similarity search or change detection. They can also serve as the basis for specialized models using a fine-tuning process that is significantly less data-hungry (less labeled data needed) and has lower compute requirements.

In this post, we have shown how you can deploy a state-of-the-art GeoFM (Clay) on AWS and have explored one specific use case – monitoring deforestation in the Amazon rainforest – in greater detail. The same approach is applicable to a large variety of industry use cases. For example, insurance companies can use a similar approach to assess damage after natural disasters, including hurricanes, floods, or fires, and to keep track of their insured assets. Agricultural organizations can use GeoFMs for crop type identification, crop yield prediction, and other use cases. We also envision high-impact use cases in industries like urban planning, emergency and disaster response, supply chain and global trade, sustainability and environmental modeling, and many others. To get started applying GeoFMs to your own earth observation use case, check out the accompanying GitHub repository, which has the prerequisites and a step-by-step walkthrough to run it on your own area of interest.


About the Authors

Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.

Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.

Dr. Iza Moise is a Senior Machine Learning (ML) Prototyping Architect at AWS, with expertise in both traditional ML and advanced techniques like foundation models and vision transformers. She focuses on applied ML across diverse scientific fields, publishing and reviewing at Amazon’s internal ML conferences. Her strength lies in translating theoretical advances into practical solutions that deliver measurable impact through thoughtful implementation.

Create an agentic RAG application for advanced knowledge discovery with LlamaIndex, and Mistral in Amazon Bedrock

Agentic Retrieval Augmented Generation (RAG) applications represent an advanced approach in AI that integrates foundation models (FMs) with external knowledge retrieval and autonomous agent capabilities. These systems dynamically access and process information, break down complex tasks, use external tools, apply reasoning, and adapt to various contexts. They go beyond simple question answering by performing multi-step processes, making decisions, and generating complex outputs.

In this post, we demonstrate an example of building an agentic RAG application using the LlamaIndex framework. LlamaIndex is a framework that connects FMs with external data sources. It helps ingest, structure, and retrieve information from databases, APIs, PDFs, and more, enabling the agent and RAG for AI applications.

This application serves as a research tool, using the Mistral Large 2 FM on Amazon Bedrock to generate responses for the agent flow. The example application interacts with well-known websites, such as arXiv, GitHub, TechCrunch, and DuckDuckGo, and can access knowledge bases containing documentation and internal knowledge.

This application can be further expanded to accommodate broader use cases requiring dynamic interaction with internal and external APIs, as well as the integration of internal knowledge bases to provide more context-aware responses to user queries.

Solution overview

This solution uses the LlamaIndex framework to build an agent flow with two main components: AgentRunner and AgentWorker. The AgentRunner serves as an orchestrator that manages conversation history, creates and maintains tasks, executes task steps, and provides a user-friendly interface for interactions. The AgentWorker handles the step-by-step reasoning and task execution.

For reasoning and task planning, we use Mistral Large 2 on Amazon Bedrock. You can use other text generation FMs available from Amazon Bedrock. For the full list of supported models, see Supported foundation models in Amazon Bedrock. The agent integrates with GitHub, arXiv, TechCrunch, and DuckDuckGo APIs, while also accessing internal knowledge through a RAG framework to provide context-aware answers.

In this solution, we present two options for building the RAG framework:

  • Document integration with Amazon OpenSearch Serverless – The first option involves using LlamaIndex to programmatically load and process documents. It splits the documents into chunks using various chunking strategies and then stores these chunks in an Amazon OpenSearch Serverless vector store for future retrieval.
  • Document integration with Amazon Bedrock Knowledge Bases – The second option uses Amazon Bedrock Knowledge Bases, a fully managed service that handles the loading, processing, and chunking of documents. This service can quickly create a new vector store on your behalf with a few configurations and clicks. You can choose from Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL-Compatible Edition Serverless, and Amazon Neptune Analytics. Additionally, the solution includes a document retrieval rerank feature to enhance the relevance of the responses.

You can select the RAG implementation option that best suits your preference and developer skill level.

The following diagram illustrates the solution architecture.

AWS Agent architecture diagram illustrating AgentRunner management system and AgentWorker's integrated development toolkit

In the following sections, we present the steps to implement the agentic RAG application. You can also find the sample code in the GitHub repository.

Prerequisites

The solution has been tested in the AWS Region us-west-2. Complete the following steps before proceeding:

  1. Set up the following resources:
    1. Create an Amazon SageMaker domain.
    2. Create a SageMaker domain user profile.
    3. Launch Amazon SageMaker Studio, select JupyterLab, and create a space.
    4. Select the instance t3.medium and the image SageMaker Distribution 2.3.1, then run the space.
  2. Request model access:
    1. On the Amazon Bedrock console, choose Model access in the navigation pane.
    2. Choose Modify model access.
    3. Select the models Mistral Large 2 (24.07), Amazon Titan Text Embeddings V2, and Rerank 1.0 from the list, and request access to these models.
  3. Configure AWS Identity and Access Management (IAM) permissions:
    1. In the SageMaker console, go to the SageMaker user profile details and find the execution role that the SageMaker notebook uses. It should look like AmazonSageMaker-ExecutionRole-20250213T123456.
  4. In the IAM console, create an inline policy for this execution role that grants it permission to perform the following actions:
    1. Access to Amazon Bedrock services including:
      • Reranking capabilities
      • Retrieving information
      • Invoking models
      • Listing available foundation models
    2. IAM permissions to:
      • Create policies
      • Attach policies to roles within your account
    3. Full access to Amazon OpenSearch Serverless service
  5. Run the following command in the JupyterLab notebook terminal to download the sample code from GitHub:
git init
git remote add origin https://github.com/aws-samples/mistral-on-aws.git
git sparse-checkout init
git sparse-checkout set "notebooks/mistral-llamaindex-agentic-rag"
git pull origin main
  6. Finally, install the required Python packages by running the following command in the terminal:
cd mistral-llamaindex-agentic-rag
pip install -r requirements.txt

Initialize the models

Initialize the FM used for orchestrating the agentic flow with Amazon Bedrock Converse API. This API provides a unified interface for interacting with various FMs available on Amazon Bedrock. This standardization simplifies the development process, allowing developers to write code one time and seamlessly switch between different models without adjusting for model-specific differences. In this example, we use the Mistral Large 2 model on Amazon Bedrock.

Next, initialize the embedding model from Amazon Bedrock, which is used for converting document chunks into embedding vectors. For this example, we use Amazon Titan Text Embeddings V2. See the following code:

# Initialise and configure the BedrockConverse LLM with the Mistral Large 2 model and set it as the default in Settings

from llama_index.llms.bedrock_converse import BedrockConverse
from llama_index.core import Settings
llm = BedrockConverse(model="mistral.mistral-large-2407-v1:0", max_tokens=2048)
Settings.llm = llm

# Initialise and configure the embedding model with Amazon Titan Text Embeddings V2, and set it as the default in Settings

from llama_index.embeddings.bedrock import BedrockEmbedding
embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")
Settings.embed_model = embed_model

Integrate API tools

Implement two functions to interact with the GitHub and TechCrunch APIs. The APIs shown in this post don’t require credentials. To provide clear communication between the agent and the foundation model, follow Python function best practices, including:

  • Type hints for parameter and return value validation
  • Detailed docstrings explaining function purpose, parameters, and expected returns
  • Clear function descriptions

The following code sample shows the function that integrates with the GitHub API. After the function is created, use the FunctionTool.from_defaults() method to wrap the function as a tool and integrate it seamlessly into the LlamaIndex workflow.

See the code repository for the full code samples of the function that integrates with the TechCrunch API.

# Define a function to search GitHub repositories by topic, sorting by stars or update date, and return top results
import requests
from llama_index.core.tools import FunctionTool
def github_search(topic: str, num_results: int = 3, sort_by: str = "stars") -> list:
    """
    Retrieve a specified number of GitHub repositories based on a given topic, 
    ranked by the specified criteria.

    This function uses the GitHub API to search for repositories related to a 
    specific topic or keyword. The results can be sorted by the number of stars 
    (popularity) or the most recent update, with the most relevant repositories 
    appearing first according to the chosen sorting method.

    Parameters:
    -----------
    topic : str
        The topic or keyword to search for in GitHub repositories.
        The topic cannot contain blank spaces.
    num_results : int, optional
        The number of repository results to retrieve. Defaults to 3.
    sort_by : str, optional
        The criterion for sorting the results. Options include:
        - 'stars': Sort by the number of stars (popularity).
        - 'updated': Sort by the date of the last update (most recent first).
        Defaults to 'stars'.

    Returns:
    --------
    list
        A list of dictionaries, where each dictionary contains information 
        about a repository. Each dictionary includes:
        - 'html_url': The URL of the repository.
        - 'description': A brief description of the repository.
        - 'stargazers_count': The number of stars (popularity) the repository has.
    """

    url = f"https://api.github.com/search/repositories?q=topic:{topic}&sort={sort_by}&order=desc"
    response = requests.get(url).json()
    code_repos = [
        {
            'html_url': item['html_url'],
            'description': item['description'],
            'stargazers_count': item['stargazers_count'],
        }
        for item in response['items'][:num_results]
    ]
    return code_repos

github_tool = FunctionTool.from_defaults(fn=github_search)

For arXiv and DuckDuckGo integration, we use LlamaIndex’s pre-built tools instead of creating custom functions. You can explore other available pre-built tools in the LlamaIndex documentation to avoid duplicating existing solutions.

# Import and configure the ArxivToolSpec and DuckDuckGoSearchToolSpec from LlamaIndex prebuilt tools

from llama_index.tools.arxiv import ArxivToolSpec
from llama_index.tools.duckduckgo import DuckDuckGoSearchToolSpec

arxiv_tool = ArxivToolSpec()
search_tool = DuckDuckGoSearchToolSpec()

api_tools = arxiv_tool.to_tool_list() + search_tool.to_tool_list()

# Consolidate all tools into one list.
# news_tool is the TechCrunch search tool created earlier (see the code repository).
api_tools.extend([news_tool, github_tool])

RAG option 1: Document integration with Amazon OpenSearch Serverless

Next, programmatically build the RAG component using LlamaIndex to load, process, and chunk documents, and store the embedding vectors in Amazon OpenSearch Serverless. This approach offers greater flexibility for advanced scenarios, such as loading various file types (including .epub and .ppt) and selecting advanced chunking strategies based on file type (such as HTML, JSON, and code).

Before moving forward, you can download some PDF documents for testing from the AWS website using the following command, or you can use your own documents. The following documents are AWS guides that help in choosing the right generative AI service (such as Amazon Bedrock or Amazon Q) based on use case, customization needs, and automation potential. They also assist in selecting AWS machine learning (ML) services (such as SageMaker) for building models, using pre-trained AI, and using cloud infrastructure.

# download test documents from the links below (URLs quoted so the shell doesn't interpret "&")
!wget -O docs/genai_on_aws.pdf "https://docs.aws.amazon.com/pdfs/decision-guides/latest/generative-ai-on-aws-how-to-choose/generative-ai-on-aws-how-to-choose.pdf?did=wp_card&trk=wp_card#guide"
!wget -O docs/ml_on_aws.pdf "https://docs.aws.amazon.com/pdfs/decision-guides/latest/machine-learning-on-aws-how-to-choose/machine-learning-on-aws-how-to-choose.pdf?did=wp_card&trk=wp_card#guide"

Load the PDF documents using SimpleDirectoryReader() in the following code. For a full list of supported file types, see the LlamaIndex documentation.

# use Llamaindex to load documents 
from llama_index.core import SimpleDirectoryReader
loader = SimpleDirectoryReader('docs/')
documents = loader.load_data()

Next, create an Amazon OpenSearch Serverless collection as the vector database. Check the utils.py file for details on the create_collection() function.

# Create Amazon OpenSearch Serverless collection 
from utils import *
import sagemaker 
import random

region_name = "us-west-2"
suffix = random.randrange(1, 500)
collection_name = "llamaindex-blog-"+str(suffix)
notebook_execution_role = sagemaker.get_execution_role()
endpoint = create_collection(collection_name, notebook_execution_role)

After you create the collection, create an index to store embedding vectors:

## create an index in the collection
index_name = "pdf-rag"
create_index(index_name, endpoint, emb_dim=1024)

Next, use the following code to implement a document search system using LlamaIndex integrated with Amazon OpenSearch Serverless. It first sets up AWS authentication to securely access OpenSearch Service, then configures a vector client that can handle 1024-dimensional embeddings (specifically designed for the Amazon Titan Embedding V2 model). The code processes input documents by breaking them into manageable chunks of 1,024 tokens with a 20-token overlap, converts these chunks into vector embeddings, and stores them in the OpenSearch Serverless vector index. You can select a different or more advanced chunking strategy by modifying the transformations parameter in the VectorStoreIndex.from_documents() method. For more information and examples, see the LlamaIndex documentation.

import boto3
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from opensearchpy import RequestsHttpConnection, AWSV4SignerAuth
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SentenceSplitter

## integrate Amazon OpenSearch Serverless collection and index to llamaindex 

dim = 1024 # Amazon Titan Embedding V2 model dimension 
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)

client = OpensearchVectorClient(
    endpoint, 
    index_name, 
    dim, 
    embedding_field="vector", 
    text_field="chunk",
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# initialise vector store and save document chunks to the vector store 
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=20)]
)

You can add a reranking step in the RAG pipeline, which improves the quality of information retrieved by making sure that the most relevant documents are presented to the language model, resulting in more accurate and on-topic responses:

from llama_index.postprocessor.bedrock_rerank import AWSBedrockRerank
reranker = AWSBedrockRerank(
    top_n=3,
    model_id="amazon.rerank-v1:0",#  another rerank model option is: cohere.rerank-v3-5:0
    region_name="us-west-2",
)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

Use the following code to test the RAG framework. You can compare results by enabling or disabling the reranker model.

response = query_engine.query(
    "In which situation should I use Amazon Bedrock over Amazon SageMaker?",
)

Next, convert the vector store into a LlamaIndex QueryEngineTool, which requires a tool name and a comprehensive description. This tool is then combined with other API tools to create an agent worker that executes tasks in a step-by-step manner. The code initializes an AgentRunner to orchestrate the entire workflow, analyzing text inputs and generating responses. The system can be configured to support parallel tool execution for improved efficiency.

# create QueryEngineTool based on the OpenSearch vector store 

from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.agent import FunctionCallingAgentWorker, AgentRunner
oss_tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="oss_guide_tool",
            description="""
            These decision guides help users select appropriate AWS machine learning and generative AI services based on specific needs. 
            They cover pre-built solutions, customizable platforms, and infrastructure options for ML workflows, 
            while outlining how generative AI can automate processes, personalize content, augment data, reduce costs, 
            and enable faster experimentation in various business contexts.""",
        ),
    )

all_tools = api_tools +[oss_tool]

agent_worker = FunctionCallingAgentWorker.from_tools(
    all_tools, 
    llm=llm, 
    verbose=True, # Set verbose=True to display the full trace of steps. 
    system_prompt = system_prompt,
    # allow_parallel_tool_calls = True  # Uncomment this line to allow multiple tool invocations
)
agent = AgentRunner(agent_worker)
response = agent.chat(text_input)

You have now completed building the agentic RAG application using LlamaIndex and Amazon OpenSearch Serverless. You can test the chatbot application with your own questions. For example, ask about the latest news and features regarding Amazon Bedrock, or inquire about the latest papers and most popular GitHub repositories related to generative AI.

RAG option 2: Document integration with Amazon Bedrock Knowledge Bases

In this section, you use Amazon Bedrock Knowledge Bases to build the RAG framework. You can create an Amazon Bedrock knowledge base on the Amazon Bedrock console or follow the provided notebook example to create it programmatically. Create a new Amazon Simple Storage Service (Amazon S3) bucket for the knowledge base, then upload the previously downloaded files to this S3 bucket. You can select different embedding models and chunking strategies that work better for your data. After you create the knowledge base, remember to sync the data. Data synchronization might take a few minutes.
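If you create the knowledge base programmatically, the data sync can be started with the AWS SDK once the documents are in the S3 bucket, as in the following hedged sketch. The bucket name, knowledge base ID, and data source ID are placeholders.

import boto3

# Upload a document to the knowledge base bucket and start an ingestion job (data sync).
s3 = boto3.client("s3")
s3.upload_file("docs/genai_on_aws.pdf", "<kb-bucket-name>", "docs/genai_on_aws.pdf")

bedrock_agent = boto3.client("bedrock-agent", region_name="us-west-2")
bedrock_agent.start_ingestion_job(
    knowledgeBaseId="<knowledge-base-id>",
    dataSourceId="<data-source-id>",
)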

To enable your newly created knowledge base to invoke the rerank model, you need to modify its permissions. First, open the Amazon Bedrock console and locate the service role that matches the one shown in the following screenshot.

Amazon Bedrock console showing Knowledge Base setup with execution role, vector store configuration, and data source controls

Choose the role and add the following provided IAM permission policy as an inline policy. This additional authorization grants your knowledge base the necessary permissions to successfully invoke the rerank model on Amazon Bedrock.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
        },
        {
            "Effect": "Allow",
            "Action": "bedrock:Rerank",
            "Resource": "*"
        }
    ]
}

Use the following code to integrate the knowledge base into the LlamaIndex framework. Specific configurations can be provided in the retrieval_config parameter, where numberOfResults is the maximum number of retrieved chunks from the vector store, and overrideSearchType has two valid values: HYBRID and SEMANTIC. In the rerankConfiguration, you can optionally provide a rerank modelConfiguration and numberOfRerankedResults to sort the retrieved chunks by relevancy scores and select only the defined number of results. For the full list of available configurations for retrieval_config, refer to the Retrieve API documentation.

# Configure a knowledge base retriever using AmazonKnowledgeBasesRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

# maximum number of relevant text chunks that will be retrieved
# If you need quick, focused answers: lower numbers (1-3)
# If you need detailed, comprehensive answers: higher numbers (5-10)
top_k = 10

# search mode options: HYBRID, SEMANTIC
# HYBRID search combines the strengths of semantic search and keyword search 
# Balances semantic understanding with exact matching
# https://docs.llamaindex.ai/en/stable/examples/retrievers/bedrock_retriever/
search_mode = "HYBRID"

kb_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config={
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "overrideSearchType": search_mode,
            'rerankingConfiguration': {
                'bedrockRerankingConfiguration': {
                    'modelConfiguration': {
                        'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0'
                    },
                    'numberOfRerankedResults': 3
                },
                'type': 'BEDROCK_RERANKING_MODEL'
            }
        },
        
    }
)
kb_engine = RetrieverQueryEngine(retriever=kb_retriever)

Like the first option, you can create the knowledge base as a QueryEngineTool in LlamaIndex and combine it with other API tools. Then, you can create a FunctionCallingAgentWorker using these combined tools and initialize an AgentRunner to interact with them. By using this approach, you can chat with and take advantage of the capabilities of the integrated tools.

# Create a query tool for Bedrock Knowledge Base
kb_tool = QueryEngineTool(
        query_engine=kb_engine,
        metadata=ToolMetadata(
            name="kb_tool",
            description="""
            These decision guides help users select appropriate AWS machine learning and generative AI services based on specific needs. 
            They cover pre-built solutions, customizable platforms, and infrastructure options for ML workflows, 
            while outlining how generative AI can automate processes, personalize content, augment data, reduce costs, 
            and enable faster experimentation in various business contexts.""",
        ),
    )

# Update the agent to include all API tools and the Knowledge Base tool.
all_tools = api_tools +[kb_tool]

agent_worker = FunctionCallingAgentWorker.from_tools(
    all_tools, 
    llm=llm, 
    verbose=True, # Set verbose=True to display the full trace of steps. 
    system_prompt = system_prompt,
    # allow_parallel_tool_calls = True  # Uncomment this line to allow multiple tool invocations
)
agent = AgentRunner(agent_worker)
response = agent.chat(text_input)

Now you have built the agentic RAG solution using LlamaIndex and Amazon Bedrock Knowledge Bases.

Clean up

When you finish experimenting with this solution, use the following steps to clean up the AWS resources to avoid unnecessary costs:

  1. In the Amazon S3 console, delete the S3 bucket and data created for this solution.
  2. In the OpenSearch Service console, delete the collection that was created for storing the embedding vectors.
  3. In the Amazon Bedrock Knowledge Bases console, delete the knowledge base you created.
  4. In the SageMaker console, navigate to your domain and user profile, and launch SageMaker Studio to stop or delete the JupyterLab instance.

Conclusion

This post demonstrated how to build a powerful agentic RAG application using LlamaIndex and Amazon Bedrock that goes beyond traditional question answering systems. By integrating Mistral Large 2 as the orchestrating model with external APIs (GitHub, arXiv, TechCrunch, and DuckDuckGo) and internal knowledge bases, you’ve created a versatile technology discovery and research tool.

We showed you two complementary approaches to implement the RAG framework: a programmatic implementation using LlamaIndex with Amazon OpenSearch Serverless, providing maximum flexibility for advanced use cases, and a managed solution using Amazon Bedrock Knowledge Bases that simplifies document processing and storage with minimal configuration. You can try out the solution using the code sample in the accompanying GitHub repository.

For more relevant information, see Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon OpenSearch Serverless, and Use a reranker model in Amazon Bedrock. Refer to Mistral AI in Amazon Bedrock to see the latest Mistral models that are available on both Amazon Bedrock and AWS Marketplace.


About the Authors

Ying Hou, PhD, is a Sr. Specialist Solution Architect for Gen AI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in Gen AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications. Outside of architecting innovative AI solutions, she enjoys spending quality time with her family, getting lost in novels, and exploring the UK’s national parks.

Preston Tuggle is a Sr. Specialist Solutions Architect with the Third-Party Model Provider team at AWS. He focuses on working with model providers across Amazon Bedrock and Amazon SageMaker, helping them accelerate their go-to-market strategies through technical scaling initiatives and customer engagement.

Read More

The Supercomputer Designed to Accelerate Nobel-Worthy Science

The Supercomputer Designed to Accelerate Nobel-Worthy Science

Ready for a front-row seat to the next scientific revolution?

That’s the idea behind Doudna — a groundbreaking supercomputer being built at Lawrence Berkeley National Laboratory. The system represents a major national investment in advancing U.S. high-performance computing leadership, ensuring U.S. researchers have access to cutting-edge tools to address global challenges.

Also known as NERSC-10, Doudna is named for Nobel laureate and CRISPR pioneer Jennifer Doudna. The next-generation system announced today at Lawrence Berkeley National Laboratory is designed not just for speed, but for impact.

“The Doudna system represents DOE’s commitment to advancing American leadership in science, AI, and high-performance computing,” said U.S. Secretary of Energy Chris Wright in a statement. “It will be a powerhouse for rapid innovation that will transform our efforts to develop abundant, affordable energy supplies and advance breakthroughs in quantum computing.”

Powered by Dell infrastructure with the NVIDIA Vera Rubin architecture, and set to launch in 2026, Doudna is tailored for real-time discovery across the U.S. Department of Energy’s most urgent scientific missions. It’s poised to catapult American researchers to the forefront of critical scientific breakthroughs, fostering innovation and securing the nation’s competitive edge in key technological fields.

“Doudna is a time machine for science — compressing years of discovery into days,” said Jensen Huang, founder and CEO of NVIDIA in a statement. “Built together with DOE and powered by NVIDIA’s Vera Rubin platform, it will let scientists delve deeper and think bigger to seek the fundamental truths of the universe.”

Designed to Accelerate Breakthroughs 

Unlike traditional systems that operate in silos, Doudna merges simulation, data and AI into a single seamless platform.

“The Doudna supercomputer is designed to accelerate a broad set of scientific workflows,” said NERSC Director Sudip Dosanjh in a statement. “Doudna will be connected to DOE experimental and observational facilities through the Energy Sciences Network (ESnet), allowing scientists to stream data seamlessly into the system from all parts of the country and to analyze it in near-real time.”

The Mayall 4-Meter Telescope, which will be home to the Dark Energy Spectroscopic Instrument (DESI), seen at night at Kitt Peak National Observatory. © The Regents of the University of California, Lawrence Berkeley National Laboratory

It’s engineered to empower over 11,000 researchers with almost instantaneous responsiveness and integrated workflows, helping scientists explore bigger questions and reach answers faster than ever.

“We’re not just building a faster computer,” said Nick Wright, advanced technologies group lead and Doudna chief architect at NERSC. “We’re building a system that helps researchers think bigger, and discover sooner.”

Here’s what Wright expects Doudna to advance:

  • Fusion energy: Breakthroughs in simulation that unlock clean fusion energy.
  • Materials science: AI models that design new classes of superconducting materials.
  • Drug discovery acceleration: Ultrarapid workflows that help biologists fold proteins fast enough to outpace a pandemic.
  • Astronomy: Real-time processing of data from the Dark Energy Spectroscopic Instrument at Kitt Peak to help scientists map the universe.

Doudna is expected to outperform its predecessor, Perlmutter, by more than 10x in scientific output, all while using just 2-3x the power.

This translates to a 3-5x increase in performance per watt, a result of innovations in chip design, dynamic load balancing and system-level efficiencies.

AI-Powered Discovery, at Scale

Doudna will power AI-driven breakthroughs across high-impact scientific fields nationwide.

Highlights include:

  • AI for protein design: David Baker, a 2024 Nobel laureate, used NERSC systems to support his work using AI to predict novel protein structures, addressing challenges across scientific disciplines.
  • AI for fundamental physics: Researchers like Benjamin Nachman are using AI to “unfold” detector distortions in particle physics data and analyze proton data from electron-proton colliders.
  • AI for materials science: A collaboration including Berkeley Lab and Meta created “Open Molecules 2025,” a massive dataset for using AI to accurately model complex molecular chemical reactions. Researchers involved also use NERSC for their AI models.

Real-Time Science, Real-World Impact 

The new system is named for Nobel laureate and CRISPR pioneer Jennifer Doudna. © The Regents of the University of California, Lawrence Berkeley National Laboratory

Doudna isn’t a standalone system. It’s an integral part of scientific workflows. DOE’s ESnet will stream data from telescopes, detectors and genome sequencers directly into the machine with low-latency, high-throughput NVIDIA Quantum-X800 InfiniBand networking.

This critical data flow is prioritized by intelligent QoS mechanisms, ensuring it stays fast and uninterrupted, from input to insight.

This will make the system incredibly responsive. At the DIII-D National Fusion Facility, for example, data will stream control-room events directly into Doudna for rapid-response plasma modeling, so scientists can make adjustments in real time.

“We used to think of the supercomputer as a passive participant in the corner,” Wright said. “Now it’s part of the entire workflow, connected to experiments, telescopes, detectors.”

The Platform for What’s Next: Unlocking Quantum and HPC Workflows

Doudna supports traditional HPC, cutting-edge AI, real-time streaming and even quantum workflows.

This includes support for scalable quantum algorithm development and the co-design of future integrated quantum-HPC systems, using platforms like NVIDIA CUDA-Q.

All of these workflows will run on the next-generation NVIDIA Vera Rubin platform, which will blend high-performance CPUs with coherent GPUs, meaning all processors can access and share data directly to support the most demanding scientific workloads.

Researchers are already porting full pipelines using frameworks like PyTorch, the NVIDIA Holoscan software development kit, NVIDIA TensorFlow, NVIDIA cuDNN and NVIDIA CUDA-Q, all optimized for the system’s Rubin GPUs and NVIDIA NVLink architecture.

Over 20 research teams are already porting full workflows to Doudna through the NERSC Science Acceleration Program, tackling everything from climate models to particle physics. This isn’t just about raw compute; it’s about discovery, integrated from idea to insight.

Designed for Urgency 

In 2024, AI-assisted science earned two Nobel Prizes. From climate research to pandemic response, the next breakthroughs won’t wait for better infrastructure.

With deployment slated for 2026, Doudna is positioned to lead a new era of accelerated science. DOE facilities across the country, from Fermilab to the Joint Genome Institute, will rely on its capabilities to turn today’s questions into tomorrow’s breakthroughs.

“This isn’t a system for one field,” Wright said. “It’s for discovery — across chemistry, physics and fields we haven’t imagined yet.”

As NVIDIA founder and CEO Jensen Huang put it, Doudna is “a time machine for science.” It compresses years of discovery into days, and gives the world’s toughest problems the power they’ve been waiting for.

Read More

Text-to-image basics with Amazon Nova Canvas

Text-to-image basics with Amazon Nova Canvas

AI image generation has emerged as one of the most transformative technologies in recent years, revolutionizing how you create and interact with visual content. Amazon Nova Canvas is a generative model in the suite of Amazon Nova creative models that enables you to generate realistic and creative images from plain text descriptions.

This post serves as a beginner’s guide to using Amazon Nova Canvas. We begin with the steps to get set up on Amazon Bedrock. Amazon Bedrock is a fully managed service that hosts leading foundation models (FMs) for various use cases such as text, code, and image generation; summarization; question answering; and custom use cases that involve fine-tuning and Retrieval Augmented Generation (RAG). In this post, we focus on the Amazon Nova image generation models available in AWS Regions in the US, in particular, the Amazon Nova Canvas model. We then provide an overview of the image generation process (diffusion) and dive deep into the input parameters for text-to-image generation with Amazon Nova Canvas.

Get started with image generation on Amazon Bedrock

Complete the following steps to get set up with access to Amazon Nova Canvas and the image playground:

  1. Create an AWS account if you don’t have one already.
  2. Open the Amazon Bedrock console as an AWS Identity and Access Management (IAM) administrator or appropriate IAM user.
  3. Confirm and choose one of the Regions where the Amazon Nova Canvas model is available (for example, US East (N. Virginia)).
  4. In the navigation pane, choose Model access under Bedrock configurations.
  5. Under What is Model access, choose Modify model access or Enable specific models (if not yet activated).
  6. Select Nova Canvas, then choose Next.
  7. On the Review and submit page, choose Submit.
  8. Refresh the Base models list. If you see the Amazon Nova Canvas model in the Access Granted status, you are ready to proceed with the next steps.
  9. In the navigation pane, choose Image / Video under Playgrounds.
  10. Choose Select model, then choose Amazon and Nova Canvas. Then choose Apply.

You are all set up to start generating images with Amazon Nova Canvas on Amazon Bedrock. The following screenshot shows an example of our playground.

Nova Canvas image generation interface showing configuration panel and two example flower vase photos

Understanding the generation process

Amazon Nova Canvas uses diffusion-based approaches to generate images:

  • Starting point – The process begins with random noise (a pure static image).
  • Iterative denoising – The model gradually removes noise in steps, guided by your prompt. The amount of noise to remove at each step is learned during training. For instance, for a model to generate an image of a cat, it has to be trained on multiple cat images into which noise is iteratively added until only pure noise remains. By learning how much noise was added at each step, the model effectively learns the reverse process: starting with a noisy image and iteratively subtracting noise to arrive at the image of a cat.
  • Text conditioning – The text prompt conditions the image generation process. The prompt is encoded as a numerical vector and referenced against a text-image embedding space that links text to images; guided by these vectors, the noisy image is progressively transformed into an image that captures the input prompt.
  • Image conditioning – In addition to text prompts, Amazon Nova Canvas also accepts images as inputs.
  • Safety and fairness – To comply with safety and fairness goals, both the prompt and the generated output image go through filters. If no filter is triggered, the final image is returned.

Prompting fundamentals

Image generation begins with effective prompting—the art of crafting text descriptions that guide the model toward your desired output. Well-constructed prompts include specific details about subject, style, lighting, perspective, mood, and composition, and work better when structured as image captions rather than a command or conversation. For example, rather than saying “generate an image of a mountain,” a more effective prompt might be “a majestic snow-capped mountain peak at sunset with dramatic lighting and wispy clouds, photorealistic style.” Refer to Amazon Nova Canvas prompting best practices for more information about prompting.

Let’s address the following prompt elements and observe their impact on the final output image:

  • Subject descriptions (what or who is in the image) – In the following example, we use the prompt “a cat sitting on a chair.”

Striped cat with bright eyes resting on wooden dining chair in warm lighting

  • Style references (photography, oil painting, 3D render) – In the following examples, we use the prompts “A cat sitting on a chair, oil painting style” and then “A cat sitting on a chair, anime style.”

Digital artwork of cat resting on wooden chair, painted in soft brushstrokes with warm golden tones

Stylized animation of tabby cat resting peacefully on wooden armchair, bathed in warm window light

  • Compositional elements and technical specifications (foreground, background, perspective, lighting) – In the following examples, we use the prompts “A cat sitting on a chair, mountains in the background,” and “A cat sitting on a chair, sunlight from the right low angle shot.”

Cat sitting on wooden chair with snow-capped mountains in background

Detailed portrait of alert tabby cat on wooden chair, backlit by golden afternoon sunlight

Positive and negative prompts

Positive prompts tell the model what to include. These are the elements, styles, and characteristics you want to observe in the final image. Avoid the use of negation words like “no,” “not,” or “without” in your prompt. Amazon Nova Canvas has been trained on image-caption pairs, and captions rarely describe what isn’t in an image. Therefore, the model has never learned the concept of negation. Instead, use negative prompts to specify elements to exclude from the output.

Negative prompts specify what to avoid. Common negative prompts include “blurry,” “distorted,” “low quality,” “poor anatomy,” “bad proportions,” “disfigured hands,” or “extra limbs,” which help models avoid typical generation artifacts.

In the following examples, we first use the prompt “An aerial view of an archipelago,” then we refine the prompt as “An aerial view of an archipelago. Negative Prompt: Beaches.”

Aerial view of tropical islands with turquoise waters and white beach

Aerial view of forested islands scattered across calm ocean waters

The balance between positive and negative prompting creates a defined creative space for the model to work within, often resulting in more predictable and desirable outputs.
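
In the API request, the negative prompt is passed through the negativeText field alongside the positive prompt in textToImageParams. The following is a minimal sketch of a request body for the archipelago example, using the same request structure shown later in this post; the parameter values here are illustrative only.

import json

#Request body sketch: a positive prompt plus a negative prompt
api_request = json.dumps({
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "An aerial view of an archipelago",  #What to include
        "negativeText": "beaches"                    #What to exclude
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,
        "height": 1024,
        "width": 1024,
        "cfgScale": 7.0,
        "seed": 0
    }
})
#Pass api_request as the body of an InvokeModel call to the amazon.nova-canvas-v1:0
#model, as shown in the full code example later in this post.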

Image dimensions and aspect ratios

Amazon Nova Canvas is trained on 1:1, portrait, and landscape resolutions, with generation tasks having a maximum output resolution of 4.19 million pixels (that is, 2048×2048 or 2816×1536). For editing tasks, the image should be no more than 4,096 pixels on its longest side, have an aspect ratio between 1:4 and 4:1, and have a total pixel count of 4.19 million or smaller. Understanding these dimensional limitations helps avoid stretched or distorted results, particularly for specialized composition needs.
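
If you compute output sizes programmatically, a small pre-check against the limits described above can catch invalid dimensions before you call the model. The following is a minimal sketch based only on the constraints stated in this section; the service may enforce additional requirements, so treat it as a convenience check rather than a definitive validator.

#Approximate limits as described above
MAX_PIXELS = 4_190_000  #About 4.19 million pixels
MAX_EDIT_SIDE = 4096    #Longest side for editing tasks

def check_generation_size(width: int, height: int) -> bool:
    """Pre-check a text-to-image output size."""
    return width * height <= MAX_PIXELS

def check_editing_size(width: int, height: int) -> bool:
    """Pre-check an input image size for editing tasks."""
    aspect_ratio = width / height
    return (
        max(width, height) <= MAX_EDIT_SIDE   #Longest side limit
        and 0.25 <= aspect_ratio <= 4.0       #Aspect ratio between 1:4 and 4:1
        and width * height <= MAX_PIXELS      #Total pixel count limit
    )

print(check_generation_size(1280, 720))   #True
print(check_editing_size(8000, 1000))     #False: longest side is too large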

Classifier-free guidance scale

The classifier-free guidance (CFG) scale controls how strictly the model follows your prompt:

  • Low values (1.1–3) – More creative freedom for the AI, potentially more aesthetic, but low contrast and less prompt-adherent results
  • Medium values (4–7) – Balanced approach, typically recommended for most generations
  • High values (8–10) – Strict prompt adherence, which can produce more precise results but sometimes at the cost of natural aesthetics and increased color saturation

In the following examples, we use the prompt “Cherry blossoms, bonsai, Japanese style landscape, high resolution, 8k, lush greens in the background.”

The first image with CFG 2 captures some elements of cherry blossoms and bonsai. The second image with CFG 8 adheres more to the prompt with a potted bonsai, more pronounced cherry blossom flowers, and lush greens in the background.

Miniature cherry blossom tree with pink blooms cascading over moss-covered rock near peaceful pond

Cherry blossom bonsai with curved trunk and pink flowers in traditional pot against green landscape

Think of the CFG scale as adjusting how literally the model takes your instructions vs. how much artistic interpretation it applies.
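
A practical way to choose a value is to sweep cfgScale while keeping the prompt and seed fixed, then compare the outputs side by side. The following is a minimal sketch of such a sweep; it assumes you have model access enabled and a bedrock-runtime client in the same Region, as described in the full code example later in this post.

import base64
import io
import json
import boto3
from PIL import Image

client = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")
prompt = "Cherry blossoms, bonsai, Japanese style landscape, high resolution, 8k, lush greens in the background"

for cfg in [2.0, 5.0, 8.0]:  #Low, medium, and high guidance values
    body = json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": prompt},
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "height": 1024,
            "width": 1024,
            "cfgScale": cfg,  #Only the guidance scale changes between runs
            "seed": 42        #Fixed seed so differences come from cfgScale alone
        }
    })
    response = client.invoke_model(body=body, modelId="amazon.nova-canvas-v1:0",
                                   accept="application/json", contentType="application/json")
    base64_image = json.loads(response.get("body").read())["images"][0]
    image = Image.open(io.BytesIO(base64.b64decode(base64_image)))
    image.save(f"bonsai_cfg_{cfg}.png")  #Compare the saved images side by side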

Seed values and reproducibility

Every image generation begins with a randomization seed—essentially a starting number that determines initial conditions:

  • Seeds are typically represented as long integers (for example, 1234567890)
  • Using the same seed, prompt, and parameters reproduces identical images every time
  • Saving seeds allows you to revisit successful generations or create variations on promising results
  • Seed values have no inherent quality; they are simply different starting points

Reproducibility through seed values is essential for professional workflows, allowing refined iterations on the prompt or other input parameters to clearly see their effect, rather than completely random generations. The following images are generated using two slightly different prompts (“A portrait of a girl smiling” vs. “A portrait of a girl laughing”), while holding the seed value and all other parameters constant.
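
To verify reproducibility in your own workflow, you can issue the same request twice with an identical seed and confirm that the returned payloads match. The following is a minimal sketch under the same setup assumptions as the full code example that follows:

import json
import boto3

client = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")

def generate_base64_image(prompt: str, seed: int) -> str:
    """Return the Base64-encoded image for a fixed prompt, seed, and parameters."""
    body = json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": prompt},
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "height": 1024,
            "width": 1024,
            "cfgScale": 7.0,
            "seed": seed
        }
    })
    response = client.invoke_model(body=body, modelId="amazon.nova-canvas-v1:0",
                                   accept="application/json", contentType="application/json")
    return json.loads(response.get("body").read())["images"][0]

first = generate_base64_image("A portrait of a girl smiling", 1234567890)
second = generate_base64_image("A portrait of a girl smiling", 1234567890)
print(first == second)  #Expected True: same seed, prompt, and parameters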

All preceding images in this post have been generated using the text-to-image (TEXT_IMAGE) task type of Amazon Nova Canvas, available through the Amazon Bedrock InvokeModel API. The following is the API request and response structure for image generation:

#Request Structure
{
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": string,         #Positive Prompt
        "negativeText": string  #Negative Prompt
    },
    "imageGenerationConfig": {
        "width": int,           #Image Resolution Width
        "height": int,          #Image Resolution Width
        "quality": "standard" | "premium",   #Image Quality
        "cfgScale": float,      #Classifer Free Guidance Scale
        "seed": int,            #Seed value
        "numberOfImages": int   #Number of images to be generated (max 5)
    }
}
#Response Structure
{
    "images": "images": string[], #list of Base64 encoded images
    "error": string
}

Code example

This solution can also be tested locally with a Python script or a Jupyter notebook. For this post, we use an Amazon SageMaker AI notebook using Python (v3.12). For more information, see Run example Amazon Bedrock API requests using an Amazon SageMaker AI notebook. For instructions to set up your SageMaker notebook instance, refer to Create an Amazon SageMaker notebook instance. Make sure the instance is set up in the same Region where Amazon Nova Canvas access is enabled. For this post, we create a Region variable to match the Region where Amazon Nova Canvas is enabled (us-east-1). You must modify this variable if you’ve enabled the model in a different Region. The following code demonstrates text-to-image generation by invoking the Amazon Nova Canvas v1.0 model using Amazon Bedrock. To understand the API request and response structure for different types of generations, parameters, and more code examples, refer to Generating images with Amazon Nova.

import base64  #For encoding/decoding base64 data
import io  #For handling byte streams
import json  #For JSON processing
import boto3  #AWS SDK for Python
from PIL import Image  #Python Imaging Library for image processing
from botocore.config import Config  #For AWS client configuration

#Create a variable to fix the region to where Nova Canvas is enabled
region = "us-east-1"

#Setup an Amazon Bedrock runtime client
client = boto3.client(service_name='bedrock-runtime', region_name=region, config=Config(read_timeout=300))

#Set the content type and accept headers for the API call
accept = "application/json"
content_type = "application/json"

#Define the prompt for image generation
prompt = """A cat sitting on a chair, mountains in the background, low angle shot."""

#Create the request body with generation parameters
api_request = json.dumps({
        "taskType": "TEXT_IMAGE",  #Specify text-to-image generation
        "textToImageParams": {
            "text": prompt  
        },
        "imageGenerationConfig": {
            "numberOfImages": 1,   #Generate one image
            "height": 720,        #Image height in pixels
            "width": 1280,         #Image width in pixels
            "cfgScale": 7.0,       #CFG Scale
            "seed": 0              #Seed number for generation
        }
})
#Call the Bedrock model to generate the image
response = client.invoke_model(body=api_request, modelId='amazon.nova-canvas-v1:0', accept=accept, 
contentType=content_type)
        
#Parse the JSON response
response_json = json.loads(response.get("body").read())

#Extract the base64-encoded image from the response
base64_image = response_json.get("images")[0]
#Convert the base64 string to ASCII bytes
base64_bytes = base64_image.encode('ascii')
#Decode the base64 bytes to get the actual image bytes
image_data = base64.b64decode(base64_bytes)

#Convert bytes to an image object
output_image = Image.open(io.BytesIO(image_data))
#Display the image
output_image.show()
#Save the image to current working directory
output_image.save('output_image.png')

Clean up

When you have finished testing this solution, clean up your resources to prevent AWS charges from being incurred:

  1. Back up the Jupyter notebooks in the SageMaker notebook instance.
  2. Shut down and delete the SageMaker notebook instance.

Cost considerations

Consider the following costs from the solution deployed on AWS:

  • You will incur charges for generative AI inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
  • You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion

This post introduced you to AI image generation, and then provided an overview of accessing image models available on Amazon Bedrock. We then walked through the diffusion process and key parameters with examples using Amazon Nova Canvas. The code template and examples demonstrated in this post aim to get you familiar with the basics of Amazon Nova Canvas and get started with your AI image generation use cases on Amazon Bedrock.

For more details on text-to-image generation and other capabilities of Amazon Nova Canvas, see Generating images with Amazon Nova. Give it a try and let us know your feedback in the comments.


About the Author

Arjun Singh is a Sr. Data Scientist at Amazon, experienced in artificial intelligence, machine learning, and business intelligence. He is a visual person and deeply curious about generative AI technologies in content creation. He collaborates with customers to build ML and AI solutions to achieve their desired outcomes. He graduated with a Master’s in Information Systems from the University of Cincinnati. Outside of work, he enjoys playing tennis, working out, and learning new skills.

Read More

Real-world applications of Amazon Nova Canvas for interior design and product photography

Real-world applications of Amazon Nova Canvas for interior design and product photography

As AI image generation becomes increasingly central to modern business workflows, organizations are seeking practical ways to implement this technology for specific industry challenges. Although the potential of AI image generation is vast, many businesses struggle to effectively apply it to their unique use cases.

In this post, we explore how Amazon Nova Canvas can solve real-world business challenges through advanced image generation techniques. We focus on two specific use cases that demonstrate the power and flexibility of this technology:

  • Interior design – Image conditioning with segmentation helps interior designers rapidly iterate through design concepts, dramatically reducing the time and cost associated with creating client presentations
  • Product photography – Outpainting enables product photographers to create diverse environmental contexts for products without extensive photo shoots

Whether you’re an interior design firm looking to streamline your visualization process or a retail business aiming to reduce photography costs, this post can help you use the advanced features of Amazon Nova Canvas to achieve your specific business objectives. Let’s dive into how these powerful tools can transform your image generation workflow.

Prerequisites

You should have the following prerequisites:

Interior design

An interior design firm has the following problem: Their designers spend hours creating photorealistic designs for client presentations, needing multiple iterations of the same room with different themes and decorative elements. Traditional 3D rendering is time-consuming and expensive. To solve this problem, you can use the image conditioning (segmentation) features of Amazon Nova Canvas to rapidly iterate on existing room photos. The condition image is analyzed to identify prominent content shapes, resulting in a segmentation mask that guides the generation. The generated image closely follows the layout of the condition image while allowing the model to have creative freedom within the bounds of each content area.

The following images show examples of the initial input, a segmentation mask based on the input, and output based on two different prompts.

Cozy living room featuring stone fireplace, mounted TV, and comfortable seating arrangement AI-generated semantic segmentation map of a living room, with objects labeled in different colors
Input image of a living room Segmentation mask of living room
Minimalist living room featuring white furniture, dark wood accents, and marble-look floors Coastal-themed living room with ocean view and beach-inspired decor
Prompt: A minimalistic living room Prompt: A coastal beach themed living room

This post demonstrates how to maintain structural integrity while transforming interior elements, so you can generate multiple variations in minutes with simple prompting and input images. The following code block presents the API request structure for image conditioning with segmentation. Parameters to perform these transformations are passed to the model through the API request. Make sure that the output image has the same dimensions as the input image to avoid distorted results.

{
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "conditionImage": string (Base64 encoded image), #Original living room
        "controlMode": "SEGMENTATION", 
        "controlStrength": float, #Specify how closely to follow the condition       #image (0.0-1.0; Default: 0.7).
        "text": string, #A minimalistic living room
        "negativeText": string
    },
    "imageGenerationConfig": {
        "width": int,
        "height": int,
        "quality": "standard" | "premium",
        "cfgScale": float,
        "seed": int,
        "numberOfImages": int
    }
}

The taskType object determines the type of operation being performed and has its own set of parameters, and the imageGenerationConfig object contains general parameters common to all task types (except background removal). To learn more about the request/response structure for different types of generations, refer to Request and response structure for image generation.

The following Python code demonstrates an image conditioning generation by invoking the Amazon Nova Canvas v1.0 model on Amazon Bedrock:

import base64  #For encoding/decoding base64 data
import io  #For handling byte streams
import json  #For JSON operations
import boto3  #AWS SDK for Python
from PIL import Image  #Python Imaging Library for image processing
from botocore.config import Config  #For AWS client configuration
#Create a variable to fix the region to where Nova Canvas is enabled 
region = "us-east-1"

#Create Bedrock client with 300 second timeout
bedrock = boto3.client(service_name='bedrock-runtime', region_name=region,
        config=Config(read_timeout=300))

#Original living room image in current working directory
input_image_path = "Original Living Room.jpg"

#Read and encode the image
def prepare_image(image_path):
    with open(image_path, 'rb') as image_file:
        image_data = image_file.read()
        base64_encoded = base64.b64encode(image_data).decode('utf-8')
    return base64_encoded

#Get the base64 encoded image
input_image = prepare_image(input_image_path)

#Set the content type and accept headers for the API call
accept = "application/json"
content_type = "application/json"

#Prepare the request body
api_request = json.dumps({
       "taskType": "TEXT_IMAGE",  #Type of generation task
       "textToImageParams": {
             "text": "A minimalistic living room",  #Prompt
             "negativeText": "bad quality, low res",  #What to avoid
             "conditionImage": input_image,  #Base64 encoded original living room
             "controlMode": "SEGMENTATION"  #Segmentation mode
            },
       "imageGenerationConfig": {
             "numberOfImages": 1,  #Generate one image
             "height": 1024,  #Image height, same as the input image
             "width": 1024,  #Image width, same as the input image
             "seed": 0, #Modify seed value to get variations on the same prompt
             "cfgScale": 7.0  #Classifier Free Guidance scale
            }
})

#Call the model to generate image
response = bedrock.invoke_model(body=api_request, modelId='amazon.nova-canvas-v1:0', accept=accept, contentType=content_type)

#Parse the response body
response_json = json.loads(response.get("body").read())

#Extract and decode the base64 image
base64_image = response_json.get("images")[0]  #Get first image
base64_bytes = base64_image.encode('ascii')  #Convert to ASCII
image_data = base64.b64decode(base64_bytes)  #Decode base64 to bytes

#Display the generated image
output_image = Image.open(io.BytesIO(image_data))
output_image.show()
#Save the image to current working directory
output_image.save('output_image.png')
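
As noted earlier, the output dimensions should match the condition image to avoid distorted results. Rather than hardcoding height and width, you can read them from the input file with PIL and pass them into imageGenerationConfig. This is a minimal sketch that assumes the condition image itself already satisfies the model’s size constraints:

from PIL import Image

input_image_path = "Original Living Room.jpg"  #Same condition image as above

#Read the condition image dimensions so the output matches them exactly
with Image.open(input_image_path) as condition_image:
    input_width, input_height = condition_image.size

#Use these values in imageGenerationConfig instead of hardcoded numbers:
#    "height": input_height,
#    "width": input_width,
print(input_width, input_height)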

Product photography

A sports footwear company has the following problem: They need to showcase their versatile new running shoes in multiple environments (running track, outdoors, and more), requiring expensive location shoots and multiple photography sessions for each variant. To solve this problem, you can use Amazon Nova Canvas to generate diverse shots from a single product photo. Outpainting can be used to replace the background of an image. You can instruct the model to preserve parts of the image by providing a mask prompt, for example, “Shoes.” A mask prompt is a natural language description of the objects in your image that should not be changed during outpainting. You can then generate the shoes in different backgrounds with new prompts.

The following images show examples of the initial input, a mask created for “Shoes,” and output based on two different prompts.

Stylized product photo of performance sneaker with contrasting navy/white upper and orange details Black silhouette of an athletic sneaker in profile view
Studio photo of running shoes Mask created for “Shoes”
Athletic running shoe with navy and orange colors on red running track Athletic shoe photographed on rocky surface with forest background
Prompt: Product photoshoot of sports shoes placed on a running track outdoor Prompt: Product photoshoot of sports shoes on rocky terrain, forest background

Instead of using a mask prompt, you can input a mask image, which defines the areas of the image to preserve. The mask image must be the same size as the input image. Areas to be edited are shaded pure white and areas to preserve are shaded pure black. Outpainting mode is a parameter to define how the mask is treated. Use DEFAULT to transition smoothly between the masked area and the non-masked area. This mode is generally better when you want the new background to use similar colors as the original background. However, you can get a halo effect if your prompt calls for a new background that is significantly different than the original background. Use PRECISE to strictly adhere to the mask boundaries. This mode is generally better when you’re making significant changes to the background.
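
If you choose a mask image over a mask prompt, you can construct one programmatically: start from a pure white canvas the same size as the input (everything editable) and paint the region to preserve in pure black. The following is a minimal sketch using PIL; the rectangle coordinates are placeholder values that you would replace with the actual bounding box of the shoes.

import base64
from PIL import Image, ImageDraw

input_image_path = "Shoes.png"  #Same studio image used in the code that follows

#Create a white canvas with the same dimensions as the input (white areas are edited)
with Image.open(input_image_path) as source:
    mask = Image.new("RGB", source.size, "white")

#Paint the region to preserve in pure black (placeholder bounding box around the shoes)
draw = ImageDraw.Draw(mask)
draw.rectangle([300, 400, 900, 800], fill="black")
mask.save("shoes_mask.png")

#Base64 encode the mask for the "maskImage" field of the outpainting request
with open("shoes_mask.png", "rb") as mask_file:
    mask_image = base64.b64encode(mask_file.read()).decode("utf-8")
#Pass mask_image as "maskImage" (and omit "maskPrompt") in outPaintingParams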

This post demonstrates how to use outpainting to capture product accuracy, and then turn one studio photo into different environments seamlessly. The following code illustrates the API request structure for outpainting:

{
    "taskType": "OUTPAINTING",
    "outPaintingParams": {
        "image": string (Base64 encoded image),
        "maskPrompt": string, #Shoes
        "maskImage": string, #Base64 encoded image
        "outPaintingMode": "DEFAULT" | "PRECISE", 
        "text": string,  #Product photoshoot of sports shoes on rocky terrain
        "negativeText": string
    },
    "imageGenerationConfig": {
        "numberOfImages": int,
        "quality": "standard" | "premium",
        "cfgScale": float,
        "seed": int
    }
}

The following Python code demonstrates an outpainting-based background replacement by invoking the Amazon Nova Canvas v1.0 model on Amazon Bedrock. For more code examples, see Code examples.

import base64  #For encoding/decoding base64 data
import io  #For handling byte streams
import json  #For JSON operations
import boto3  #AWS SDK for Python
from PIL import Image  #Python Imaging Library for image processing
from botocore.config import Config  #For AWS client configuration
#Create a variable to fix the region to where Nova Canvas is enabled 
region = "us-east-1"

#Create Bedrock client with 300 second timeout
bedrock = boto3.client(service_name='bedrock-runtime', region_name=region,
        config=Config(read_timeout=300))

#Original studio image of shoes in current working directory
input_image_path = "Shoes.png"

#Read and encode the image
def prepare_image(image_path):
    with open(image_path, 'rb') as image_file:
        image_data = image_file.read()
        base64_encoded = base64.b64encode(image_data).decode('utf-8')
    return base64_encoded

#Get the base64 encoded image
input_image = prepare_image(input_image_path)

#Set the content type and accept headers for the API call
accept = "application/json"
content_type = "application/json"

#Prepare the request body
api_request = json.dumps({
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
             "image": input_image,
             "maskPrompt": "Shoes", 
             "outPaintingMode": "DEFAULT", 
             "text": "Product photoshoot of sports shoes placed on a running track outdoor",
             "negativeText": "bad quality, low res"
            },
        "imageGenerationConfig": {
             "numberOfImages": 1,
             "seed": 0, #Modify seed value to get variations on the same prompt
             "cfgScale": 7.0
            }
})

#Call the model to generate image
response = bedrock.invoke_model(body=api_request, modelId='amazon.nova-canvas-v1:0', accept=accept, contentType=content_type)

#Parse the response body
response_json = json.loads(response.get("body").read())

#Extract and decode the base64 image
base64_image = response_json.get("images")[0]  #Get first image
base64_bytes = base64_image.encode('ascii')  #Convert to ASCII
image_data = base64.b64decode(base64_bytes)  #Decode base64 to bytes

#Display the generated image
output_image = Image.open(io.BytesIO(image_data))
output_image.show()
#Save the image to current working directory
output_image.save('output_image.png')

Clean up

When you have finished testing this solution, clean up your resources to prevent AWS charges from being incurred:

  1. Back up the Jupyter notebooks in the SageMaker notebook instance.
  2. Shut down and delete the SageMaker notebook instance.

Cost considerations

Consider the following costs from the solution deployed on AWS:

  • You will incur charges for generative AI inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
  • You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion

In this post, we explored practical implementations of Amazon Nova Canvas for two high-impact business scenarios. You can now generate multiple design variations or diverse environments in minutes rather than hours. With Amazon Nova Canvas, you can significantly reduce costs associated with traditional visual content creation. Refer to Generating images with Amazon Nova to learn about the other capabilities supported by Amazon Nova Canvas.

As next steps, begin with a single use case that closely matches your business needs. Use our provided code examples as a foundation and adapt them to your specific requirements. After you’re familiar with the basic implementations, explore combining multiple techniques and scale gradually. Don’t forget to track time savings and cost reductions to measure ROI. Contact your AWS account team for enterprise implementation guidance.


About the Author

Arjun Singh is a Sr. Data Scientist at Amazon, experienced in artificial intelligence, machine learning, and business intelligence. He is a visual person and deeply curious about generative AI technologies in content creation. He collaborates with customers to build ML/AI solutions to achieve their desired outcomes. He graduated with a Master’s in Information Systems from the University of Cincinnati. Outside of work, he enjoys playing tennis, working out, and learning new skills.

Read More

Independent evaluations demonstrate Nova Premier’s safety

Independent evaluations demonstrate Nova Premier’s safety



In both black-box stress testing and red-team exercises, Nova Premier comes out on top.

Conversational AI

May 29, 03:22 PM

AI safety is a priority at Amazon. Our investment in safe, transparent, and responsible AI (RAI) includes collaboration with the global community and policymakers. We are members of and collaborate with organizations such as the Frontier Model Forum, the Partnership on AI, and other forums organized by government agencies such as the National Institute of Standards and Technology (NIST). Consistent with Amazon’s endorsement of the Korea Frontier AI Safety Commitments, we published our Frontier Model Safety Framework earlier this year.

Amazon Nova Premier’s guardrails help prevent generation of unsafe content.

During the development of the Nova Premier model, we conducted a comprehensive evaluation to assess its performance and safety. This included testing on both internal and public benchmarks and on internal automated and third-party red-teaming exercises. Once the final model was ready, we prioritized obtaining unbiased, third-party evaluations of the model’s robustness against RAI controls. In this post, we outline the key findings from these evaluations, demonstrating the strength of our testing approach and Amazon Nova Premier’s standing as a safe model. Specifically, we cover our evaluations with two third-party evaluators: PRISM AI and ActiveFence.

Evaluation of Nova Premier against PRISM AI

PRISM Eval’s Behavior Elicitation Tool (BET) dynamically and systematically stress-tests AI models’ safety guardrails. The methodology focuses on measuring how many adversarial attempts (steps) it takes to get a model to generate harmful content across several key risk dimensions. The central metric is “steps to elicit”: the number of increasingly sophisticated prompting attempts required before a model generates an inappropriate response. A higher number of steps indicates stronger safety measures, as the model is more resistant to manipulation. The PRISM risk dimensions (inspired by the MLCommons AI Safety Benchmarks) include CBRNE weapons, violent crimes, non-violent crimes, defamation, and hate, among several others.

Using the BET Eval tool and its V1.0 metric, which is tailored toward non-reasoning models, we compared the recently released Nova models (Pro and Premier) to the latest models in the same class: Claude (3.5 v2 and 3.7 non-reasoning) and Llama 4 Maverick, all available through Amazon Bedrock. PRISM BET conducts black-box evaluations (where model developers don’t have access to the test prompts) of models integrated with their API. The evaluation conducted with BET Eval MAX, PRISM’s most comprehensive and aggressive testing suite, revealed significant variations in safety against malicious instructions. Nova models demonstrated superior overall safety performance, with an average of 43 steps for Premier and 52 steps for Pro, compared to 37.7 for Claude 3.5 v2 and fewer than 12 steps for the other models in the comparison set (namely, 9.9 for Claude 3.7, 11.5 for Claude 3.7 thinking, and 6.5 for Maverick). This higher step count suggests that, on average, Nova’s safety guardrails are more sophisticated and harder to circumvent through adversarial prompting. The figure below presents the number of steps per harm category evaluated through BET Eval MAX.

Results of tests using PRISM’s BET Eval MAX testing suite.

The PRISM evaluation provides valuable insights into the relative safety of different Amazon Bedrock models. Nova’s strong performance, particularly in hate speech and defamation resistance, represents meaningful progress in AI safety. However, the results also highlight the ongoing challenge of building truly robust safety measures into AI systems. As the field continues to evolve, frameworks like BET will play an increasingly important role in benchmarking and improving AI safety. As a part of this collaboration, Nicolas Miailhe, CEO of PRISM Eval, said, “It’s incredibly rewarding for us to see Nova outperforming strong baselines using the BET Eval MAX; our aim is to build a long-term partnership toward safer-by-design models and to make BET available to various model providers.” Organizations deploying AI systems should carefully consider these safety metrics when selecting models for their applications.

Manual red teaming with ActiveFence

The AI safety & security company ActiveFence benchmarked Nova Premier on Bedrock on prompts distributed across Amazons eight core RAI categories. ActiveFence also evaluated Claude 3.7 (non-reasoning mode) and GPT 4.1 API on the same set. The flag rate on Nova Premier was lower than that on the other two models, indicating that Nova Premier is the safest of the three.

Model                        3P Flag Rate [lower is better]
Nova Premier                 12.0%
Sonnet 3.7 (non-reasoning)   20.6%
GPT-4.1 API                  22.4%

“Our role is to think like an adversary but act in service of safety,” said Guy Paltieli from ActiveFence. “By conducting a blind stress test of Nova Premier under realistic threat scenarios, we helped evaluate its security posture in support of Amazon’s broader responsible-AI goals, ensuring the model could be deployed with greater confidence.”

These evaluations, conducted with PRISM and ActiveFence, give us confidence in the strength of our guardrails and our ability to protect our customers’ safety when they use our models. While these evaluations demonstrate strong safety performance, we recognize that AI safety is an ongoing challenge requiring continuous improvement. These assessments represent a point-in-time snapshot, and we remain committed to regular testing and enhancement of our safety measures. No AI system can guarantee perfect safety in all scenarios, which is why we maintain monitoring and response systems after deployment.

Acknowledgments: Vincent Ponzo, Elyssa Vincent

Research areas: Conversational AI

Tags: Responsible AI

Read More

What AI’s impact on individuals means for the health workforce and industry

What AI’s impact on individuals means for the health workforce and industry

Illustrated headshots of Azeem Azhar, Peter Lee, and Ethan Mollick.

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

In this episode, Ethan Mollick and Azeem Azhar, thought leaders at the forefront of AI’s impact on work, education, and society, join Lee to discuss how generative AI is reshaping healthcare and organizational systems. Mollick, professor at the Wharton School, discusses the conflicting emotions that come with navigating AI’s effect on the tasks we enjoy and those we don’t; the systemic challenges in AI adoption; and the need for organizations to actively experiment with AI rather than wait for top-down solutions. Azhar, a technology analyst and writer who explores the intersection of AI, economics, and society, explores how generative AI is transforming healthcare through applications like medical scribing, clinician support, and consumer health monitoring.

Transcript

[MUSIC]   

[BOOK PASSAGE] 

PETER LEE: “In American primary care, the missing workforce is stunning in magnitude, the shortfall estimated to reach up to 48,000 doctors within the next dozen years. China and other countries with aging populations can expect drastic shortfalls, as well. Just last month, I asked a respected colleague retiring from primary care who he would recommend as a replacement; he told me bluntly that, other than expensive concierge care practices, he could not think of anyone, even for himself. This mismatch between need and supply will only grow, and the US is far from alone among developed countries in facing it.”

[END OF BOOK PASSAGE]   

[THEME MUSIC]   

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.   

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?    

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.     

[THEME MUSIC FADES]


The book passage I read at the top is from “Chapter 4: Trust but Verify,” which was written by Zak.

You know, it’s no secret that in the US and elsewhere shortages in medical staff and the rise of clinician burnout are affecting the quality of patient care for the worse. In our book, we predicted that generative AI would be something that might help address these issues.

So in this episode, we’ll delve into how individual performance gains that our previous guests have described might affect the healthcare workforce as a whole, and on the patient side, we’ll look into the influence of generative AI on the consumerization of healthcare. Now, since all of this consumes such a huge fraction of the overall economy, we’ll also get into what a general-purpose technology as disruptive as generative AI might mean in the context of labor markets and beyond.  

To help us do that, I’m pleased to welcome Ethan Mollick and Azeem Azhar.

Ethan Mollick is the Ralph J. Roberts Distinguished Faculty Scholar, a Rowan Fellow, and an associate professor at the Wharton School of the University of Pennsylvania. His research into the effects of AI on work, entrepreneurship, and education is applied by organizations around the world, leading him to be named one of Time magazine’s most influential people in AI for 2024. He’s also the author of the New York Times best-selling book Co-Intelligence.

Azeem Azhar is an author, founder, investor, and one of the most thoughtful and influential voices on the interplay between disruptive emerging technologies and business and society. In his best-selling book, The Exponential Age, and in his highly regarded newsletter and podcast, Exponential View, he explores how technologies like AI are reshaping everything from healthcare to geopolitics.

Ethan and Azeem are two leading thinkers on the ways that disruptive technologies—and especially AI—affect our work, our jobs, our business enterprises, and whole industries. As economists, they are trying to work out whether we are in the midst of an economic revolution as profound as the shift from an agrarian to an industrial society.

[TRANSITION MUSIC]

Here is my interview with Ethan Mollick:

LEE: Ethan, welcome.

ETHAN MOLLICK: So happy to be here, thank you.

LEE: I described you as a professor at Wharton, which I think most of the people who listen to this podcast series know of as an elite business school. So it might surprise some people that you study AI. And beyond that, you know, that I would seek you out to talk about AI in medicine. [LAUGHTER] So to get started, how and why did it happen that you’ve become one of the leading experts on AI?

MOLLICK: It’s actually an interesting story. I’ve been AI-adjacent my whole career. When I was [getting] my PhD at MIT, I worked with Marvin Minsky and the MIT [Massachusetts Institute of Technology] Media Lab’s AI group. But I was never the technical AI guy. I was the person who was trying to explain AI to everybody else who didn’t understand it.

And then I became very interested in, how do you train and teach? And AI was always a part of that. I was building games for teaching, teaching tools that were used in hospitals and elsewhere, simulations. So when LLMs burst into the scene, I had already been using them and had a good sense of what they could do. And between that and, kind of, being practically oriented and getting some of the first research projects underway, especially under education and AI and performance, I became sort of a go-to person in the field.

And once you’re in a field where nobody knows what’s going on and we’re all making it up as we go along—I thought it’s funny that you led with the idea that you have a couple of months head start for GPT-4, right. Like that’s all we have at this point, is a few months’ head start. [LAUGHTER] So being a few months ahead is good enough to be an expert at this point. Whether it should be or not is a different question.

LEE: Well, if I understand correctly, leading AI companies like OpenAI, Anthropic, and others have now sought you out as someone who should get early access to really start to do early assessments and gauge early reactions. How has that been?

MOLLICK: So, I mean, I think the bigger picture is less about me than about two things that tells us about the state of AI right now.

One, nobody really knows what’s going on, right. So in a lot of ways, if it wasn’t for your work, Peter, like, I don’t think people would be thinking about medicine as much because these systems weren’t built for medicine. They weren’t built to change education. They weren’t built to write memos. They, like, they weren’t built to do any of these things. They weren’t really built to do anything in particular. It turns out they’re just good at many things.

And to the extent that the labs work on them, they care about their coding ability above everything else and maybe math and science secondarily. They don’t think about the fact that it expresses high empathy. They don’t think about its accuracy and diagnosis or where it’s inaccurate. They don’t think about how it’s changing education forever.

So one part of this is the fact that they go to my Twitter feed or ask me for advice is an indicator of where they are, too, which is they’re not thinking about this. And the fact that a few months’ head start continues to give you a lead tells you that we are at the very cutting edge. These labs aren’t sitting on projects for two years and then releasing them. Months after a project is complete or sooner, it’s out the door. Like, there’s very little delay. So we’re kind of all in the same boat here, which is a very unusual space for a new technology.

LEE: And I, you know, explained that you’re at Wharton. Are you an odd fit as a faculty member at Wharton, or is this a trend now even in business schools that AI experts are becoming key members of the faculty?

MOLLICK: I mean, it’s a little of both, right. It’s faculty, so everybody does everything. I’m a professor of innovation-entrepreneurship. I’ve launched startups before and working on that and education means I think about, how do organizations redesign themselves? How do they take advantage of these kinds of problems? So medicine’s always been very central to that, right. A lot of people in my MBA class have been MDs either switching, you know, careers or else looking to advance from being sort of individual contributors to running teams. So I don’t think that’s that bad a fit. But I also think this is general-purpose technology; it’s going to touch everything. The focus on this is medicine, but Microsoft does far more than medicine, right. It’s … there’s transformation happening in literally every field, in every country. This is a widespread effect.

So I don’t think we should be surprised that business schools matter on this because we care about management. There’s a long tradition of management and medicine going together. There’s actually a great academic paper that shows that teaching hospitals that also have MBA programs associated with them have higher management scores and perform better. So I think that these are not as foreign concepts, especially as medicine continues to get more complicated.

LEE: Yeah. Well, in fact, I want to dive a little deeper on these issues of management, of entrepreneurship, um, education. But before doing that, if I could just stay focused on you. There is always something interesting to hear from people about their first encounters with AI. And throughout this entire series, I’ve been doing that both pre-generative AI and post-generative AI. So you, sort of, hinted at the pre-generative AI. You were in Minsky’s lab. Can you say a little bit more about that early encounter? And then tell us about your first encounters with generative AI.

MOLLICK: Yeah. Those are great questions. So first of all, when I was at the media lab, that was pre-the current boom in sort of, you know, even in the old-school machine learning kind of space. So there was a lot of potential directions to head in. While I was there, there were projects underway, for example, to record every interaction small children had. One of the professors was recording everything their baby interacted with in the hope that maybe that would give them a hint about how to build an AI system.

There was a bunch of projects underway that were about labeling every concept and how they relate to other concepts. So, like, it was very much Wild West of, like, how do we make an AI work—which has been this repeated problem in AI, which is, what is this thing?

The fact that it was just like brute force over the corpus of all human knowledge turns out to be a little bit of like a, you know, it’s a miracle and a little bit of a disappointment in some ways [LAUGHTER] compared to how elaborate some of this was. So, you know, I think that, that was sort of my first encounters in sort of the intellectual way.

The generative AI encounters actually started with the original, sort of, GPT-3, or, you know, earlier versions. And it was actually game-based. So I played games like AI Dungeon. And as an educator, I realized, oh my gosh, this stuff could write essays at a fourth-grade level. That’s really going to change the way, like, middle school works, was my thinking at the time. And I was posting about that back in, you know, 2021 that this is a big deal. But I think everybody was taken by surprise, including the AI companies themselves, by, you know, ChatGPT, by GPT-3.5. The difference in degree turned out to be a difference in kind.

LEE: Yeah, you know, if I think back, even with GPT-3, and certainly this was the case with GPT-2, it was, at least, you know, from where I was sitting, it was hard to get people to really take this seriously and pay attention.

MOLLICK: Yes.

LEE: You know, it’s remarkable. Within Microsoft, I think a turning point was the use of GPT-3 to do code completions. And that was actually productized as GitHub Copilot, the very first version. That, I think, is where there was widespread belief. But, you know, in a way, I think there is, even for me early on, a sense of denial and skepticism. Did you have those initially at any point?

MOLLICK: Yeah, I mean, it still happens today, right. Like, this is a weird technology. You know, the original denial and skepticism was, I couldn’t see where this was going. It didn’t seem like a miracle because, you know, of course computers can complete code for you. Like, what else are they supposed to do? Of course, computers can give you answers to questions and write fun things. So there’s a difference in moving into a world of generative AI. I think a lot of people just thought that’s what computers could do. So it made the conversations a little weird. But even today, faced with these, you know, with very strong reasoner models that operate at the level of PhD students, I think a lot of people have issues with it, right.

I mean, first of all, they seem intuitive to use, but they’re not always intuitive to use because the first use case that everyone puts AI to, it fails at because they use it like Google or some other use case. And then it’s genuinely upsetting in a lot of ways. I think, you know, I write in my book about the idea of three sleepless nights. That hasn’t changed. Like, you have to have an intellectual crisis to some extent, you know, and I think people do a lot to avoid having that existential angst of like, “Oh my god, what does it mean that a machine could think—apparently think—like a person?”

So, I mean, I see resistance now. I saw resistance then. And then on top of all of that, there’s the fact that the curve of the technology is quite great. I mean, the price of GPT-4 level intelligence from, you know, when it was released has dropped 99.97% at this point, right.

LEE: Yes. Mm-hmm.

MOLLICK: I mean, I could run a GPT-4 class system basically on my phone. Microsoft’s releasing things that can almost run on like, you know, like it fits in almost no space, that are almost as good as the original GPT-4 models. I mean, I don’t think people have a sense of how fast the trajectory is moving either.

LEE: Yeah, you know, there’s something that I think about often. There is this existential dread, or will this technology replace me? But I think the first people to feel that are researchers—people encountering this for the first time. You know, if you were working, let’s say, in Bayesian reasoning or in traditional, let’s say, Gaussian mixture model based, you know, speech recognition, you do get this feeling, Oh, my god, this technology has just solved the problem that I’ve dedicated my life to. And there is this really difficult period where you have to cope with that. And I think this is going to be spreading, you know, in more and more walks of life. And so this … at what point does that sort of sense of dread hit you, if ever?

MOLLICK: I mean, you know, it’s not even dread as much as like, you know, Tyler Cowen wrote that it’s impossible to not feel a little bit of sadness as you use these AI systems, too. Because, like, I was talking to a friend, just as the most minor example, and his talent that he was very proud of was he was very good at writing limericks for birthday cards. He’d write these limericks. Everyone was always amused by them. [LAUGHTER]

And now, you know, GPT-4 and GPT-4.5, they made limericks obsolete. Like, anyone can write a good limerick, right. So this was a talent, and it was a little sad. Like, this thing that you cared about doesn’t really matter anymore.

You know, as academics, we’re a little used to dead ends, right, and, like, you know, sometimes getting lapped. But now entire fields are hitting that. Like in medicine, there’s a lot of support systems that are now obsolete. And the question is how quickly you change that. In education, a lot of our techniques are obsolete.

What do you do to change that? You know, it’s like the fact that this brute force technology is good enough to solve so many problems is weird, right. And it’s not just the end of, you know, of our research angles that matter, too. Like, for example, I ran this, you know, 14-person-plus, multimillion-dollar effort at Wharton to build these teaching simulations, and we’re very proud of them. It took years of work to build one.

Now we’ve built a system that can build teaching simulations on demand by you talking to it with one team member. And, you know, you literally can create any simulation by having a discussion with the AI. I mean, you know, there’s a switch to a new form of excitement, but there is a little bit of like, this mattered to me, and, you know, now I have to change how I do things. I mean, adjustment happens. But if you haven’t had that displacement, I think that’s a good indicator that you haven’t really faced AI yet.

LEE: Yeah, what’s so interesting just listening to you is you use words like sadness, and yet I can see the—and hear the—excitement in your voice and your body language. So, you know, that’s also kind of an interesting aspect of all of this. 

MOLLICK: Yeah, I mean, I think there’s something on the other side, right. But, like, I can’t say that I haven’t had moments where like, ughhhh, but then there’s joy and basically like also, you know, freeing stuff up. I mean, I think about doctors or professors, right. These are jobs that bundle together lots of different tasks that you would never have put together, right. If you’re a doctor, you would never have expected the same person to be good at keeping up with the research and being a good diagnostician and being a good manager and being good with people and being good with hand skills.

Like, who would ever want that kind of bundle? That’s not something you’re all good at, right. And a lot of the stress of our job comes from the fact that we suck at some of it. And so to the extent that AI steps in for that, you kind of feel bad about some of the stuff that it’s doing that you wanted to do. But it’s much more uplifting to be like, I don’t have to do this stuff I’m bad at anymore, or I get the support to make myself good at it. And the stuff that I really care about, I can focus on more. Well, because we are at kind of a unique moment where whatever you’re best at, you’re still better than AI. And I think it’s an ongoing question about how long that lasts. But for right now, like you’re not going to say, OK, AI replaces me entirely in my job in medicine. It’s very unlikely.

But you will say it replaces these 17 things I’m bad at, but I never liked that anyway. So it’s a period of both excitement and a little anxiety.

LEE: Yeah, I’m going to want to get back to this question about in what ways AI may or may not replace doctors or some of what doctors and nurses and other clinicians do. But before that, let’s get into, I think, the real meat of this conversation. In previous episodes of this podcast, we talked to clinicians and healthcare administrators and technology developers that are very rapidly injecting AI today to do various forms of workforce automation, you know, automatically writing a clinical encounter note, automatically filling out a referral letter or request for prior authorization for some reimbursement to an insurance company.

And so these sorts of things are intended not only to make things more efficient and lower costs but also to reduce various forms of drudgery, cognitive burden on frontline health workers. So how do you think about the impact of AI on that aspect of workforce, and, you know, what would you expect will happen over the next few years in terms of impact on efficiency and costs?

MOLLICK: So I mean, this is a case where I think we’re facing the big bright problem in AI in a lot of ways, which is that at the individual level, there are lots of performance gains to be had, right. The problem, though, is that we as individuals fit into systems, in medicine as much as anywhere else or more so, right. Which is that you could individually boost your performance, but it’s also about the systems that fit along with this, right.

So, you know, if you could automatically, you know, record an encounter, if you could automatically make notes, does that change what you should be expecting for notes or the value of those notes or what they’re for? How do we take what one person does and validate it across the organization and roll it out for everybody without making it a 10-year process that it feels like IT in medicine often is? Like, so we’re in this really interesting period where there’s incredible amounts of individual innovation in productivity and performance improvements in this field, like very high levels of it, but not necessarily seeing that same thing translate to organizational efficiency or gains.

And one of my big concerns is seeing that happen. We’re seeing that in nonmedical problems, the same kind of thing, which is, you know, we’ve got research showing 20 to 40% performance improvements, like not uncommon to see those things. But then the organization doesn’t capture it; the system doesn’t capture it. Because the individuals are doing their own work and the systems don’t have the ability to, kind of, learn or adapt as a result.

LEE: You know, where are those productivity gains going, then, when you get to the organizational level?

MOLLICK: Well, they’re dying for a few reasons. One is, there’s a tendency for individual contributors to underestimate the power of management, right.

Practices associated with good management increase happiness, decrease, you know, issues, increase success rates. In the same way, about 40%, as far as we can tell, of the advantage that US firms have over firms in other countries has to do with management ability. Like, management is a big deal. Organizing is a big deal. Thinking about how you coordinate is a big deal.

When things get stuck at the individual level, right, you can’t start bringing them up to the level of how systems work together. It becomes, How do I deal with a doctor that has a 60% performance improvement? We really only have one thing in our playbook for doing that right now, which is, OK, we could fire 40% of the other doctors and still have a performance gain, which is not the answer you want to see happen.

So because of that, people are hiding their use. They’re actually hiding their use for lots of reasons.

And it’s a weird case because the people who are able to figure out best how to use these systems, for a lot of use cases, they’re actually clinicians themselves because they’re experimenting all the time. Like, they have to take those encounter notes. And if they figure out a better way to do it, they figure that out. You don’t want to wait for, you know, a med tech company to figure that out and then sell that back to you when it can be done by the physicians themselves.

So we’re just not used to a period where everybody’s innovating and where the management structure isn’t in place to take advantage of that. And so we’re seeing things stalled at the individual level, and people are often, especially in risk-averse organizations or organizations where there’s lots of regulatory hurdles, people are so afraid of the regulatory piece that they don’t even bother trying to make change.

LEE: If you are, you know, the leader of a hospital or a clinic or a whole health system, how should you approach this? You know, how should you be trying to extract positive success out of AI?

MOLLICK: So I think that you need to embrace the right kind of risk, right. We don’t want to put risk on our patients … like, we don’t want to expose them to uninformed risk. But innovation involves risk to how organizations operate. It involves change. So I think part of this is embracing the idea that R&D has to happen in organizations again.

What’s happened over the last 20 years or so has been organizations giving that up. Partially, that’s a trend to focus on what you’re good at and not try and do this other stuff. Partially, it’s because it’s outsourced now to software companies, where, like, Salesforce tells you how to organize your sales team. Workday tells you how to organize your organization. Consultants come in and will tell you how to make change based on the average of what other people are doing in your field.

So companies and organizations and hospital systems have all started to give up their ability to create their own organizational change. And when I talk to organizations, I often say they have to have two approaches. They have to think about the crowd and the lab.

So the crowd is the idea of how to empower clinicians and administrators and supporter networks to start using AI and experimenting in ethical, legal ways and then sharing that information with each other. And the lab is, how are we doing R&D about the approach of how to [get] AI to work, not just in direct patient care, right. But also fundamentally, like, what paperwork can you cut out? How can we better explain procedures? Like, what management role can this fill?

And we need to be doing active experimentation on that. We can’t just wait for, you know, Microsoft to solve the problems. It has to be at the level of the organizations themselves.

LEE: So let’s shift a little bit to the patient. You know, one of the things that we see, and I think everyone is seeing, is that people are turning to chatbots, like ChatGPT, actually to seek healthcare information for, you know, their own health or the health of their loved ones.

And there was already, prior to all of this, a trend towards, let’s call it, consumerization of healthcare. So just in the business of healthcare delivery, do you think AI is going to hasten these kinds of trends, or from the consumer’s perspective, what … ?

MOLLICK: I mean, absolutely, right. Like, all the early data that we have suggests that for most common medical problems, you should just consult AI, too, right. In fact, there is a real question to ask: at what point does it become unethical for doctors themselves to not ask for a second opinion from the AI because it’s cheap, right? You could overrule it or whatever you want, but like not asking seems foolish.

I think the two places where there’s a burning almost, you know, moral imperative is … let’s say, you know, I’m in Philadelphia, I’m a professor, I have access to really good healthcare through the Hospital of the University of Pennsylvania system. I know doctors. You know, I’m lucky. I’m well connected. If, you know, something goes wrong, I have friends who I can talk to. I have specialists. I’m, you know, pretty well educated in this space.

But for most people on the planet, they don’t have access to good medical care, they don’t have good health. It feels like it’s absolutely imperative to say when should you use AI and when not. Are there blind spots? What are those things?

And I worry that, like, to me, that would be the crash project I’d be invoking because I’m doing the same thing in education, which is this system is not as good as being in a room with a great teacher who also uses AI to help you, but it’s better than, you know, the level of education people actually get in many cases. Where should we be using it? How do we guide usage in the right way? Because the AI labs aren’t thinking about this. We have to.

So, to me, there is a burning need here to understand this. And I worry that people will say, you know, everything that’s true—AI can hallucinate, AI can be biased. All of these things are absolutely true, but people are going to use it. The early indications are that it is quite useful. And unless we take the active role of saying, here’s when to use it, here’s when not to use it, we don’t have a right to say, don’t use this system. And I think, you know, we have to be exploring that.

LEE: What do people need to understand about AI? And what should schools, universities, and so on be teaching?

MOLLICK: Those are, kind of, two separate questions in a lot of ways. I think a lot of people want to teach AI skills, and I will tell you, as somebody who works in this space a lot, there isn’t, like, an easy, sort of, AI skill, right. I could teach you prompt engineering in two to three classes, but every indication we have is that for most people under most circumstances, the value of prompting in, you know, any one case is probably not that high.

A lot of the tricks are disappearing because the AI systems are just starting to use them themselves. So asking good questions, being a good manager, being a good thinker tend to be important, but like magic tricks around making, you know, the AI do something because you use the right phrase used to be something that was real but is rapidly disappearing.

So I worry when people say teach AI skills. No one’s been able to articulate to me as somebody who knows AI very well and teaches classes on AI, what those AI skills that everyone should learn are, right.

I mean, there’s value in learning a little bit how the models work. There’s a value in working with these systems. A lot of it’s just hands on keyboard kind of work. But, like, we don’t have an easy slam dunk “this is what you learn in the world of AI” because the systems are getting better, and as they get better, they get less sensitive to these prompting techniques. They get better prompting themselves. They solve problems spontaneously and start being agentic. So it’s a hard problem to ask about, like, what do you train someone on? I think getting people experience in hands-on-keyboards, getting them to … there’s like four things I could teach you about AI, and two of them are already starting to disappear.

But, like, one is be direct. Like, tell the AI exactly what you want. That’s very helpful. Second, provide as much context as possible. That can include things like acting as a doctor, but also all the information you have. The third is give it step-by-step directions—that’s becoming less important. And the fourth is good and bad examples of the kind of output you want. Those four, that’s like, that’s it as far as the research telling you what to do, and the rest is building intuition.
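As a concrete illustration of those four principles, here is a minimal sketch of how they might be assembled into a single prompt. This is an editorial illustration rather than a method from the conversation: the build_prompt helper and the clinical-note task are hypothetical, and the same structure works with any chat-based model.

```python
# A minimal sketch of the four prompting principles above:
# (1) be direct, (2) provide as much context as possible,
# (3) give step-by-step directions, (4) show good and bad examples
# of the output you want. The helper and the clinical-note task
# below are hypothetical illustrations.

def build_prompt(task, context, steps, good_example, bad_example):
    """Assemble one prompt string from the four elements."""
    step_text = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    return (
        f"Task (be direct): {task}\n\n"
        f"Context: {context}\n\n"
        f"Follow these steps:\n{step_text}\n\n"
        f"Example of good output:\n{good_example}\n\n"
        f"Example of output to avoid:\n{bad_example}\n"
    )

prompt = build_prompt(
    task="Summarize this clinical encounter note for the patient in plain language.",
    context="You are assisting a primary-care physician. The patient reads at an 8th-grade level.",
    steps=[
        "List the diagnoses mentioned in the note.",
        "Explain each diagnosis in one sentence.",
        "End with the follow-up actions the patient should take.",
    ],
    good_example="You were seen for high blood pressure. Your doctor adjusted your medication ...",
    bad_example="Pt presents w/ HTN, uncontrolled; titrate lisinopril ...",
)
print(prompt)
```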

LEE: I’m really impressed that you didn’t give the answer, “Well, everyone should be teaching my book, Co-Intelligence.” [LAUGHS]

MOLLICK: Oh, no, sorry! Everybody should be teaching my book Co-Intelligence. I apologize. [LAUGHTER]

LEE: It’s good to chuckle about that, but actually, I can’t think of a better book, like, if you were to assign a textbook in any professional education space, I think Co-Intelligence would be number one on my list. Are there other things that you think are essential reading?

MOLLICK: That’s a really good question. I think that a lot of things are evolving very quickly. I happen to, kind of, hit a sweet spot with Co-Intelligence to some degree because I talk about how I used it, and I was, sort of, an advanced user of these systems.

So, like, it’s, sort of, like my Twitter feed, my online newsletter. I’m just trying to, kind of, in some ways, it’s about trying to make people aware of what these systems can do by just showing a lot, right. Rather than picking one thing, and, like, this is a general-purpose technology. Let’s use it for this. And, like, everybody gets a light bulb for a different reason. So more than reading, it is using, you know, and that can be Copilot or whatever your favorite tool is.

But using it. Voice modes help a lot. In terms of readings, I mean, I think that there are a couple of good guides to understanding AI that were originally blog posts. I think Tim Lee has one called Understanding AI, and it had a good overview …

LEE: Yeah, that’s a great one.

MOLLICK: … of that topic that I think explains how transformers work, which can give you some mental sense. I think [Andrej] Karpathy has some really nice videos of use that I would recommend.

Like on the medical side, I think the book that you did, if you’re in medicine, you should read that. I think that that’s very valuable. But like all we can offer are hints in some ways. Like there isn’t … if you’re looking for the instruction manual, I think it can be very frustrating because it’s like you want the best practices and procedures laid out, and we cannot do that, right. That’s not how a system like this works.

LEE: Yeah.

MOLLICK: It’s not a person, but thinking about it like a person can be helpful, right.

LEE: One of the things that has been sort of a fun project for me for the last few years is I have been a founding board member of a new medical school at Kaiser Permanente. And, you know, that medical school curriculum is being formed in this era. But it’s been perplexing to understand, you know, what this means for a medical school curriculum. And maybe even more perplexing for me, at least, is the accrediting bodies, which are extremely important in US medical schools; how accreditors should think about what’s necessary here.

Besides the things that you’ve … the, kind of, four key ideas you mentioned, if you were talking to the board of directors of the LCME [Liaison Committee on Medical Education] accrediting body, what’s the one thing you would want them to really internalize?

MOLLICK: This is both a fast-moving and vital area. This can’t be viewed like a usual change, which [is], “Let’s see how this works.” Because the things that make medical technologies hard to adopt, like unclear results and limited, you know, expensive use cases, mean they roll out slowly. So one or two, you know, advanced medical facilities get access to, you know, proton beams or something else at multi-billion dollars of cost, and that takes a while to diffuse out. That’s not happening here. This is all happening at the same time, all at once. This is now … AI is part of medicine.

I mean, there’s a minor point that I’d make that actually is a really important one, which is large language models, generative AI overall, work incredibly differently than other forms of AI. So the other worry I have with some of these accreditors is they blend together algorithmic forms of AI, which medicine has been trying for a long time—decision support, algorithmic methods, like, medicine more so than other places has been thinking about those issues. Generative AI, even though it uses the same underlying techniques, is a completely different beast.

So, like, even just take the most simple thing of algorithmic aversion, which is a well-understood problem in medicine, right. Which is, so you have a tool that could tell you as a radiologist, you know, the chance of this being cancer; you don’t like it, you overrule it, right.

We don’t find algorithmic aversion happening with LLMs in the same way. People actually enjoy using them because it’s more like working with a person. The flaws are different. The approach is different. So you need to both view this as universally applicable today, which makes it urgent, but also as something that is not the same as your other forms of AI, and your AI working group that is thinking about how to solve this problem is not the right people here.

LEE: You know, I think the world has been trained because of the magic of web search to view computers as question-answering machines. Ask a question, get an answer.

MOLLICK: Yes. Yes.

LEE: Write a query, get results. And as I have interacted with medical professionals, you can see that medical professionals have that model of a machine in mind. And I think that’s partly, I think psychologically, why hallucination is so alarming. Because you have a mental model of a computer as a machine that has absolutely rock-solid perfect memory recall.

But the thing that was so powerful in Co-Intelligence, and we tried to get at this in our book also, is that’s not the sweet spot. It’s this sort of deeper interaction, more of a collaboration. And I thought your use of the term Co-Intelligence really just even in the title of the book tried to capture this. When I think about education, it seems like that’s the first step, to get past this concept of a machine being just a question-answering machine. Do you have a reaction to that idea?

MOLLICK: I think that’s very powerful. You know, we’ve been trained over so many years at both using computers but also in science fiction, right. Computers are about cold logic, right. They will give you the right answer, but if you ask it what love is, they explode, right. Like that’s the classic way you defeat the evil robot in Star Trek, right. “Love does not compute.” [LAUGHTER]

Instead, we have a system that makes mistakes, is warm, beats doctors in empathy in almost every controlled study on the subject, right. Like, absolutely can outwrite you in a sonnet but will absolutely struggle with giving you the right answer every time. And I think our mental models are just broken for this. And I think you’re absolutely right. And that’s part of what I thought your book does get at really well is, like, this is a different thing. It’s also generally applicable. Again, the model in your head should be kind of like a person even though it isn’t, right.

There’s a lot of warnings and caveats to it, but if you start from a smart person you’re talking to, your mental model will be more accurate than starting from a smart machine, even though both are flawed examples, right. So it will make mistakes; it will make errors. The question is, what do you trust it on? What do you not trust it on? As you get to know a model, you’ll get to understand, like, I totally don’t trust it for this, but I absolutely trust it for that, right.

LEE: All right. So we’re getting to the end of the time we have together. And so I’d just like to get now into something a little bit more provocative. And I get the question all the time. You know, will AI replace doctors? In medicine and other advanced knowledge work, project out five to 10 years. What do you think happens?

MOLLICK: OK, so first of all, let’s acknowledge systems change much more slowly than individual use. You know, doctors are not individual actors; they’re part of systems, right. So not just the system of a patient who like may or may not want to talk to a machine instead of a person but also legal systems and administrative systems and systems that allocate labor and systems that train people.

So, like, it’s hard to imagine medicine being so upended in five to 10 years that, even if AI was better than doctors at every single thing doctors do, we’d actually see as radical a change in medicine as you might in other fields. I think you will see faster changes happen in consulting and law and, you know, coding, other spaces than medicine.

But I do think that there is good reason to suspect that AI will outperform people while still having flaws, right. That’s the difference. We’re already seeing that for common medical questions in enough randomized controlled trials that, you know, the best doctors beat AI, but the AI beats the mean doctor, right. Like, that’s just something we should acknowledge is happening at this point.

Now, will that work in your specialty? No. Will that work with all the contingent social knowledge that you have in your space? Probably not.

Like, these are vignettes, right. But, like, that’s kind of where things are. So let’s assume, right … you’re asking two questions. One is, how good will AI get?

LEE: Yeah.

MOLLICK: And we don’t know the answer to that question. I will tell you that your colleagues at Microsoft and increasingly the labs, the AI labs themselves, are all saying they think they’ll have a machine smarter than a human at every intellectual task in the next two to three years. If that doesn’t happen, the future is easier to predict, but let’s just assume that that’s the case. I think medicine starts to change with the idea that people feel obligated to use this to help with everything.

Your patients will be using it, and it will be your advisor and helper at the beginning phases, right. And I expect people to be better at empathy. I expect better bedside manner. I expect management tasks to become easier. I think administrative burden might lighten if we handle this the right way, or get much worse if we handle it badly. Diagnostic accuracy will increase, right.

And then there’s a set of discovery pieces happening, too, right. One of the core goals of all the AI companies is to accelerate medical research. How does that happen and how does that affect us is a, kind of, unknown question. So I think clinicians are in both the eye of the storm and surrounded by it, right. Like, they can resist AI use for longer than most other fields, but everything around them is going to be affected by it.

LEE: Well, Ethan, this has been really a fantastic conversation. And, you know, I think in contrast to all the other conversations we’ve had, this one gives especially the leaders in healthcare, you know, people actually trying to lead their organizations into the future, whether it’s in education or in delivery, a lot to think about. So I really appreciate you joining.

MOLLICK: Thank you.

[TRANSITION MUSIC]  

I’m a computing researcher who works with people who are right in the middle of today’s bleeding-edge developments in AI. And because of that, I often lose sight of how to talk to a broader audience about what it’s all about. And so I think one of Ethan’s superpowers is that he has this knack for explaining complex topics in AI in a really accessible way, getting right to the most important points without making it so simple as to be useless. That’s why I rarely miss an opportunity to read up on his latest work.

One of the first things I learned from Ethan is the intuition that you can, sort of, think of AI as a very knowledgeable intern. In other words, think of it as a persona that you can interact with, but you also need to be a manager for it and to always assess the work that it does.

In our discussion, Ethan went further to stress that there is, because of that, a serious education gap. You know, over the last decade or two, we’ve all been trained, mainly by search engines, to think of computers as question-answering machines. In medicine, in fact, there’s a question-answering application that is really popular called UpToDate. Doctors use it all the time. But generative AI systems like ChatGPT are different. There’s therefore a challenge in how to break out of the old-fashioned mindset of search to get the full value out of generative AI.

The other big takeaway for me was that Ethan pointed out that while it’s easy to see productivity gains from AI at the individual level, those same gains, at least today, don’t often translate automatically to organization-wide or system-wide gains. And one, of course, has to conclude that it takes more than just making individuals more productive; the whole system also has to adjust to the realities of AI.

Here’s now my interview with Azeem Azhar:

LEE: Azeem, welcome.

AZEEM AZHAR: Peter, thank you so much for having me. 

LEE: You know, I think you’re extremely well known in the world. But still, some of the listeners of this podcast series might not have encountered you before.

And so one of the ways I like to ask people to introduce themselves is, how do you explain to your parents what you do every day?

AZHAR: Well, I’m very lucky in that way because my mother was the person who got me into computers more than 40 years ago. And I still have that first computer, a ZX81 with a Z80 chip …

LEE: Oh wow.

AZHAR: … to this day. It sits in my study, all seven and a half thousand transistors and Bakelite plastic that it is. And my parents were both economists, and economics is deeply connected with technology in some sense. And I grew up in the late ’70s and the early ’80s. And that was a time of tremendous optimism around technology. It was space opera, science fiction, robots, and of course, the personal computer and, you know, Bill Gates and Steve Jobs. So that’s where I started.

And so, in a way, my mother and my dad, who passed away a few years ago, had always known me as someone who was fiddling with computers but also thinking about economics and society. And so, in a way, it’s easier to explain to them because they’re the ones who nurtured the environment that allowed me to research technology and AI and think about what it means to firms and to the economy at large.

LEE: I always like to understand the origin story. And what I mean by that is, you know, what was your first encounter with generative AI? And what was that like? What did you go through?

AZHAR: The first real moment was when Midjourney and Stable Diffusion emerged in that summer of 2022. I’d been away on vacation, and I came back—and I’d been off grid, in fact—and the world had really changed.

Now, I’d been aware of GPT-3 and GPT-2, which I played around with and with BERT, the original transformer paper about seven or eight years ago, but it was the moment where I could talk to my computer, and it could produce these images, and it could be refined in natural language that really made me think we’ve crossed into a new domain. We’ve gone from AI being highly discriminative to AI that’s able to explore the world in particular ways. And then it was a few months later that ChatGPT came out—November, the 30th.

And I think it was the next day or the day after that I said to my team, everyone has to use this, and we have to meet every morning and discuss how we experimented the day before. And we did that for three or four months. And, you know, it was really clear to me in that interface at that point that, you know, we’d absolutely pass some kind of threshold.

LEE: And who’s the we that you were experimenting with?

AZHAR: So I have a team of four who support me. They’re mostly researchers of different types. I mean, it’s almost like one of those jokes. You know, I have a sociologist, an economist, and an astrophysicist. And, you know, they walk into the bar, [LAUGHTER] or they walk into our virtual team room, and we try to solve problems.

LEE: Well, so let’s get now into brass tacks here. And I think I want to start maybe just with an exploration of the economics of all this and economic realities. Because I think in a lot of your work—for example, in your book—you look pretty deeply at how automation generally and AI specifically are transforming certain sectors like finance, manufacturing, and you have a really, kind of, insightful focus on what this means for productivity and which ways, you know, efficiencies are found.  

And then you, sort of, balance that with risks, things that can and do go wrong. And so as you take that background and looking at all those other sectors, in what ways are the same patterns playing out or likely to play out in healthcare and medicine?

AZHAR: I’m sure we will see really remarkable parallels but also new things going on. I mean, medicine has a particular quality compared to other sectors in the sense that it’s highly regulated, market structure is very different country to country, and it’s an incredibly broad field. I mean, just think about taking a Tylenol and going through laparoscopic surgery. Having an MRI and seeing a physio. I mean, this is all medicine. I mean, it’s hard to imagine a sector that is [LAUGHS] more broad than that.

So I think we can start to break it down, and, you know, where we’re seeing things with generative AI is at the, sort of, softest entry point, which is medical scribing. And I’m sure many of us have been with clinicians who have a medical scribe running alongside—they’re all on Surface Pros, I noticed, right? [LAUGHTER] They’re on the tablet computers, and they’re scribing away.

And what that’s doing is, in the words of my friend Eric Topol, it’s giving the clinician time back, right. They have time back from days that are extremely busy and, you know, full of administrative overload. So I think you can obviously do a great deal with reducing that overload.

And within my team, we have a view, which is if you do something five times in a week, you should be writing an automation for it. And if you’re a doctor, you’re probably reviewing your notes, writing the prescriptions, and so on several times a day. So those are things that can clearly be automated, and the human can be in the loop. But I think there are so many other ways just within the clinic that things can help.

So, one of my friends, my friend from my junior school—I’ve known him since I was 9—is an oncologist who’s also deeply into machine learning, and he’s in Cambridge in the UK. And he built with Microsoft Research a suite of imaging AI tools from his own discipline, which they then open sourced.

So that’s another way that you have an impact, which is that you actually enable the, you know, generalist, specialist, polymath, whatever they are in health systems to be able to get this technology, to tune it to their requirements, to use it, to encourage some grassroots adoption in a system that’s often been very, very heavily centralized.

LEE: Yeah.

AZHAR: And then I think there are some other things that are going on that I find really, really exciting. So one is the consumerization of healthcare. So I have one of those sleep tracking rings, the Oura.

LEE: Yup.

AZHAR: That is building a data stream that we’ll be able to apply more and more AI to. I mean, right now, it’s applying traditional, I suspect, machine learning, but you can imagine that as we start to get more data, we start to get more used to measuring ourselves, we create this sort of pot, a personal asset that we can turn AI to.

And there’s still another category. And that other category is one of the completely novel ways in which we can enable patient care and patient pathway. And there’s a fantastic startup in the UK called Neko Health, which, I mean, does physicals, MRI scans, and blood tests, and so on.

It’s hard to imagine Neko existing without the sort of advanced data, machine learning, AI that we’ve seen emerge over the last decade. So, I mean, I think that there are so many ways in which the temperature is slowly being turned up to encourage a phase change within the healthcare sector.

And last but not least, I do think that these tools can also be very, very supportive of a clinician’s life cycle. I think we, as patients, we’re a bit …  I don’t know if we’re as grateful as we should be for our clinicians who are putting in 90-hour weeks. [LAUGHTER] But you can imagine a world where AI is able to support not just the clinicians’ workload but also their sense of stress, their sense of burnout.

So just in those five areas, Peter, I sort of imagine we could start to fundamentally transform over the course of many years, of course, the way in which people think about their health and their interactions with healthcare systems.

LEE: I love how you break that down. And I want to press on a couple of things.

You also touched on the fact that medicine, at least in most of the world, is a highly regulated industry. I guess finance is the same way, but they also feel different because the, like, finance sector has to be very responsive to consumers, and consumers are sensitive to, you know, an abundance of choice; they are sensitive to price. Is there something unique about medicine besides being regulated?

AZHAR: I mean, there absolutely is. And in finance, as well, you have much clearer end states. So if you’re not in the consumer space, but you’re in the, you know, asset management space, you have to essentially deliver returns against the volatility or risk boundary, right. That’s what you have to go out and do. And I think if you’re in the consumer industry, you can come back to very, very clear measures, net promoter score being a very good example.

In the case of medicine and healthcare, it is much more complicated because as far as the clinician is concerned, people are individuals, and we have our own parts and our own responses. If we didn’t, there would never be a need for a differential diagnosis. There’d never be a need for, you know, Let’s try azithromycin first, and then if that doesn’t work, we’ll go to vancomycin, or, you know, whatever it happens to be. You would just know. But ultimately, you know, people are quite different. The symptoms that they’re showing are quite different, and also their compliance is really, really different.

I had a back problem that had to be dealt with by, you know, a physio and extremely boring exercises four times a week, but I was ruthless in complying, and my physio was incredibly surprised. He’d say, well, no one ever does this, and I said, well, you know, the thing is that I kind of just want to get this thing to go away.

LEE: Yeah.

AZHAR: And I think that that’s why medicine is and healthcare is so different and more complex. But I also think that’s why AI can be really, really helpful. I mean, we didn’t talk about, you know, AI in its ability to potentially do this, which is to extend the clinician’s presence throughout the week.

LEE: Right. Yeah.

AZHAR: The idea that maybe some part of what the clinician would do if you could talk to them on Wednesday, Thursday, and Friday could be delivered through an app or a chatbot just as a way of encouraging the compliance, which is often, especially with older patients, one reason why conditions, you know, linger on for longer.

LEE: You know, just staying on the regulatory thing, as I’ve thought about this, the one regulated sector that I think seems to have some parallels to healthcare is energy delivery, energy distribution.

Because like healthcare, as a consumer, I don’t have choice in who delivers electricity to my house. And even though I care about it being cheap or at least not being overcharged, I don’t have an abundance of choice. I can’t do price comparisons.

And there’s something about that, just speaking as a consumer of both energy and a consumer of healthcare, that feels similar. Whereas other regulated industries, you know, somehow, as a consumer, I feel like I have a lot more direct influence and power. Does that make any sense to someone, you know, like you, who’s really much more expert in how economic systems work?

AZHAR: I mean, in a sense, one part of that is very, very true. You have a limited panel of energy providers you can go to, and in the US, there may be places where you have no choice.

I think the area where it’s slightly different is that as a consumer or a patient, you can actually make meaningful choices and changes yourself using these technologies, and people used to joke about you know asking Dr. Google. But Dr. Google is not terrible, particularly if you go to WebMD. And, you know, when I look at long-range change, many of the regulations that exist around healthcare delivery were formed at a point before people had access to good quality information at the touch of their fingertips or when educational levels in general were much, much lower. And many regulations existed because of the incumbent power of particular professional sectors.

I’ll give you an example from the United Kingdom. So I have had asthma all of my life. That means I’ve been taking my inhaler, Ventolin, and maybe a steroid inhaler for nearly 50 years. That means that I know … actually, I’ve got more experience, and I—in some sense—know more about it than a general practitioner.

LEE: Yeah.

AZHAR: And until a few years ago, I would have to go to a general practitioner to get this drug that I’ve been taking for five decades, and there they are, age 30 or whatever it is. And a few years ago, the regulations changed. And now pharmacies can … or pharmacists can prescribe those types of drugs under certain conditions directly.

LEE: Right.

AZHAR: That was not to do with technology. That was to do with incumbent lock-in. So when we look at the medical industry, the healthcare space, there are some parallels with energy, but there are a few differences: the ability that the consumer has to put in some effort to learn about their condition, but also the fact that some of the regulations that exist just exist because certain professions are powerful.

LEE: Yeah, one last question while we’re still on economics. There seems to be a conundrum about productivity and efficiency in healthcare delivery because I’ve never encountered a doctor or a nurse that wants to be able to handle even more patients than they’re doing on a daily basis.

And so, you know, if productivity means simply, well, your rounds can now handle 16 patients instead of eight patients, that doesn’t seem necessarily to be a desirable thing. So how can we or should we be thinking about efficiency and productivity since obviously costs are, in most of the developed world, a huge, huge problem?

AZHAR: Yes, and when you described doubling the number of patients on the round, I imagined you buying them all roller skates so they could just whizz around [LAUGHTER] the hospital faster and faster than ever before.

We can learn from what happened with the introduction of electricity. Electricity emerged at the end of the 19th century, around the same time that cars were emerging as a product, and car makers were very small and very artisanal. And in the early 1900s, some really smart car makers figured out that electricity was going to be important. And they bought into this technology by putting pendant lights in their workshops so they could “visit more patients.” Right?

LEE: Yeah, yeah.

AZHAR: They could effectively spend more hours working, and that was a productivity enhancement, and it was noticeable. But, of course, electricity fundamentally changed the productivity by orders of magnitude of people who made cars starting with Henry Ford because he was able to reorganize his factories around the electrical delivery of power and to therefore have the moving assembly line, which 10xed the productivity of that system.

So when we think about how AI will affect the clinician, the nurse, the doctor, it’s much easier for us to imagine it as the pendant light that just has them working later …

LEE: Right.

AZHAR: … than it is to imagine a reconceptualization of the relationship between the clinician and the people they care for.

And I’m not sure. I don’t think anybody knows what that looks like. But, you know, I do think that there will be a way that this changes, and you can see that scale out factor. And it may be, Peter, that what we end up doing is we end up saying, OK, because we have these brilliant AIs, there’s a lower level of training and cost and expense that’s required for a broader range of conditions that need treating. And that expands the market, right. That expands the market hugely. It’s what has happened in the market for taxis or ride sharing. The introduction of Uber and the GPS system …

LEE: Yup.

AZHAR: … has meant many more people now earn their living driving people around in their cars. And at least in London, you had to be reasonably highly trained to do that.

So I can see a reorganization is possible. Of course, entrenched interests, the economic flow … and there are many entrenched interests, particularly in the US between the health systems and the, you know, professional bodies that might slow things down. But I think a reimagining is possible.

And if I may, I’ll give you one example of that, which is, if you go to countries outside of the US where there are many more sick people per doctor, they have incentives to change the way they deliver their healthcare. And well before there was AI of this quality around, there were a few cases of health systems in India—Aravind Eye Care was one, and Narayana Hrudayalaya [now known as Narayana Health] was another. And in the latter, they were a cardiac care unit where you couldn’t get enough heart surgeons.

LEE: Yeah, yep.

AZHAR: So specially trained nurses would operate under the supervision of a single surgeon who would supervise many in parallel. So there are ways of increasing the quality of care, reducing the cost, but it does require a systems change. And we can’t expect a single bright algorithm to do it on its own.

LEE: Yeah, really, really interesting. So now let’s get into regulation. And let me start with this question. You know, there are several startup companies I’m aware of that are pushing on, I think, a near-term future possibility that a medical AI for consumers might be allowed, say, to prescribe a medication for you, something that would normally require a doctor or a pharmacist, you know, someone certified in some way, licensed to do it. Do you think we’ll get to a point where for certain regulated activities, humans are more or less cut out of the loop?

AZHAR: Well, humans would have been in the loop because they would have provided the training data, they would have done the oversight, the quality control. But to your question in general, would we delegate an important decision entirely to a tested set of algorithms? I’m sure we will. We already do that. I delegate less important decisions, like what time I should leave for the airport, to Waze. I delegate more important decisions to the automated braking in my car. We will do this at certain levels of risk and threshold.

If I come back to my example of prescribing Ventolin. It’s really unclear to me that the prescription of Ventolin, this incredibly benign bronchodilator that is only used by people who’ve been through the asthma process, needs to be prescribed by someone who’s gone through 10 years or 12 years of medical training. And why that couldn’t be prescribed by an algorithm or an AI system.

LEE: Right. Yep. Yep.

AZHAR: So, you know, I absolutely think that that will be the case and could be the case. I can’t really see what the objections are. And the real issue is where do you draw the line of where you say, “Listen, this is too important,” or “The cost is too great,” or “The side effects are too high,” and therefore this is a point at which we want to have some, you know, human taking personal responsibility, having a liability framework in place, having a sense that there is a person with legal agency who signed off on this decision. And that line I suspect will start fairly low, and what we’d expect to see would be that that would rise progressively over time.

LEE: What you just said, that scenario of your personal asthma medication, is really interesting because your personal AI might have the benefit of 50 years of your own experience with that medication. So, in a way, there is at least the data potential for, let’s say, the next prescription to be more personalized and more tailored specifically for you.

AZHAR: Yes. Well, let’s dig into this because I think this is super interesting, and we can look at how things have changed. So 15 years ago, if I had a bad asthma attack, which I might have once a year, I would have needed to go and see my general physician.

In the UK, it’s very difficult to get an appointment. I would have had to see someone privately who didn’t know me at all because I’ve just walked in off the street, and I would explain my situation. It would take me half a day. Productivity lost. I’ve been miserable for a couple of days with severe wheezing. Then a few years ago the system changed, a protocol changed, and now I have a thing called a rescue pack, which includes prednisolone steroids. It includes something else I’ve just forgotten, and an antibiotic in case I get an upper respiratory tract infection, and I have an “algorithm.” It’s called a protocol. It’s printed out. It’s a flowchart.

I answer various questions, and then I say, “I’m going to prescribe this to myself.” You know, UK doctors don’t prescribe prednisolone, or prednisone as you may call it in the US, at the drop of a hat, right. It’s a powerful steroid. I can self-administer, and I can now get that repeat prescription without seeing a physician a couple of times a year. And the algorithm, the “AI” is, it’s obviously been done in PowerPoint naturally, and it’s a bunch of arrows. [LAUGHS]

Surely, surely, an AI system is going to be more sophisticated, more nuanced, and give me more assurance that I’m making the right decision around something like that.

LEE: Yeah. Well, at a minimum, the AI should be able to make that PowerPoint the next time. [LAUGHS]

AZHAR: Yeah, yeah. Thank god for Clippy. Yes.

LEE: So, you know, I think in our book, we had a lot of certainty about most of the things we’ve discussed here, but one chapter where I felt we really sort of ran out of ideas, frankly, was on regulation. And, you know, what we ended up doing for that chapter is … I can’t remember if it was Carey’s or Zak’s idea, but we asked GPT-4 to have a conversation, a debate with itself [LAUGHS], about regulation. And we made some minor commentary on that.

And really, I think we took that approach because we just didn’t have much to offer. By the way, in our defense, I don’t think anyone else had any better ideas anyway.

AZHAR: Right.

LEE: And so now two years later, do we have better ideas about the need for regulation, the frameworks around which those regulations should be developed, and, you know, what should this look like?

AZHAR: So regulation is going to be in some cases very helpful because it provides certainty for the clinician that they’re doing the right thing, that they are still insured for what they’re doing, and it provides some degree of confidence for the patient. And we need to make sure that the claims that are made stand up to quite rigorous levels, where ideally there are RCTs [randomized controlled trials], and there are the classic set of processes you go through.

You do also want to be able to experiment, and so the question is: as a regulator, how can you enable conditions for there to be experimentation? And what is experimentation? Experimentation is learning so that every element of the system can learn from this experience.

So finding that space where there can be a bit of experimentation, I think, becomes very, very important. And a lot of this is about experience, so I think the first digital therapeutics have received FDA approval, which means there are now people within the FDA who understand how you go about running an approvals process for that, and what that ends up looking like—and of course what we’re very good at doing in this sort of modern hyper-connected world—is we can share that expertise, that knowledge, that experience very, very quickly.

So you go from one approval a year to a hundred approvals a year to a thousand approvals a year. So we will then actually, I suspect, need to think about what is it to approve digital therapeutics because, unlike big biological molecules, we can generate these digital therapeutics at the rate of knots [very rapidly].

LEE: Yes.

AZHAR: Every road in Hayes Valley in San Francisco, right, is churning out new startups who will want to do things like this. So then, I think about, what does it mean to get approved if indeed it gets approved? But we can also go really far with things that don’t require approval.

I come back to my sleep tracking ring. So I’ve been wearing this for a few years, and when I go and see my doctor or I have my annual checkup, one of the first things that he asks is how have I been sleeping. And in fact, I even sync my sleep tracking data to their medical record system, so he’s saying … hearing what I’m saying, but he’s actually pulling up the real data going, This patient’s lying to me again. Of course, I’m very truthful with my doctor, as we should all be. [LAUGHTER]

LEE: You know, actually, that brings up a point that consumer-facing health AI has to deal with pop science, bad science, you know, weird stuff that you hear on Reddit. And because one of the things that consumers want to know always is, you know, what’s the truth?

AZHAR: Right.

LEE: What can I rely on? And I think that somehow feels different than an AI that you actually put in the hands of, let’s say, a licensed practitioner. And so the regulatory issues seem very, very different for these two cases somehow.

AZHAR: I agree, they’re very different. And I think for a lot of areas, you will want to build AI systems that are first and foremost for the clinician, even if they have patient extensions, that idea that the clinician can still be with a patient during the week.

And you’ll do that anyway because you need the data, and you also need a little bit of a liability shield to have like a sensible person who’s been trained around that. And I think that’s going to be a very important pathway for many AI medical crossovers. We’re going to go through the clinician.

LEE: Yeah.

AZHAR: But I also do recognize what you say about the, kind of, kooky quackery that exists on Reddit. Although on Creatine, Reddit may yet prove to have been right. [LAUGHTER]

LEE: Yeah, that’s right. Yes, yeah, absolutely. Yeah.

AZHAR: Sometimes it’s right. And I think that it serves a really good role as a field of extreme experimentation. So if you’re somebody who makes a continuous glucose monitor traditionally given to diabetics but now lots of people will wear them—and sports people will wear them—you probably gathered a lot of extreme tail distribution data by reading the Reddit/biohackers …

LEE: Yes.

AZHAR: … for the last few years, where people were doing things that you would never want them to really do with the CGM [continuous glucose monitor]. And so I think we shouldn’t understate how important that petri dish can be for helping us learn what could happen next.

LEE: Oh, I think it’s absolutely going to be essential and a bigger thing in the future. So I think I just want to close here then with one last question. And I always try to be a little bit provocative with this.

And so as you look ahead to what doctors and nurses and patients might be doing two years from now, five years from now, 10 years from now, do you have any kind of firm predictions?

AZHAR: I’m going to push the boat out, and I’m going to go further out than closer in.

LEE: OK. [LAUGHS]

AZHAR: As patients, we will have many, many more touch points and interaction with our biomarkers and our health. We’ll be reading how well we feel through an array of things. And some of them we’ll be wearing directly, like sleep trackers and watches.

And so we’ll have a better sense of what’s happening in our lives. It’s like the moment you go from paper bank statements that arrive every month to being able to see your account in real time.

LEE: Yes.

AZHAR: And I suspect we’ll have … we’ll still have interactions with clinicians because societies that get richer see doctors more, societies that get older see doctors more, and we’re going to be doing both of those over the coming 10 years. But there will be a sense, I think, of continuous health engagement, not in an overbearing way, but just in a sense that we know it’s there, we can check in with it, it’s likely to be data that is compiled on our behalf somewhere centrally and delivered through a user experience that reinforces agency rather than anxiety.

And we’re learning how to do that slowly. I don’t think the health apps on our phones and devices have yet quite got that right. And that could help us personalize problems before they arise, and again, I use my experience for things that I’ve tracked really, really well. And I know from my data and from how I’m feeling when I’m on the verge of one of those severe asthma attacks that hits me once a year, and I can take a little bit of preemptive measure, so I think that that will become progressively more common and that sense that we will know our baselines.

I mean, when you think about being an athlete, which is something I think about, but I could never ever do, [LAUGHTER] but what happens is you start with your detailed baselines, and that’s what your health coach looks at every three or four months. For most of us, we have no idea of our baselines. You know, we get our blood pressure measured once a year. We will have baselines, and that will help us on an ongoing basis to better understand and be in control of our health. And then if the product designers get it right, it will be done in a way that doesn’t feel invasive, but it’ll be done in a way that feels enabling. We’ll still be engaging with clinicians augmented by AI systems more and more because they will also have gone up the stack. They won’t be spending their time on just “take two Tylenol and have a lie down” type of engagements because that will be dealt with earlier on in the system. And so we will be there in a very, very different set of relationships. And they will feel that they have different ways of looking after our health.

LEE: Azeem, it’s so comforting to hear such a wonderfully optimistic picture of the future of healthcare. And I actually agree with everything you’ve said.

Let me just thank you again for joining this conversation. I think it’s been really fascinating. And I think somehow the systemic issues, the systemic issues that you tend to just see with such clarity, I think are going to be the most, kind of, profound drivers of change in the future. So thank you so much.

AZHAR: Well, thank you, it’s been my pleasure, Peter, thank you.

[TRANSITION MUSIC]  

I always think of Azeem as a systems thinker. He’s always able to take the experiences of new technologies at an individual level and then project out to what this could mean for whole organizations and whole societies.

In our conversation, I felt that Azeem really connected some of what we learned in a previous episode—for example, from Chrissy Farr—on the evolving consumerization of healthcare to the broader workforce and economic impacts that we’ve heard about from Ethan Mollick.  

Azeem’s personal story about managing his asthma was also a great example. You know, he imagines a future, as do I, where personal AI might assist and remember decades of personal experience with a condition like asthma and thereby know more than any human being could possibly know in a deeply personalized and effective way, leading to better care. Azeem’s relentless optimism about our AI future was also so heartening to hear.

Both of these conversations leave me really optimistic about the future of AI in medicine. At the same time, it is pretty sobering to realize just how much we’ll all need to change in pretty fundamental and maybe even in radical ways. I think a big insight I got from these conversations is how we interact with machines is going to have to be altered not only at the individual level, but at the company level and maybe even at the societal level.

Since my conversation with Ethan and Azeem, there have been some pretty important developments that speak directly to this. Just last week at Build (opens in new tab), which is Microsoft’s yearly developer conference, we announced a slew of AI agent technologies. Our CEO, Satya Nadella, in fact, started his keynote by going online in a GitHub developer environment and then assigning a coding task to an AI agent, basically treating that AI as a full-fledged member of a development team. Other agents, for example, a meeting facilitator, a data analyst, a business researcher, travel agent, and more were also shown during the conference.

But pertinent to healthcare specifically, what really blew me away was the demonstration of a healthcare orchestrator agent. And the specific thing here was in Stanford’s cancer treatment center, when they are trying to decide on potentially experimental treatments for cancer patients, they convene a meeting of experts. That is typically called a tumor board. And so this AI healthcare orchestrator agent actually participated as a full-fledged member of a tumor board meeting to help bring data together, make sure that the latest medical knowledge was brought to bear, and to assist in the decision-making around a patient’s cancer treatment. It was pretty amazing.

[THEME MUSIC]

A big thank-you again to Ethan and Azeem for sharing their knowledge and understanding of the dynamics between AI and society more broadly. And to our listeners, thank you for joining us. I’m really excited for the upcoming episodes, including discussions on medical students’ experiences with AI and AI’s influence on the operation of health systems and public health departments. We hope you’ll continue to tune in.

Until next time.

[MUSIC FADES]

The post What AI’s impact on individuals means for the health workforce and industry appeared first on Microsoft Research.

Read More

RTX on Deck: The GeForce NOW Native App for Steam Deck Is Here

RTX on Deck: The GeForce NOW Native App for Steam Deck Is Here

GeForce NOW is supercharging Valve’s Steam Deck with a new native app — delivering the high-quality GeForce RTX-powered gameplay members are used to on a portable handheld device.

It’s perfect to pair with the six new games available this week, including Tokyo Xtreme Racer from Japanese game developer Genki.

Stream Deck

At the CES trade show in January, GeForce NOW announced a native app for the Steam Deck, unlocking the full potential of Valve’s handheld device for cloud gaming.

The app is now available, and gamers can stream titles on the Steam Deck at up to 4K 60 frames per second — connected to a TV — with HDR10, NVIDIA DLSS 4 and Reflex technologies on supported titles. Plus, members can run these games at settings and performance levels that aren’t possible natively on the Steam Deck. To top it off, Steam Deck users can enjoy up to 50% longer battery life when streaming from an RTX gaming rig in the cloud.

Steam Deck gamers can dive into graphics-intense AAA titles with the new app. Play Clair Obscur: Expedition 33, Elder Scrolls IV: Oblivion Remastered, Monster Hunter Wilds and Microsoft Flight Simulator 2024 at max settings — without worrying about hardware limits or battery drain.

The Elder Scrolls IV: Oblivion Remastered running in the GeForce NOW native app on Steam Deck. Obliterate gaming limits on the go.

Plus, Steam Deck users can now access over 2,200 supported games on GeForce NOW, including from their Steam, Epic Games Store, Ubisoft, Battle.net and Xbox libraries, with over 180 supported PC Game Pass titles.

Get all the perks of owning an RTX 4080 GPU while using a handheld device, with battery savings and no overheating. Dock it to the TV for a big-screen experience, or game on the go. Unlock a massive game library, better visuals and access to games that wouldn’t run on the handheld before.

Download the native app from the GeForce NOW page and find a step-by-step guide on the support page for GeForce NOW on Steam Deck.

To celebrate the launch of the new native app, NVIDIA is giving away two prize bundles — each including a Steam Deck OLED and Steam Deck Dock — as well as some free GeForce NOW Ultimate memberships. Be on the lookout for a chance to win by following the GeForce NOW social media channels (X, Facebook, Instagram, Threads), using #GFNOnSteamDeck and following the sweepstakes instructions.

Racing for New Games

Tokyo Xtreme Racer: Accelerate with the cloud and leave opponents behind in the dust.

Tokyo Xtreme Racer plunges players into the high-stakes world of Japanese highway street racing, featuring open-road duels on Tokyo’s expressways. Challenge rivals in intense one-on-one battles, aiming to drain their Spirit Points by outdriving them through traffic and tight corners. With deep car customization and a moody, neon-lit atmosphere, the game delivers a unique and immersive street racing experience.

Look for the following games available to stream in the cloud this week:

  • Nice Day for Fishing (New release on Steam, May 29)
  • Cash Cleaner Simulator (Steam)
  • Tokyo Xtreme Racer (Steam)
  • The Last Spell (Steam)
  • Tainted Grail: The Fall of Avalon (Steam)
  • Torque Drift 2 (Epic Games Store)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Run LLMs on AnythingLLM Faster With NVIDIA RTX AI PCs

Run LLMs on AnythingLLM Faster With NVIDIA RTX AI PCs

Large language models (LLMs), trained on datasets with billions of tokens, can generate high-quality content. They’re the backbone for many of the most popular AI applications, including chatbots, assistants, code generators and much more.

One of today’s most accessible ways to work with LLMs is with AnythingLLM, a desktop app built for enthusiasts who want an all-in-one, privacy-focused AI assistant directly on their PC.

With new support for NVIDIA NIM microservices on NVIDIA GeForce RTX and NVIDIA RTX PRO GPUs, AnythingLLM users can now get even faster performance for more responsive local AI workflows.

What Is AnythingLLM?

AnythingLLM is an all-in-one AI application that lets users run local LLMs, retrieval-augmented generation (RAG) systems and agentic tools.

It acts as a bridge between a user’s preferred LLMs and their data, and enables access to tools (called skills), making it easier and more efficient to use LLMs for specific tasks like:

  • Question answering: Getting answers to questions from top LLMs — like Llama and DeepSeek R1 — without incurring costs.
  • Personal data queries: Using RAG to query content privately, including PDFs, Word files, codebases and more.
  • Document summarization: Generating summaries of lengthy documents, like research papers.
  • Data analysis: Extracting data insights by loading files and querying them with LLMs.
  • Agentic actions: Dynamically researching content using local or remote resources, running generative tools and taking actions based on user prompts.

AnythingLLM can connect to a wide variety of open-source local LLMs, as well as larger LLMs in the cloud, including those provided by OpenAI, Microsoft and Anthropic. In addition, the application provides access to skills for extending its agentic AI capabilities via its community hub.

With a one-click install and the ability to launch as a standalone app or browser extension — wrapped in an intuitive experience with no complicated setup required — AnythingLLM is a great option for AI enthusiasts, especially those with GeForce RTX and NVIDIA RTX PRO GPU-equipped systems.

RTX Powers AnythingLLM Acceleration

GeForce RTX and NVIDIA RTX PRO GPUs offer significant performance gains for running LLMs and agents in AnythingLLM — speeding up inference with Tensor Cores designed to accelerate AI.

AnythingLLM runs LLMs with Ollama for on-device execution accelerated through Llama.cpp and ggml tensor libraries for machine learning.

Ollama, Llama.cpp and GGML are optimized for NVIDIA RTX GPUs and the fifth-generation Tensor Cores. On the GeForce RTX 5090, performance is 2.4x that of an Apple M3 Ultra.

GeForce RTX 5090 delivers 2.4x faster LLM inference in AnythingLLM than Apple M3 Ultra on both Llama 3.1 8B and DeepSeek R1 8B.

As NVIDIA adds new NIM microservices and reference workflows — like its growing library of AI Blueprints — tools like AnythingLLM will unlock even more multimodal AI use cases.

AnythingLLM — Now With NVIDIA NIM

AnythingLLM recently added support for NVIDIA NIM microservices — performance-optimized, prepackaged generative AI models that make it easy to get started with AI workflows on RTX AI PCs with a streamlined API.

NVIDIA NIM microservices are great for developers looking for a quick way to test a generative AI model in a workflow. Instead of having to find the right model, download all the files and figure out how to connect everything, they provide a single container that has everything you need. And they can run both in the cloud and on PCs, making it easy to prototype locally and then deploy to the cloud.

AnythingLLM’s user-friendly UI gives users a quick way to test and experiment with NIM microservices. They can then either connect the microservices to their workflows in AnythingLLM, or use NVIDIA AI Blueprints along with NIM documentation and sample code to plug them directly into their own apps or projects.
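
For developers who want to call a NIM outside of AnythingLLM, NIM LLM microservices expose an OpenAI-compatible API. The snippet below is an illustrative sketch rather than code from this post; the base URL, port, and model name are assumptions that depend on which NIM container is deployed and how it is configured.

    # Illustrative sketch: querying a locally running NIM LLM microservice through
    # its OpenAI-compatible endpoint. Base URL, port, and model name are assumptions.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # typical local NIM endpoint; verify for your container
        api_key="not-needed-for-local",       # local NIMs generally don't require an API key
    )

    completion = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",   # example model name; use the model your NIM serves
        messages=[{"role": "user", "content": "Summarize the benefits of running LLMs locally."}],
        max_tokens=128,
    )
    print(completion.choices[0].message.content)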

Explore the wide variety of NIM microservices available to elevate AI-powered workflows, including language and image generation, computer vision and speech processing.

Each week, the RTX AI Garage blog series features community-driven AI innovations and content for those looking to learn more about NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, digital humans, productivity apps and more on AI PCs and workstations. 

Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X. See notice regarding software product information.

Read More

Part 3: Building an AI-powered assistant for investment research with multi-agent collaboration in Amazon Bedrock and Amazon Bedrock Data Automation

Part 3: Building an AI-powered assistant for investment research with multi-agent collaboration in Amazon Bedrock and Amazon Bedrock Data Automation

In the financial services industry, analysts need to switch between structured data (such as time-series pricing information), unstructured text (such as SEC filings and analyst reports), and audio/visual content (earnings calls and presentations). Each format requires different analytical approaches and specialized tools, creating workflow inefficiencies. Add on top of this the intense time pressure resulting from rapidly evolving industry conditions and narrow decision windows. Delayed analysis can mean missed opportunities or failure to identify emerging risks, with potentially significant financial consequences.

AI assistants boost productivity by automating routine data collection and processing tasks, surfacing relevant insights, and allowing analysts to focus on higher-value activities. However, a single AI agent struggles to handle the full spectrum of specialized tasks involved in complex, multistep investment research workflows. This is where multi-agent collaboration provides an advantage. By creating specialized AI subagents that excel in specific tasks and functionalities and using them together under a framework coordinated by an AI supervisor agent, we can address the full complexity of investment research workflows. A supervisor agent can intelligently decompose complex queries, delegate specialized tasks to subagents, and synthesize their outputs into comprehensive answers—similar to how a research team functions in the real world. The benefits of this approach are substantial, stemming from distributed problem-solving and specialization, improved accuracy through specialized expertise, enhanced scalability by adding new agent capabilities without rebuilding the entire system, and greater transparency because the reasoning process of each specialized agent can be tracked and verified.

Amazon Bedrock Agents uses the reasoning of foundation models (FMs), APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks. With the multi-agent collaboration capability of Amazon Bedrock, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Amazon Bedrock Data Automation (BDA) enables the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video for your AI-powered applications, and it can be used as a parser when setting up a knowledge base for Retrieval Augmented Generation (RAG) workflows.

In this post, we walk through how to build a multi-agent investment research assistant using the multi-agent collaboration capability of Amazon Bedrock. Our solution demonstrates how a team of specialized AI agents can work together to analyze financial news, evaluate stock performance, optimize portfolio allocations, and deliver comprehensive investment insights—all orchestrated through a unified, natural language interface. We use BDA as a parser when setting up a knowledge base for RAG workflows with multi-modal data. Amazon Nova understanding models are used as the large language models (LLMs) associated with the supervisor agents and subagents.

Check out the video that demonstrates the solution.

Results from these agents are for demonstration purposes only and should not be taken as financial advice.

Solution overview

The multi-agent investment research assistant is composed of a supervisor agent and three subagents (quantitative analysis agent, news agent, and smart summarizer agent) working together in a coordinated framework, as shown in the following diagram.

The supervisor agent uses the instructions you provide to understand the structure and role of each collaborator agent. The supervisor agent orchestrates the overall investment research process by breaking down user prompts, delegating subtasks to specialized subagents, and consolidating their outputs to generate the final response. Each subagent is configured to carry out specialized tasks as follows:

  • Quantitative analysis agent
    • Functions – Powered by the stock-query and portfolio-optimization action groups (a minimal Lambda sketch for one of these action group functions follows this list).
    • Responsibilities – Query and analyze historical stock data and build optimized portfolio allocations based on user inputs such as stock tickers or investment amount.
  • News agent
    • Functions – Powered by the web_search action group and the financial_analysis_KB knowledge base.
    • Responsibilities – Search and retrieve relevant financial data such as earnings reports and filings from the knowledge base for context. If information isn’t present in the knowledge base, construct a web query. Handle queries related to analyzing company financials, performance drivers, and management commentary.
  • Smart summarizer agent
    • Functions – Uses LLM capabilities with specific prompt instructions, no custom action group.
    • Responsibilities – Takes in output from other subagents, such as recent news and financial data, and synthesizes the information into structured investment insights. Summarizes long texts or documents such as earnings reports into concise insights.
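
For reference, the following is a minimal sketch (not taken from the solution repository) of what a Lambda function backing an action group such as stock_data_lookup could look like. The event and response shapes below follow the function-details style that Amazon Bedrock Agents uses for Lambda-backed action groups, and the handler returns placeholder data; treat the exact contract and the data source as assumptions to verify against the GitHub repo.

    import json

    def lambda_handler(event, context):
        """Hypothetical handler for the stock_data_lookup action group (sketch only)."""
        # Bedrock Agents passes the invoked function name and its parameters in the event
        # (shape assumed from the function-details action group style; verify against the repo).
        function_name = event.get("function", "stock_data_lookup")
        params = {p["name"]: p["value"] for p in event.get("parameters", [])}
        ticker = params.get("ticker", "UNKNOWN")

        # Placeholder price history; a real implementation would query a market data source.
        price_history = {"ticker": ticker, "prices": [{"date": "2024-11-01", "close": 205.10}]}

        # Response wrapper expected for function-based action groups (assumed shape).
        return {
            "messageVersion": event.get("messageVersion", "1.0"),
            "response": {
                "actionGroup": event.get("actionGroup"),
                "function": function_name,
                "functionResponse": {
                    "responseBody": {"TEXT": {"body": json.dumps(price_history)}}
                },
            },
        }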

Multi-agent collaboration works as follows:

  1. User prompts the multi-agent investment research assistant with a high-level research query.
  2. Supervisor agent breaks the query into subtasks and orchestrates and invokes the relevant subagents in parallel or sequentially as needed:
    • News agent to retrieve financial reports or data
    • Quantitative analysis agent for stock data analysis
    • Smart summarizer agent to summarize lengthy texts
  3. Supervisor agent consolidates the subagent outputs along with the original query context.
  4. Supervisor agent passes the consolidated information to the LLM to generate the final research insights.

This architecture uses the strengths of the different specialized agents to comprehensively address the various aspects of investment research, and the supervisor agent provides the intelligent orchestration.

Technical architecture

The following diagram shows the technical architecture.

Some of the key components of the technical architecture, along with their associated code, are:

  • Amazon Bedrock Data Automation (BDA) for processing data – The following invokes a BDA job asynchronously using the invoke_data_automation_async() method of the BDA runtime client in boto3. Because the call is asynchronous, the returned invocation must then be polled for completion; a hedged status-check sketch follows this list of components.
    import boto3

    # Runtime client for Bedrock Data Automation (verify the client name against your boto3 version)
    bda_runtime_client = boto3.client("bedrock-data-automation-runtime")

    # s3_uri_input and s3_uri_output are the S3 locations of the source content and the results;
    # depending on your BDA setup, additional parameters (such as a project or profile ARN) may be required
    bda_response = bda_runtime_client.invoke_data_automation_async(
        inputConfiguration={"s3Uri": s3_uri_input},
        outputConfiguration={"s3Uri": s3_uri_output},
    )

  • Defining and building subagents with tools – The following creates a subagent using Agent.create. It registers a Lambda-backed tool for fetching historical stock data with the subagent, along with the subagent’s role, goal, instructions, and the underlying FM. A second Lambda tool for running portfolio optimization is registered the same way but is omitted from this excerpt for brevity.
    quantitative_analysis_agent = Agent.create(
        name="quantitative_analysis_agent",
        role="Financial Data Collector",
        goal="Retrieve real-time and historic stock prices as well as optimizing a portfolio given tickers.",
        instructions="""Specialist in real-time financial data extraction and portfolio optimization.
                            ...""",
        tools=[
            # Stock Data Lookup Tool
            {
                "code": stock_data_lookup_arn,
                "definition": {
                    "name": "stock_data_lookup",
                    "description": "Gets the 1-month stock price history for a given stock ticker, formatted as JSON.",
                    "parameters": {
                        "ticker": {"description": "The ticker to retrieve price history for", "type": "string", "required": True}
                    },
                },
            },
        ],
        llm=LLM,
    )

  • Defining and building the supervisor agent (multi-agent investment research assistant) – The following creates a supervisor agent using SupervisorAgent.create. It defines the collaborator agents (subagents), and registers them with the supervisor agent along with its role, goal, instructions, and the underlying FM.
    investment_research_assistant = SupervisorAgent.create(
        "investment_research_assistant",
        role="Investment Research Assistant",
        goal="A seasoned investment research expert responsible for orchestrating subagents ...",
        collaboration_type="SUPERVISOR",
        instructions=f"""You are a Investment Research Assistant, a financial research supervisor overseeing multiple specialized agents. Your goal is to coordinate and synthesize their outputs to create a structured stock investment analysis.
                    ...""",
        collaborator_agents=[
            {
                "agent": "news_agent",
                "instructions": f"Always check the knowledge base (ID: {kb_id}) first. Use this collaborator for finding news and analyzing specific documents."
            },
            {
                "agent": "quantitative_analysis_agent",
                "instructions": "Use this collaborator for retrieving stock price history and performing portfolio optimization."
            },
            {
                "agent": "smart_summarizer_agent",
                "instructions": "Use this collaborator for synthesizing stock trends, financial data, and generating structured investment insights."
            }
        ],
        collaborator_objects=[news_agent, quantitative_analysis_agent, smart_summarizer_agent],
        llm=LLM,
    )

  • Invoking the supervisor agent (multi-agent investment research assistant) – The following uses the invoke method to pass the defined request in as a prompt and get a response from the supervisor agent, using the multi-agent collaboration architecture.
    request = """What's AMZN stock price doing over the last week
                   and relate that to recent news"""
    print(f"Request:n{request}n")
    result = investment_research_assistant.invoke(
        request,
        enable_trace=True,
        trace_level="core",
    )
    print(f"Final answer:n{result}")

Prerequisites

To deploy this solution, attach the appropriate permissions for Amazon Bedrock, AWS Lambda, Amazon OpenSearch Serverless, Amazon Simple Storage Service (Amazon S3), and AWS Identity and Access Management (IAM) to your execution role. You also need to deploy two AWS CloudFormation stacks: web_search and stock_data.

For more information on the permission policy, CloudFormation stacks, and setting up the environment, refer to the README.md in the investment_research_agent repository on GitHub.

Dive deeper into the solution

To dive deeper into the solution and the code shown in this post, refer to the investment research assistant agent GitHub repo. The repo contains instructions for the end-to-end solution, including setting up the supervisor agent and subagents, associated action groups, unstructured data (earnings reports PDF files and earnings call audio files), and structured data (stocks time series).

In the appendix at the end of this post, we share the questions a financial analyst asked the supervisor agent, the subagents that were invoked, and the answers from the supervisor agent.

Cleanup

To clean up this solution, make sure to delete all agents, knowledge bases, and associated Amazon OpenSearch Serverless instances. Delete both the web_search and stock_data stacks. This will make sure that no charges are incurred after you’re finished testing this solution. Read the associated notebook in the GitHub repository for more instructions on cleaning up the agents.
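
As a rough illustration of the kind of cleanup involved (the cleanup notebook in the repository remains the authoritative path), the following sketch deletes an agent, a knowledge base, and the two CloudFormation stacks with boto3. The resource IDs are placeholders, and depending on your setup you may also need to delete agent aliases, OpenSearch Serverless collections, and other associated resources first.

    import boto3

    # Hedged cleanup sketch; the IDs below are placeholders and the repository's
    # cleanup notebook should be treated as the source of truth.
    bedrock_agent = boto3.client("bedrock-agent")
    cloudformation = boto3.client("cloudformation")

    bedrock_agent.delete_agent(agentId="YOUR_AGENT_ID", skipResourceInUseCheck=True)
    bedrock_agent.delete_knowledge_base(knowledgeBaseId="YOUR_KB_ID")

    for stack_name in ("web_search", "stock_data"):
        cloudformation.delete_stack(StackName=stack_name)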

Conclusion

The multi-agent investment research assistant built using Amazon Bedrock Agents with its multi-agent collaboration feature represents how financial professionals can use AI to enhance their analytical capabilities. By using specialized agents—each focusing on distinct aspects of financial analysis—orchestrated by a supervisor agent, we’ve created a system that can process diverse data types, apply appropriate analytical techniques, and synthesize comprehensive insights in response to complex investment queries. The result is an AI assistant that can handle multistep financial analysis workflows that would otherwise overwhelm a single-agent system.

The application of multi-agent collaboration extends broadly across the financial sector, from equity research and portfolio management to risk assessment and compliance monitoring. A complex financial workflow that requires multiple specialized skills and data sources can benefit from this architectural approach.

The implementation uses key AWS services, including Amazon Bedrock for foundation models, the multi-agent collaboration capability of Amazon Bedrock Agents, Amazon S3 for document storage, Amazon OpenSearch Serverless for vector search capabilities, and Lambda functions for specialized actions. This serverless architecture provides scalability, and the knowledge base integration provides context from financial documents to ground the AI’s responses in factual data.

For more information:

  • Get started building your own multi-agent investment research assistant by accessing our complete solution code and implementation guide in the GitHub repo.
  • Explore how to customize the solution for your specific investment research needs by modifying the subagents’ specialized tools and action groups and by bringing your data securely to Amazon Bedrock Knowledge Bases.
  • You can also explore and run the Amazon Bedrock multi-agent collaboration workshop with AWS specialists or on your own.

Read more about single-agent solutions for investment research with multimodal data in the earlier posts in this series.

Read the blog post on Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock for more details on benefits, design, and evaluation of multi-agent collaboration for enterprise applications.


About the Authors

Sovik Kumar Nath is an AI/ML and generative AI senior solutions architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He holds dual master’s degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Lucas Banerji is an AI/ML and GenAI specialist Solutions Architect at AWS. He is passionate about building agentic AI systems and exploring the frontier of what’s possible with intelligent automation. Lucas holds a degree in Computer Science from the University of Virginia. Outside of work, he enjoys running, practicing Muay Thai, and traveling the world.

Mohan Musti is a Principal Technical Account Manager based in Dallas. Mohan helps customers architect and optimize applications on AWS. Mohan holds a degree in Computer Science and Engineering from JNT University, India. In his spare time, he enjoys spending time with his family and camping.


Appendix

The following examples show questions asked by a financial analyst and the responses from the AI-powered assistant for investment research based on structured and unstructured data.

Example 1
  • Purpose: Prompt the agent to analyze the company’s financial health and performance.
  • Question: Analyze Amazon’s financial health based on the 2024 10k report. Calculate important financial ratios. Limit to 5 sentences.
  • Subagent, action group, and/or knowledge base invoked: news_agent (financial_analysis_kb), smart_summarizer_agent
  • Supervisor agent chain-of-thought (intermediate steps): 1) I need to specifically request the 2024 10-K report information from the knowledge base. Let me ask the news agent for this specific report. 2) Since the knowledge base contains recent financial information and forward-looking guidance, I’ll provide a comprehensive summary of Amazon’s financial metrics and projections. 3) Now I’ll ask the analyst agent to analyze this information and calculate key financial ratios.
  • Response: Based on Amazon’s recent financial data, the company demonstrates strong financial health with total assets of $624.9B and a robust cash position of $78.8B, while generating $143.3B in net sales and $59.2B in net income. Key profitability metrics show improvement with a 46% gross margin, 4.4% operating margin, and 14.8% return on equity, while efficiency ratios remain healthy with inventory turnover at 13.2x and asset turnover at 1.4x. Liquidity ratios (current ratio: 1.06, quick ratio: 0.88) are adequate though tight, but are supported by strong operating cash flow of $54.3B, representing a 38% year-over-year increase. The company maintains a balanced leverage profile with a debt-to-equity ratio of 0.98 and solid interest coverage ratio of 8.5x, indicating good debt management. Despite ongoing high capital expenditures, Amazon’s improving margins, strong cash generation, and efficient operations suggest a healthy financial position with sustainable growth potential.

Example 2
  • Purpose: Prompt the agent to analyze the company’s financial performance from the company’s earnings reports.
  • Question: What were the main drivers of AWS’s revenue growth in the Q1 2023 quarter? Please limit your answers to 5 sentences.
  • Subagent, action group, and/or knowledge base invoked: news_agent (financial_analysis_kb), smart_summarizer_agent
  • Supervisor agent chain-of-thought (intermediate steps): 1) I need to get information about AWS’s Q1 2023 earnings report. I’ll ask the news agent to check the knowledge base for this specific information. 2) Based on the search results, I can provide a comprehensive answer about AWS’s Q1 2023 performance and growth drivers. The information comes directly from their earnings reports. 3) I’ll synthesize this information into a concise response focusing on the main drivers and send it to the user.
  • Response: AWS’s Q1 2023 revenue growth was primarily driven by continued cloud adoption, achieving net sales of $21.4 billion and a 16% year-over-year growth rate. However, the growth was tempered by customers across all segments optimizing their cloud spending in response to economic uncertainty. The business maintained strong fundamentals with an annualized sales run rate exceeding $85 billion, though momentum showed signs of slowing with April growth rates declining by approximately 500 basis points compared to Q1.

Example 3
  • Purpose: Prompt the agent to analyze the company’s management team and their plans for the future from the company’s earnings reports.
  • Question: What are Amazon’s top priorities for 2024? What are the biggest risks facing the company? Please limit your answers to 5 sentences.
  • Subagent, action group, and/or knowledge base invoked: news_agent (financial_analysis_kb & web_search), smart_summarizer_agent
  • Supervisor agent chain-of-thought (intermediate steps): 1) I’ll ask the news agent to check the knowledge base for recent information about Amazon’s strategic priorities and risk factors for 2024. 2) Let me get additional recent news and risk factors by asking the news agent to search for supplementary information. 3) Based on the recent news searches, I can now provide a comprehensive overview of the significant risks currently facing Amazon.
  • Response: Amazon’s top priorities for 2024 include expanding international stores, growing grocery and B2B retail segments, scaling the “Buy with Prime” program, advancing healthcare initiatives, and developing low Earth orbit satellite capabilities for improved connectivity. The company faces several significant risks, including substantial AI-related capital expenditure pressures (expected $75 billion in 2024), increasing regulatory scrutiny from the FTC and competition authorities, potential impacts from new Chinese import tariffs, and intensifying competition in both cloud computing and retail sectors. Amazon is balancing these ambitious growth initiatives with cost optimization efforts, while viewing their strategic investments as potentially transformative for their long-term future.

Example 4
  • Purpose: Prompt the agent to build an optimal portfolio using the collected data.
  • Question: Optimize my portfolio with AAA, BBB, and CCC [anonymized tickers].
  • Subagent, action group, and/or knowledge base invoked: quantitative_analysis_agent (stock_data_lookup & portfolio tool)
  • Supervisor agent chain-of-thought (intermediate steps): 1) I need to get historical stock data for these three stocks before running portfolio optimization. I’ll ask the stock data agent for this information. 2) Now that I have retrieved the historical price data for all three stocks, I’ll perform portfolio optimization using this data. 3) I’ll synthesize this information into a clear, comprehensive response for the user.
  • Response: Based on portfolio optimization analysis, here is the recommended allocation: Company A (AAA): 61.04% Company B (BBB): 32.54% Company C (CCC): 6.42% This allocation optimizes risk-adjusted returns based on recent market data. Please note that these allocations are based on mathematical optimization and historical data. Consider your personal investment goals, risk tolerance, and market conditions when making investment decisions.

Example 5
  • Purpose: Prompt the agent to look up news and relate it to stock data.
  • Question: What was happening to AMZN stock in November 2024? Relate that to news in the same timeframe.
  • Subagent, action group, and/or knowledge base invoked: news_agent (web_search), quantitative_analysis_agent (stock_data_lookup), & smart_summarizer_agent
  • Supervisor agent chain-of-thought (intermediate steps): 1) I’ll need to get both recent stock price data and news, then synthesize them together. I’ll request both simultaneously to be efficient. 2) I’ll now synthesize the stock price movement with the recent news developments to provide a comprehensive analysis.
  • Response: November 2024: Stock Price Movement: – Trading Range: $161.43 (low) to $206.21 (high) – Overall Trend: Significant upward momentum with notable volatility – Key Technical Levels: Resistance at $206, Support at $190. Key News Impact: 1. Q4 2024 Guidance (Released Oct 31): – Projected net sales: $181.5-188.5 billion – Expected growth: 7-11% year-over-year – This guidance served as a major catalyst for the stock’s movement Price-News Correlation: – The strong upward movement (~27% range) appears to be primarily driven by positive Q4 guidance – Investor confidence was bolstered by the projected sales growth – The wide guidance range reflected optimism around the holiday shopping season – Higher trading volumes were observed during price movements, indicating strong market participation The stock’s performance suggests that investors responded positively to Amazon’s forward-looking guidance, particularly regarding holiday season expectations. The volatility in price movements indicates active trading around these news events, with the market generally maintaining an optimistic outlook on Amazon’s near-term prospects.

Read More