Announcing support for Llama 2 and Mistral models and streaming responses in Amazon SageMaker Canvas

Launched in 2021, Amazon SageMaker Canvas is a visual, point-and-click service for building and deploying machine learning (ML) models without the need to write any code. Ready-to-use Foundation Models (FMs) available in SageMaker Canvas enable customers to use generative AI for tasks such as content generation and summarization.

We are thrilled to announce the latest updates to Amazon SageMaker Canvas, which bring exciting new generative AI capabilities to the platform. With support for Meta Llama 2 and Mistral.AI models and the launch of streaming responses, SageMaker Canvas continues to empower everyone who wants to get started with generative AI without writing a single line of code. In this post, we discuss these updates and their benefits.

Introducing Meta Llama 2 and Mistral models

Llama 2 is a cutting-edge foundation model by Meta that offers improved scalability and versatility for a wide range of generative AI tasks. Users have reported that Llama 2 is capable of engaging in meaningful and coherent conversations, generating new content, and extracting answers from existing notes. Llama 2 is among the state-of-the-art large language models (LLMs) available today for the open source community to build their own AI-powered applications.

Mistral.AI, a leading French AI start-up, has developed Mistral 7B, a powerful language model with 7.3 billion parameters. Mistral models have been very well received by the open-source community thanks to their use of grouped-query attention (GQA) for faster inference, which makes them highly efficient and comparable in performance to models with two to three times the number of parameters.

Today, we are excited to announce that SageMaker Canvas now supports three Llama 2 model variants and two Mistral 7B variants.

To test these models, navigate to the SageMaker Canvas Ready-to-use models page, then choose Generate, extract and summarize content. This is where you’ll find the SageMaker Canvas GenAI chat experience. Here, you can use any model from Amazon Bedrock or SageMaker JumpStart by selecting it from the model drop-down menu.

In our case, we choose one of the Llama 2 models. Now you can provide your input or query. As you send the input, SageMaker Canvas forwards your input to the model.

Choosing which of the models available in SageMaker Canvas best fits your use case requires you to take into account information about the models themselves. The Llama-2-70B-chat model is a bigger model (70 billion parameters, compared to 13 billion for Llama-2-13B-chat), which means its performance is generally higher than that of the smaller model, at the cost of slightly higher latency and an increased cost per token. Mistral-7B has performance comparable to Llama-2-7B or Llama-2-13B, but it is hosted on Amazon SageMaker. This means the pricing model is different, moving from a dollar-per-token model to a dollar-per-hour model, which can be more cost-effective with a significant number of requests per hour and consistent usage at scale. All of the models above can perform well on a variety of use cases, so our suggestion is to evaluate which model best solves your problem, considering output quality, throughput, and cost trade-offs.

If you’re looking for a straightforward way to compare how models behave, SageMaker Canvas natively provides this capability in the form of model comparisons. You can select up to three different models and send the same query to all of them at once. SageMaker Canvas then gets the responses from each of the models and shows them in a side-by-side chat UI. To do this, choose Compare and select the other models to compare against, as shown below:

Introducing response streaming: Real-time interactions and enhanced performance

One of the key advancements in this release is the introduction of streamed responses. Streaming provides a richer experience for the user and better reflects a chat: instead of waiting for the complete answer, users receive immediate, incremental feedback as the response is generated. This creates a more natural conversation flow and improves the interactivity, responsiveness, and overall user satisfaction of chatbot applications.

With this feature, you can now interact with your AI models in real time, receiving instant responses and enabling seamless integration into a variety of applications and workflows. All models that can be queried in SageMaker Canvas—from Amazon Bedrock and SageMaker JumpStart—can stream responses to the user.

Get started today

Whether you’re building a chatbot, recommendation system, or virtual assistant, the Llama 2 and Mistral models combined with streamed responses bring enhanced performance and interactivity to your projects.

To use the latest features of SageMaker Canvas, make sure to delete and recreate the app. To do that, log out from the app by choosing Log out, then open SageMaker Canvas again. You should see the new models and enjoy the latest releases. Logging out of the SageMaker Canvas application releases all resources used by the workspace instance, thereby avoiding unintended additional charges.

Conclusion

To get started with the new streamed responses for the Llama 2 and Mistral models in SageMaker Canvas, visit the SageMaker console and explore the intuitive interface. To learn more about how SageMaker Canvas and generative AI can help you achieve your business goals, refer to Empower your business users to extract insights from company documents using Amazon SageMaker Canvas and Generative AI and Overcoming common contact center challenges with generative AI and Amazon SageMaker Canvas.

If you want to learn more about SageMaker Canvas features and deep dive on other ML use cases, check out the other posts available in the SageMaker Canvas category of the AWS ML Blog. We can’t wait to see the amazing AI applications you will create with these new capabilities!


About the authors

Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML. He is based in Brussels and works closely with customers all around the globe who are looking to adopt low-code/no-code machine learning technologies and generative AI. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Dan Sinnreich is a Senior Product Manager at AWS, helping to democratize low-code/no-code machine learning. Prior to AWS, Dan built and commercialized enterprise SaaS platforms and time-series models used by institutional investors to manage risk and construct optimal portfolios. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.

How HSR.health is limiting risks of disease spillover from animals to humans using Amazon SageMaker geospatial capabilities

This is a guest post co-authored by Ajay K Gupta, Jean Felipe Teotonio and Paul A Churchyard from HSR.health.

HSR.health is a geospatial health risk analytics firm whose vision is that global health challenges are solvable through human ingenuity and the focused and accurate application of data analytics. In this post, we present one approach for zoonotic disease prevention that uses Amazon SageMaker geospatial capabilities to create a tool that provides more accurate disease spread information to health scientists to help them save more lives, quicker.

Zoonotic diseases affect both animals and humans. The transition of a disease from animal to human, known as spillover, is a phenomenon that continually occurs on our planet. According to health organizations such as the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO), a spillover event at a wet market in Wuhan, China, most likely caused the coronavirus disease 2019 (COVID-19). Studies suggest that a virus found in fruit bats underwent significant mutations, allowing it to infect humans. The initial patient, or ‘patient zero’, of COVID-19 probably started a local outbreak that eventually spread internationally. HSR.health’s Zoonotic Spillover Risk Index aims to assist in the identification of these early outbreaks before they cross international borders and lead to widespread global impact.

The main weapon public health has against the propagation of regional outbreaks is disease surveillance: an entire interlocking system of disease reporting, investigation, and data communication between different levels of a public health system. This system is dependent not only on human factors, but also on technology and resources to collect disease data, analyze patterns, and create a consistent and continuous stream of data transfer from local to regional to central health authorities.

The speed at which COVID-19 went from a local outbreak to a global disease present on every continent should be a sobering example of the dire need to harness innovative technology to create more efficient and accurate disease surveillance systems.

The risk of zoonotic disease spillover is sharply correlated with multiple social, environmental, and geographic factors that influence how often human beings interact with wildlife. HSR.health’s Zoonotic Disease Spillover Risk Index uses over 20 distinct geographic, social, and environmental factors historically known to affect the risk of human-wildlife interaction and therefore zoonotic disease spillover risk. Many of these factors can be mapped through a combination of satellite imagery and remote sensing.

In this post, we explore how HSR.health uses SageMaker geospatial capabilities to retrieve relevant features from satellite imagery and remote sensing for developing the risk index. SageMaker geospatial capabilities make it easy for data scientists and machine learning (ML) engineers to build, train, and deploy models using geospatial data. With SageMaker geospatial capabilities, you can efficiently transform or enrich large-scale geospatial datasets, accelerate model building with pre-trained ML models, and explore model predictions and geospatial data on an interactive map using 3D accelerated graphics and built-in visualization tools.

Using ML and geospatial data for risk mitigation

ML is highly effective for anomaly detection on spatial or temporal data due to its ability to learn from data without being explicitly programmed to identify specific types of anomalies. Spatial data, which relates to the physical position and shape of objects, often contains complex patterns and relationships that may be difficult for traditional algorithms to analyze.

Incorporating ML with geospatial data enhances the capability to detect anomalies and unusual patterns systematically, which is essential for early warning systems. These systems are crucial in fields such as environmental monitoring, disaster management, and security. Predictive modeling using historical geospatial data allows organizations to identify and prepare for potential future events. These events range from natural disasters and traffic disruptions to, as this post discusses, disease outbreaks.

Detecting zoonotic spillover risks

To predict zoonotic spillover risks, HSR.health has adopted a multimodal approach. By using a blend of data types—including environmental, biogeographical, and epidemiological information—this method enables a comprehensive assessment of disease dynamics. Such a multifaceted perspective is critical for developing proactive measures and enabling a rapid response to outbreaks.

The approach includes the following components:

  • Disease and outbreak data – HSR.health uses the extensive disease and outbreak data provided by Gideon and the World Health Organization (WHO), two trusted sources of global epidemiological information. This data serves as a fundamental pillar in the analytics framework. For Gideon, the data can be accessed through an API, and for the WHO, HSR.health has built a large language model (LLM) to mine outbreak data from past disease outbreak reports.
  • Earth observation data – Environmental factors, land use analysis, and detection of habitat changes are integral components of assessing zoonotic risk. These insights can be derived from satellite-based earth observation data. HSR.health is able to streamline the use of earth observation data by using SageMaker geospatial capabilities to access and manipulate large-scale geospatial datasets. SageMaker geospatial offers a rich data catalog, including datasets from USGS Landsat-8, Sentinel-1, Sentinel-2, and others. It is also possible to bring in other datasets, such as high-resolution imagery from Planet Labs.
  • Social determinants of risk – Beyond biological and environmental factors, the team at HSR.health also considered social determinants, which encompass various socioeconomic and demographic indicators, and play a pivotal role in shaping zoonotic spillover dynamics.

From these components, HSR.health evaluated a range of different factors, and the following features have been identified as influential for identifying zoonotic spillover risks:

  • Animal habitats and habitable zones – Understanding the habitats of potential zoonotic hosts and their habitable zones is fundamental to assessing transmission risk.
  • Population centers – Proximity to densely populated areas is a key consideration because it influences the likelihood of human-animal interactions.
  • Loss of habitat – The degradation of natural habitats, particularly through deforestation, can accelerate zoonotic spillover events.
  • Human-wildland interface – Areas where human settlements intersect with wildlife habitats are potential hotspots for zoonotic transmission.
  • Social characteristics – Socioeconomic and cultural factors can significantly impact zoonotic risk, and HSR.health examines these as well.
  • Human health characteristics – The health status of local human populations is an essential variable because it affects susceptibility and transmission dynamics.

Solution overview

HSR.health’s workflow encompasses data preprocessing, feature extraction, and the creation of informative visualizations using ML techniques. This allows for a clear understanding of the data’s evolution from its raw form to actionable insights.

The following is a visual representation of the workflow, starting with input data from Gideon, earth observation data, and social determinant of risk data.


Retrieve and process satellite imagery using SageMaker geospatial capabilities

Satellite data forms a cornerstone of the analysis performed to build the risk index, providing critical information on environmental changes. To generate insights from satellite imagery, HSR.health uses Earth Observation Jobs (EOJs). EOJs enable the acquisition and transformation of raster data gathered from the Earth’s surface. An EOJ obtains satellite imagery from a designated data source—for instance, a satellite constellation—over a specific area and time period. It then applies one or more models to the retrieved images.

Additionally, Amazon SageMaker Studio offers a geospatial notebook pre-installed with commonly-used geospatial libraries. This notebook enables direct visualization and processing of geospatial data within a Python notebook environment. EOJs can be created in the geospatial notebook environment.

To configure an EOJ, the following parameters are used:

  • InputConfig – The input configuration specifies the data sources and the filtering criteria to be used during data acquisition:
    • RasterDataCollectionArn – Specifies the satellite from which to collect data.
    • AreaOfInterest – The geographical area of interest (AOI) defines the polygon boundaries for image collection.
    • TimeRangeFilter – The time range of interest: {StartTime: <string>, EndTime: <string>}.
    • PropertyFilters – Additional property filters, such as acceptable percentage of cloud coverage or desired sun azimuth angles.
  • JobConfig – This configuration defines the type of job to be applied to the retrieved satellite image data. It supports operations such as band math, resampling, geomosaic, or cloud removal.

The following example code demonstrates running an EOJ for cloud removal, representative of the steps performed by HSR.health:

import boto3

# Create the SageMaker geospatial client; execution_role is assumed to be the
# ARN of a SageMaker execution role with geospatial permissions.
geospatial_client = boto3.client("sagemaker-geospatial")

eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-76.23240119828894, -6.268815697653608],
                            [-76.23240119828894, -6.339419992332921],
                            [-76.13834453776985, -6.339419992332921],
                            [-76.13834453776985, -6.268815697653608],
                            [-76.23240119828894, -6.268815697653608],
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2022-03-01T00:00:00Z",
            "EndTime": "2022-06-30T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0.0, "UpperBound": 2.0}}}],
            "LogicalOperator": "AND",
        },
    }
}
eoj_job_config = {
    "CloudRemovalConfig": {
        "AlgorithmName": "INTERPOLATION",
        "InterpolationValue": "-9999",
        "TargetBands": ["red", "green", "blue", "nir", "swir16"],
    }
}

eoj = geospatial_client.start_earth_observation_job(
    Name="eoj-analysis-loreto",
    InputConfig=eoj_input_config,
    JobConfig=eoj_job_config,
    ExecutionRoleArn=execution_role,
)

HSR.health used several operations to preprocess the data and extract relevant features, including land cover classification, mapping temperature variation, and computing vegetation indices.

One vegetation index relevant for indicating vegetation health is the Normalized Difference Vegetation Index (NDVI). The NDVI quantifies vegetation health by using near-infrared light, which vegetation reflects, and red light, which vegetation absorbs. Monitoring the NDVI over time can reveal changes in vegetation, such as the impact of human activities like deforestation.

The following code snippet demonstrates how to calculate a vegetation index like the NDVI based on the data that has been passed through cloud removal:

eoj_input_config = {
    "PreviousEarthObservationJobArn": eoj["Arn"]
}
eoj_job_config = {
    "BandMathConfig": {
        "CustomIndices": {
            "Operations": [
                {
                    "Equation": "(nir - red) / (nir + red)",
                    "Name": "ndvi",
                    "OutputType": "FLOAT32",
                }
            ]
        }
    }
}
eoj = geospatial_client.start_earth_observation_job(
    Name="eoj-vi-ndvi",
    InputConfig=eoj_input_config,
    JobConfig=eoj_job_config,
    ExecutionRoleArn=execution_role,
)

EOJ visualization

We can visualize the job output using SageMaker geospatial capabilities. SageMaker geospatial capabilities can help you overlay model predictions on a base map and provide layered visualization to make collaboration easier. With the GPU-powered interactive visualizer and Python notebooks, it’s possible to explore millions of data points in one view, facilitating the collaborative exploration of insights and results.
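
An EOJ runs asynchronously, so it can be useful to confirm that the job has completed before visualizing its output. The following minimal sketch polls the job status with the standard get_earth_observation_job API call, reusing the geospatial_client and eoj handle from the earlier snippets:

import time

# Poll until the EOJ reaches a terminal state
while True:
    response = geospatial_client.get_earth_observation_job(Arn=eoj["Arn"])
    status = response["Status"]
    if status in ("COMPLETED", "FAILED", "STOPPED"):
        break
    time.sleep(30)
print(f"EOJ finished with status: {status}")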

The steps outlined in this post demonstrate just one of the many raster-based features that HSR.health has extracted to create the risk index.

Combining raster-based features with health and social data

After extracting the relevant features in raster format, HSR.health used zonal statistics to aggregate the raster data within the administrative boundary polygons to which the social and health data are assigned. The analysis incorporates a combination of raster and vector geospatial data. This kind of aggregation allows for the management of raster data in a geodataframe, which facilitates its integration with the health and social data to produce the final risk index.

The following code snippet demonstrates how to aggregate raster data to administrative vector boundaries:

import geopandas as gp
import numpy as np
import pandas as pd
import rasterio
from rasterstats import zonal_stats

def get_proportions(inRaster, inVector, classDict, idCols, year):
    # Reading In Vector File
    if '.parquet' in inVector:
        vector = gp.read_parquet(inVector)
    else:
        vector = gp.read_file(inVector)
    raster = rasterio.open(inRaster)
    vector = vector.to_crs(raster.crs)
    # Retrieving the Bounding Box for the Raster Image
    xmin, ymin, xmax, ymax = raster.bounds
    # Selecting the Vector Features that Intersect with the Raster Bounding Box
    vector = vector.cx[xmin:xmax, ymin:ymax]
    vector = vector.reset_index()
    # Calculate the sum of pixels of each class in the vector geometries
    stats = zonal_stats(vector.geometry, raster.read(1), affine=raster.transform, nodata=raster.nodata, categorical=True)
    # Creating a dataframe with the class sum of pixels and the id fields of the vector geometries
    df1 = pd.DataFrame(data=stats)
    df1 = df1.fillna(0)
    df1['totalpixels'] = df1.sum(axis=1)  
    df1['year'] = year 
    if 'year' in vector.columns.tolist():
        vector = vector.drop(columns=['year'])
    # Merging the class sum of pixels dataframe with the vector geodataframe
    df = vector.merge(df1, left_index=True, right_index=True)
    # Renaming Columns
    cdict = pd.read_csv(classDict)
    cdict = cdict.set_index("Value")['Class_name'].to_dict()
    df = df.rename(columns=cdict)
    keptCols = [x for x in df.columns.tolist() if x in idCols + list(cdict.values()) + ['totalpixels', 'year']]
    df = df[keptCols]
    return(df)

def aggregateData(rasterList, inVector, classDict, idCols, years):
    dfList = []
    # Creating aggregated raster to vector geodataframes for all rasters in rasterList
    for tiff in rasterList:
        inRaster = tiff
        year = [x for x in years if x in tiff][0]
        dfList.append(get_proportions(inRaster, inVector, classDict, idCols, year))
    # Concatenating into a single geodataframe
    allDf = pd.concat(dfList, ignore_index=True)
    classDictDf = pd.read_csv(classDict)
    # Renaming the numerical values of the categories to the string version of the category name
    classCols = classDictDf['Class_name'].unique().tolist()
    # Summing the pixel counts by administrative division as a single administrative division might cover more than one raster image
    for col in classCols:
        allDf[col] = allDf[col].fillna(0)
        allDf[col] = allDf.groupby(idCols + ['year'])[col].transform(lambda x: x.sum())
    # Removing Duplicates from the dataframe
    allDf = allDf.groupby(idCols + ['year']).first().reset_index()
    # Reattaching the geometry to the aggregated raster data
    if '.parquet' in inVector:
        vector = gp.read_parquet(inVector)
    else:
        vector = gp.read_file(inVector)
    allDf = vector.merge(allDf, on=idCols)
    return(allDf)
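
As an illustration, the following hypothetical call (the file names and column names are ours, not HSR.health's) aggregates two yearly land cover rasters to district boundaries:

# Hypothetical usage: aggregate yearly land cover rasters to district polygons.
rasters = ["landcover_2018.tif", "landcover_2023.tif"]  # assumed GeoTIFF paths
districts = "peru_districts.parquet"                    # assumed admin boundary file
aggregated = aggregateData(
    rasterList=rasters,
    inVector=districts,
    classDict="class_dict.csv",   # CSV mapping raster Value -> Class_name
    idCols=["district_id"],
    years=["2018", "2023"],       # each year string must appear in a raster file name
)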

To evaluate the extracted features effectively, ML models are used to predict factors representing each feature. One of the models used is a support vector machine (SVM). The SVM model assists in revealing patterns and associations within data that inform risk assessments.

The index represents a quantitative assessment of risk levels, calculated as a weighted average of these factors, to aid in understanding potential spillover events in various regions.

import pandas as pd
import numpy as np
import geopandas as gp

def finalIndicatorCalculation(inputLayer, weightDictionary, outLayer):
    # Creating a dictionary with the weights for each factor in the indicator
    weightsDict = pd.read_csv(weightDictionary).set_index('metric')['weight'].to_dict()
    # Reading in the data from the layer
    layer = gp.read_file(inputLayer)
    # Initializing the Sum of the Weights
    layer['sumweight'] = 0
    # Calculating the sum of the weighted factors
    for col in weightsDict.keys():
        layer[col] = layer[col].fillna(0)
        layer['sumweight'] = layer['sumweight'] + (layer[col] * weightsDict[col])
    # Calculating Raw Zoonotic Spillover Risk Index
    layer['raw_idx'] = np.log(layer['e_pop']) * layer['sumweight']
    # Normalizing the Index between 0 and 100
    layer['zs_idx'] = ((layer['raw_idx'] - layer['raw_idx'].min()) / (layer['raw_idx'].max() - layer['raw_idx'].min()) * 100).round(2)
    # Writing the result to the output layer
    layer.to_file(outLayer)
    return(layer)
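
A hypothetical invocation (the file names are ours) producing the district-level index might look as follows; the input layer is assumed to contain the factor columns and an 'e_pop' population column:

# Hypothetical usage: compute and save the risk index for a district layer.
risk = finalIndicatorCalculation(
    inputLayer="districts_with_features.gpkg",  # assumed aggregated feature layer
    weightDictionary="factor_weights.csv",      # CSV with 'metric' and 'weight' columns
    outLayer="zoonotic_risk_index.gpkg",
)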

The following figure on the left shows the aggregation of the image classification from the test area scene in northern Peru, aggregated to the district administrative level, with the calculated change in the forest area between 2018 and 2023. Deforestation is one of the key factors that determine the risk of zoonotic spillover. The figure on the right highlights the zoonotic spillover risk severity levels within the regions covered, ranging from the highest (red) to the lowest (dark green) risk. The area was chosen as one of the training areas for the image classification due to the diversity of land cover captured in the scene, including urban, forest, sand, water, grassland, and agriculture, among others. Additionally, this is one of many areas of interest for potential zoonotic spillover events due to the deforestation and interaction between humans and animals.

Zoonotic spillover risk severity levels in northern Peru

By adopting this multi-modal approach, encompassing historical data on disease outbreak, Earth observation data, social determinants, and ML techniques, we can better understand and predict zoonotic spillover risk, ultimately directing disease surveillance and prevention strategies to areas of greatest outbreak risk. The following screenshot shows a dashboard of the output from a zoonotic spillover risk analysis. This risk analysis highlights where resources and surveillance for new potential zoonotic outbreaks can occur so that the next disease can be contained before it becomes an endemic or a new pandemic.

Zoonotic spillover risk analysis dashboard

A novel approach to pandemic prevention

Along the Nipah River in Malaysia, between the fall of 1998 and the spring of 1999, 265 people were infected with a then unknown virus that caused acute encephalitis and severe respiratory distress. 105 of them died, a 39.6% fatality rate; COVID-19’s untreated fatality rate, by contrast, is 6.3%. Since then, the Nipah virus, as it is now dubbed, has moved out of its forest habitat and caused over 20 deadly outbreaks, mostly in India and Bangladesh.

Viruses such as Nipah surface each year, posing challenges to our daily lives, particularly in countries where establishing strong, lasting, and robust systems for disease surveillance and detection is more difficult. These detection systems are crucial for reducing the risks associated with such viruses.

Solutions that use ML and geospatial data, such as the Zoonotic Spillover Risk Index, can assist local public health authorities in prioritizing resource allocation to areas of highest risk. By doing so, they can establish targeted and localized surveillance measures to detect and halt regional outbreaks before they extend beyond borders. This approach can significantly limit the impact of a disease outbreak and save lives.

Conclusion

This post demonstrated how HSR.health successfully developed the Zoonotic Spillover Risk Index by integrating geospatial data, health, social determinants, and ML. By using SageMaker, the team created a scalable workflow that can pinpoint the most substantial threats of a potential future pandemic. Effective management of these risks can lead to a reduction in the global disease burden. The substantial economic and social advantages of reducing pandemic risk cannot be overstated, with benefits extending regionally and globally.

HSR.health used SageMaker geospatial capabilities for an initial implementation of the Zoonotic Spillover Risk Index and is now seeking partnerships, as well as support from host countries and funding sources, to develop the index further and extend its application to additional regions around the world. For more information about HSR.health and the Zoonotic Spillover Risk Index, visit www.hsr.health.

Discover the potential of integrating Earth observation data into your healthcare initiatives by exploring SageMaker geospatial features. For more information, refer to Amazon SageMaker geospatial capabilities, or engage with additional examples to get hands-on experience.


About the Authors

Ajay K Gupta is Co-Founder and CEO of HSR.health, a firm that disrupts and innovates health risk analytics through geospatial tech and AI techniques to predict the spread and severity of disease, and provides these insights to industry, governments, and the health sector so they can anticipate, mitigate, and take advantage of future risks. Outside of work, you can find Ajay behind the mic bursting eardrums while belting out his favorite pop music tunes from U2, Sting, George Michael, or Imagine Dragons.

Jean Felipe Teotonio is a driven physician and passionate expert in healthcare quality and infectious disease epidemiology who leads the HSR.health public health team. He works towards the shared goal of improving public health by reducing the global burden of disease, leveraging GeoAI approaches to develop solutions for the greatest health challenges of our time. Outside of work, his hobbies include reading sci-fi books, hiking, the English Premier League, and playing bass guitar.

Paul A Churchyard, CTO and Chief Geospatial Engineer for HSR.health, uses his broad technical skills and expertise to build the core infrastructure for the firm as well as its patented and proprietary GeoMD Platform. Additionally, he and the data science team incorporate geospatial analytics and AI/ML techniques into all health risk indices HSR.health produces. Outside of work, Paul is a self-taught DJ and loves snow.

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in geospatial AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions that capitalize on geospatial data. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in complex domains such as autonomous driving.

Emmett Nelson is an Account Executive at AWS supporting Nonprofit Research customers across the Healthcare & Life Sciences, Earth / Environmental Sciences, and Education verticals. His primary focus is enabling use cases across analytics, AI/ML, high performance computing (HPC), genomics, and medical imaging. Emmett joined AWS in 2020 and is based in Austin, TX.

Canada Partners with NVIDIA to Supercharge Computing Power

AI is reshaping industries, society and the “very fabric of innovation” — and Canada is poised to play a key role in this global transformation, said NVIDIA founder and CEO Jensen Huang during a fireside chat with leaders from across Canada’s thriving AI ecosystem.

“Canada, as you know, even though you’re so humble, you might not acknowledge it, is the epicenter of the invention of modern AI,” Huang told an audience of more than 400 from academia, industry and government gathered Thursday in Toronto.

In a pivotal development, Canada’s Industry Minister François-Philippe Champagne shared Friday on X, formerly known as Twitter, that Canada has signed a letter of intent with NVIDIA.

Nations including Canada, France, India and Japan are discussing the importance of investing in “sovereign AI capabilities,” Huang said in an interview with Bloomberg Television in Canada.

Such efforts promise to enhance domestic computing capabilities, turbocharging local economies and unlocking local talent.

“Their natural resource, data, should be refined and produced for their country. The recognition of sovereign AI capabilities is global,” Huang told Bloomberg.

Huang’s conversation with the group of Canadian AI leaders, or “four heroes of mine,” as he described them — Raquel Urtasun of Waabi, Brendan Frey of Deep Genomics, University of Toronto Professor Alan Aspuru-Guzik and Aiden Gomez of Cohere — highlighted both Canada’s enormous contributions and its growing capability as a leader in AI.

“Each one of you,” Huang remarked, addressing the panelists and noting that every one of them is affiliated with the University of Toronto, “are doing foundational work in some of the largest industries in the world.”

“Let’s seize the moment,” Champagne said as he kicked off the panel discussion. “We need to move from fear to opportunity to build trust so that people understand what AI can do for them.”

“It’s about inspiring young researchers to continue to do research here in Canada and about creating opportunities for them after they graduate to be able to start companies here,” Huang said.

The panelists echoed Huang’s optimism, providing insights into how AI is reshaping industries, society and technology.

Gomez, reflecting on the democratization of AI, shared his optimism for the future, stating that it’s “an exciting time to explore product space,” highlighting the vast opportunities for innovation and disruption within the AI landscape.

Cohere’s world-class large language models help enterprises build powerful, secure applications that search, understand meaning and converse in text.

Gomez said the future lies in communities of researchers and entrepreneurs able to leverage AI to bridge technological divides and foster inclusivity.

“I owe everything to this community, the people on the stage and in this room,” he said, acknowledging the collaborative spirit that fuels innovation in AI technologies.

Urtasun highlighted the imperative of safety in autonomous technology as a non-negotiable standard, emphasizing its role in saving lives and shaping the future of transportation.

“Safety should not be proprietary. This technology is a driver, and it’s going to save so many lives,” she said.

Frey underscored the transformative impact of AI in RNA biology. He said that, while “it’s taken quite a bit of time,” at Deep Genomics, he and his colleagues have built the world’s first foundation model for RNA biology, envisioning a future where drug discovery is accelerated, bringing life-saving therapies to market more efficiently.

“If we’re highly successful, best-case scenario, it means that we will be able to design molecules that are safe and highly efficacious, without doing any cell model studies, without doing any animal studies … and getting drugs that save our loved ones rapidly and safely,” Frey said.

Aspuru-Guzik pointed to the fusion of AI with quantum computing and materials science as a frontier for sustainable solutions, emphasizing the importance of creating a conducive environment for innovation in Canada.

“We want to build it here in Canada,” he said.

His work exemplifies the potential of AI to accelerate the development of new materials, driving forward a sustainable future.

Together, these visions articulate a future where AI enhances societal well-being, drives scientific advancement and fosters an inclusive, global community of innovation.

“AI is about the greatest opportunity to close the technology divide and be inclusive, for everybody to enjoy the technology revolution,” Huang said.

For more on the fast-growing impact of AI across the globe, visit NVIDIA’s AI nations hub.

Featured image credit: Christian Raul Hernandez, Creative Commons Attribution-Share Alike 4.0 International license.

New Study Cites AI as Strategic Tool to Combat Climate Change

A new study underscores the potential of AI and accelerated computing to deliver energy efficiency and combat climate change, efforts in which NVIDIA has long been deeply engaged.

The study, called “Rethinking Concerns About AI’s Energy Use,” provides a well-researched examination into how AI can — and in many cases already does — play a large role in addressing these critical needs.

Citing dozens of sources, the study from the Information Technology and Innovation Foundation (ITIF), a Washington-based think tank focused on science and technology policy, calls for governments to accelerate adoption of AI as a significant new tool to drive energy efficiency across many industries.

AI can help “reduce carbon emissions, support clean energy technologies, and address climate change,” it said.

How AI Drives Energy Efficiency

The report documents ways machine learning is already helping many sectors reduce their impact on the environment.

For example, it noted:

  • Farmers are using AI to lessen their use of fertilizer and water.
  • Utilities are adopting it to make the electric grid more efficient.
  • Logistics operations use it to optimize delivery routes, reducing the fuel consumption of their fleets.
  • Factories are deploying it to reduce waste and increase energy efficiency.

In these and many other ways, the study argues that AI advances energy efficiency. So, it calls on policymakers “to ensure AI is part of the solution, not part of the problem, when it comes to the environment.”

It also recommends adopting AI broadly across government agencies to “help the public sector reduce carbon emissions through more efficient digital services, smart cities and buildings, intelligent transportation systems, and other AI-enabled efficiencies.”

Reviewing the Data on AI

The study’s author, Daniel Castro, saw in current predictions about AI a repeat of exaggerated forecasts that emerged during the rise of the internet more than two decades ago.

Daniel Castro, ITIF

“People extrapolate from early studies, but don’t consider all the important variables including improvements you see over time in digitalization like energy efficiency,” said Castro, who leads ITIF’s Center for Data Innovation.

“The danger is policymakers could miss the big picture and hold back beneficial uses of AI that are having positive impacts, especially in regulated areas like healthcare,” he said.

“For example, we’ve had electronic health records since the 1980s, but it took focused government investments to get them deployed,” he added. “Now AI brings big opportunities for decarbonization across the government and the economy.”

Optimizing Efficiency Across Data Centers

Data centers of every size have a part to play in maximizing their energy efficiency with AI and accelerated computing.

For instance, NVIDIA’s AI-based weather-prediction model, FourCastNet, is about 45,000x faster and consumes 12,000x less energy to produce a forecast than current techniques. That promises efficiency boosts for supercomputers around the world that run continuously to provide regional forecasts, Bjorn Stevens, director of the Max Planck Institute for Meteorology, said in a blog.

Overall, data centers could save a whopping 19 terawatt-hours of electricity a year if all AI, high performance computing and networking offloads were run on GPU and DPU accelerators instead of CPUs, according to NVIDIA’s calculations. That’s the equivalent of the energy consumption of 2.9 million passenger cars driven for a year.

Last year, the U.S. Department of Energy’s lead facility for open science documented its advances with accelerated computing.

Using NVIDIA A100 Tensor Core GPUs, energy efficiency improved 5x on average across four key scientific applications in tests at the National Energy Research Scientific Computing Center. An application for weather forecasting logged gains of nearly 10x.

The energy efficiency of GPUs has increased dramatically over time.

AI, Accelerated Computing Advance Climate Science

The combination of accelerated computing and AI is creating new scientific instruments to help understand and combat climate change.

In 2021, NVIDIA announced Earth-2, an initiative to build a digital twin of Earth on a supercomputer capable of simulating climate on a global scale. It’s among a handful of similarly ambitious efforts around the world.

An example is Destination Earth, a pan-European project to create digital twins of the planet that’s using accelerated computing, AI and “collaboration on an unprecedented scale,” said the project’s leader, Peter Bauer, a veteran with more than 20 years at Europe’s top weather-forecasting center.

Experts in the utility sector agree AI is key to advancing sustainability.

“AI will play a crucial role maintaining stability for an electric grid that’s becoming exponentially more complex with large numbers of low-capacity, variable generation sources like wind and solar coming online, and two-way power flowing into and out of houses,” said Jeremy Renshaw, a senior program manager at the Electric Power Research Institute, an independent nonprofit that collaborates with more than 450 companies in 45 countries, in a blog.

Learn more about sustainable computing as well as NVIDIA’s commitment to use 100% renewable energy starting in fiscal year 2025. And watch the video below for more on how AI is accelerating efforts to combat climate change.

A decoder-only foundation model for time-series forecasting

Time-series forecasting is ubiquitous in various domains, such as retail, finance, manufacturing, healthcare and natural sciences. In retail use cases, for example, it has been observed that improving demand forecasting accuracy can meaningfully reduce inventory costs and increase revenue. Deep learning (DL) models have emerged as a popular approach for forecasting rich, multivariate, time-series data because they have proven to perform well in a variety of settings (e.g., DL models dominated the M5 competition leaderboard).

At the same time, there has been rapid progress in large foundation language models used for natural language processing (NLP) tasks, such as translation, retrieval-augmented generation, and code completion. These models are trained on massive amounts of textual data derived from a variety of sources like Common Crawl and open-source code, which allows them to identify patterns in languages. This makes them very powerful zero-shot tools; for instance, when paired with retrieval, they can answer questions about and summarize current events.

Despite DL-based forecasters largely outperforming traditional methods and progress being made in reducing training and inference costs, they face challenges: most DL architectures require long and involved training and validation cycles before a customer can test the model on a new time-series. A foundation model for time-series forecasting, in contrast, can provide decent out-of-the-box forecasts on unseen time-series data with no additional training, enabling users to focus on refining forecasts for the actual downstream task like retail demand planning.

To that end, in “A decoder-only foundation model for time-series forecasting”, we introduce TimesFM, a single forecasting model pre-trained on a large time-series corpus of 100 billion real-world time-points. Compared to the latest large language models (LLMs), TimesFM is much smaller (200M parameters), yet we show that even at such scales, its zero-shot performance on a variety of unseen datasets of different domains and temporal granularities comes close to the state-of-the-art supervised approaches trained explicitly on these datasets. Later this year we plan to make this model available for external customers in Google Cloud Vertex AI.

A decoder-only foundation model for time-series forecasting

LLMs are usually trained in a decoder-only fashion that involves three steps. First, text is broken down into subwords called tokens. Then, the tokens are fed into stacked causal transformer layers that produce an output corresponding to each input token (it cannot attend to future tokens). Finally, the output corresponding to the i-th token summarizes all the information from previous tokens and predicts the (i+1)-th token. During inference, the LLM generates the output one token at a time. For example, when prompted with “What is the capital of France?”, it might generate the token “The”, then condition on “What is the capital of France? The” to generate the next token “capital” and so on until it generates the complete answer: “The capital of France is Paris”.
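
To make that loop concrete, here is a minimal sketch of greedy decoder-only generation in Python; the model and tokenizer objects are hypothetical stand-ins for illustration, not any particular library's API:

import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=32):
    # Greedy autoregressive generation: predict one token, append it,
    # and condition on everything generated so far.
    tokens = tokenizer.encode(prompt)        # e.g., "What is the capital of France?"
    for _ in range(max_new_tokens):
        logits = model(tokens)               # logits over the vocabulary for the next token
        next_token = int(np.argmax(logits))  # greedy choice; sampling also works
        tokens.append(next_token)
    return tokenizer.decode(tokens)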

A foundation model for time-series forecasting should adapt to variable context (what we observe) and horizon (what we query the model to forecast) lengths, while having enough capacity to encode all patterns from a large pretraining dataset. Similar to LLMs, we use stacked transformer layers (self-attention and feedforward layers) as the main building blocks for the TimesFM model. In the context of time-series forecasting, we treat a patch (a group of contiguous time-points) as a token, an approach popularized by recent long-horizon forecasting work. The task then is to forecast the (i+1)-th patch of time-points given the i-th output at the end of the stacked transformer layers.

However, there are several key differences from language models. Firstly, we need a multilayer perceptron block with residual connections to convert a patch of time-series into a token that can be input to the transformer layers along with positional encodings (PE). For that, we use a residual block similar to our prior work in long-horizon forecasting. Secondly, at the other end, an output token from the stacked transformer can be used to predict a longer length of subsequent time-points than the input patch length, i.e., the output patch length can be larger than the input patch length.

Consider a time-series of length 512 time-points being used to train a TimesFM model with input patch length 32 and output patch length 128. During training, the model is simultaneously trained to use the first 32 time-points to forecast the next 128 time-points, the first 64 time-points to forecast time-points 65 to 192, the first 96 time-points to forecast time-points 97 to 224, and so on. During inference, suppose the model is given a new time-series of length 256 and tasked with forecasting the next 256 time-points into the future. The model will first generate the future predictions for time-points 257 to 384, then condition on the initial 256-length input plus the generated output to generate time-points 385 to 512. On the other hand, if in our model the output patch length were equal to the input patch length of 32, then for the same task we would have to go through eight generation steps instead of just the two above. This increases the chances of more errors accumulating, and therefore, in practice, we see that a longer output patch length yields better performance for long-horizon forecasting.
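
The following minimal sketch (our illustration, not the TimesFM implementation) shows this patch-wise autoregressive generation and the step-count arithmetic: a 256-point horizon with an output patch length of 128 needs ceil(256/128) = 2 generation steps, while a length of 32 would need 8. The model argument is a hypothetical callable that maps a context array to the next output_patch_len predicted points.

import math
import numpy as np

def forecast(model, context, horizon, output_patch_len=128):
    context = np.asarray(context, dtype=np.float32)
    steps = math.ceil(horizon / output_patch_len)
    predictions = []
    for _ in range(steps):
        next_patch = model(context)                      # predict the next patch
        predictions.append(next_patch)
        context = np.concatenate([context, next_patch])  # condition on own output
    return np.concatenate(predictions)[:horizon]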

TimesFM architecture.

Pretraining data

Just like LLMs get better with more tokens, TimesFM requires a large volume of legitimate time series data to learn and improve. We have spent a great amount of time creating and assessing our training datasets, and the following is what we have found works best:

Synthetic data helps with the basics. Meaningful synthetic time-series data can be generated using statistical models or physical simulations. These basic temporal patterns can teach the model the grammar of time series forecasting.

Real-world data adds real-world flavor. We comb through available public time series datasets, and selectively put together a large corpus of 100 billion time-points. Among these datasets there are Google Trends and Wikipedia Pageviews, which track what people are interested in, and that nicely mirrors trends and patterns in many other real-world time series. This helps TimesFM understand the bigger picture and generalize better when provided with domain-specific contexts not seen during training.

Zero-shot evaluation results

We evaluate TimesFM zero-shot on data not seen during training using popular time-series benchmarks. We observe that TimesFM performs better than most statistical methods like ARIMA and ETS, and can match or outperform powerful DL models like DeepAR and PatchTST that have been explicitly trained on the target time-series.

We used the Monash Forecasting Archive to evaluate TimesFM’s out-of-the-box performance. This archive contains tens of thousands of time-series from various domains like traffic, weather, and demand forecasting, covering frequencies ranging from a few minutes to yearly data. Following existing literature, we inspect the mean absolute error (MAE), appropriately scaled so that it can be averaged across the datasets. We see that zero-shot (ZS) TimesFM is better than most supervised approaches, including recent deep learning models. We also compare TimesFM to GPT-3.5 for forecasting using a specific prompting technique proposed by llmtime. We demonstrate that TimesFM performs better than llmtime(ZS) despite being orders of magnitude smaller.

Scaled MAE (the lower the better) of TimesFM(ZS) against other supervised and zero-shot approaches on Monash datasets.

Most of the Monash datasets are short or medium horizon, i.e., the prediction length is not too long. We also test TimesFM on popular benchmarks for long horizon forecasting against a recent state-of-the-art baseline PatchTST (and other long-horizon forecasting baselines). In the next figure, we plot the MAE on ETT datasets for the task of predicting 96 and 192 time-points into the future. The metric has been calculated on the last test window of each dataset (as done by the llmtime paper). We see that TimesFM not only surpasses the performance of llmtime(ZS) but also matches that of the supervised PatchTST model explicitly trained on the respective datasets.

Last window MAE (the lower the better) of TimesFM(ZS) against llmtime(ZS) and long-horizon forecasting baselines on ETT datasets.

Conclusion

We train a decoder-only foundation model for time-series forecasting using a large pretraining corpus of 100B real world time-points, the majority of which was search interest time-series data derived from Google Trends and pageviews from Wikipedia. We show that even a relatively small 200M parameter pretrained model that uses our TimesFM architecture displays impressive zero-shot performance on a variety of public benchmarks from different domains and granularities.

Acknowledgements

This work is the result of a collaboration between several individuals across Google Research and Google Cloud, including (in alphabetical order): Abhimanyu Das, Weihao Kong, Andrew Leach, Mike Lawrence, Alex Martin, Rajat Sen, Yang Yang and Yichen Zhou.

Intervening on early readouts for mitigating spurious features and simplicity bias

Machine learning models in the real world are often trained on limited data that may contain unintended statistical biases. For example, in the CELEBA celebrity image dataset, a disproportionate number of female celebrities have blond hair, leading to classifiers incorrectly predicting “blond” as the hair color for most female faces — here, gender is a spurious feature for predicting hair color. Such unfair biases could have significant consequences in critical applications such as medical diagnosis.

Surprisingly, recent work has also discovered an inherent tendency of deep networks to amplify such statistical biases, through the so-called simplicity bias of deep learning. This bias is the tendency of deep networks to identify weakly predictive features early in the training, and continue to anchor on these features, failing to identify more complex and potentially more accurate features.

With the above in mind, we propose simple and effective fixes to this dual challenge of spurious features and simplicity bias by applying early readouts and feature forgetting. First, in “Using Early Readouts to Mediate Featural Bias in Distillation”, we show that making predictions from early layers of a deep network (referred to as “early readouts”) can automatically signal issues with the quality of the learned representations. In particular, these predictions are more often wrong, and more confidently wrong, when the network is relying on spurious features. We use this erroneous confidence to improve outcomes in model distillation, a setting where a larger “teacher” model guides the training of a smaller “student” model. Then in “Overcoming Simplicity Bias in Deep Networks using a Feature Sieve”, we intervene directly on these indicator signals by making the network “forget” the problematic features and consequently look for better, more predictive features. This substantially improves the model’s ability to generalize to unseen domains compared to previous approaches. Our AI Principles and our Responsible AI practices guide how we research and develop these advanced applications and help us address the challenges posed by statistical biases.

Animation comparing hypothetical responses from two models trained with and without the feature sieve.

Early readouts for debiasing distillation

We first illustrate the diagnostic value of early readouts and their application in debiased distillation, i.e., making sure that the student model inherits the teacher model’s resilience to feature bias through distillation. We start with a standard distillation framework where the student is trained with a mixture of label matching (minimizing the cross-entropy loss between student outputs and the ground-truth labels) and teacher matching (minimizing the KL divergence loss between student and teacher outputs for any given input).

Suppose one trains a linear decoder, i.e., a small auxiliary neural network named Aux, on top of an intermediate representation of the student model. We refer to the output of this linear decoder as an early readout of the network representation. Our finding is that early readouts make more errors on instances that contain spurious features, and further, the confidence on those errors is higher than the confidence associated with other errors. This suggests that confidence on errors from early readouts is a fairly strong, automated indicator of the model’s dependence on potentially spurious features.

Illustrating the usage of early readouts (i.e., output from the auxiliary layer) in debiasing distillation. Instances that are confidently mispredicted in the early readouts are upweighted in the distillation loss.

We used this signal to modulate the contribution of the teacher in the distillation loss on a per-instance basis, and found significant improvements in the trained student model as a result.
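
A minimal PyTorch-style sketch of this idea follows. The exact weighting scheme is our illustration (the paper's formulation may differ), and aux_logits denotes the early readout from the auxiliary decoder:

import torch
import torch.nn.functional as F

def debiased_distillation_loss(student_logits, teacher_logits, aux_logits,
                               labels, temperature=2.0):
    # Confidence of the early readout on its own (possibly wrong) prediction
    aux_probs = F.softmax(aux_logits.detach(), dim=-1)
    aux_conf, aux_pred = aux_probs.max(dim=-1)
    confidently_wrong = (aux_pred != labels).float() * aux_conf
    # Upweight the teacher-matching term where the readout is confidently wrong
    weight = 1.0 + confidently_wrong  # one plausible per-instance weighting
    label_loss = F.cross_entropy(student_logits, labels, reduction="none")
    teacher_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="none",
    ).sum(dim=-1) * temperature**2
    return (label_loss + weight * teacher_loss).mean()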

We evaluated our approach on standard benchmark datasets known to contain spurious correlations (Waterbirds, CelebA, CivilComments, MNLI). Each of these datasets contains groupings of data that share an attribute potentially correlated with the label in a spurious manner. As an example, the CelebA dataset mentioned above includes groups such as {blond male, blond female, non-blond male, non-blond female}, with models typically performing the worst on the {non-blond female} group when predicting hair color. Thus, a measure of model performance is its worst group accuracy, i.e., the lowest accuracy among all known groups present in the dataset. We improved the worst group accuracy of student models on all datasets; moreover, we also improved overall accuracy in three of the four datasets, showing that our improvement on any one group does not come at the expense of accuracy on other groups. More details are available in our paper.

Comparison of Worst Group Accuracies of different distillation techniques relative to that of the Teacher model. Our method outperforms other methods on all datasets.

Overcoming simplicity bias with a feature sieve

In a second, closely related project, we intervene directly on the information provided by early readouts, to improve feature learning and generalization. The workflow alternates between identifying problematic features and erasing identified features from the network. Our primary hypothesis is that early features are more prone to simplicity bias, and that by erasing (“sieving”) these features, we allow richer feature representations to be learned.

Training workflow with feature sieve. We alternate between identifying problematic features (using training iteration) and erasing them from the network (using forgetting iteration).

We describe the identification and erasure steps in more detail:

  • Identifying simple features: We train the primary model and the readout model (Aux above) in conventional fashion via forward- and back-propagation. Note that feedback from the auxiliary layer does not back-propagate to the main network. This forces the auxiliary layer to learn from already-available features rather than create or reinforce them in the main network.
  • Applying the feature sieve: We aim to erase the identified features in the early layers of the neural network with the use of a novel forgetting loss, Lf, which is simply the cross-entropy between the readout and a uniform distribution over labels (a sketch of this loss follows the list). Essentially, all information that leads to nontrivial readouts is erased from the primary network. In this step, the auxiliary network and upper layers of the main network are kept unchanged.
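As a hedged illustration, the cross-entropy between the readout and a uniform distribution over K labels reduces to the negative mean log-probability, so the forgetting loss can be sketched in PyTorch as follows.

import torch
import torch.nn.functional as F

def forgetting_loss(aux_logits):
    # Cross-entropy between the readout and a uniform distribution over K labels:
    # H(uniform, p) = -(1/K) * sum_k log p_k, averaged over the batch
    log_probs = F.log_softmax(aux_logits, dim=-1)
    return -log_probs.mean(dim=-1).mean()

aux_logits = torch.randn(32, 10)   # illustrative readout logits
loss = forgetting_loss(aux_logits)

Minimizing this loss pushes the readout toward uniform predictions, erasing the identified features from the early layers of the primary network.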

We can control specifically how the feature sieve is applied to a given dataset through a small number of configuration parameters. By changing the position and complexity of the auxiliary network, we control the complexity of the identified and erased features. By modifying the mixing of learning and forgetting steps, we control the degree to which the model is challenged to learn more complex features. These choices, which are dataset-dependent, are made via hyperparameter search to maximize validation accuracy, a standard measure of generalization. Since we include “no-forgetting” (i.e., the baseline model) in the search space, we expect to find settings that are at least as good as the baseline.

Below we show features learned by the baseline model (middle row) and our model (bottom row) on two benchmark datasets: biased activity recognition (BAR) and animal categorization (NICO). Feature importance was estimated using post-hoc gradient-based importance scoring (Grad-CAM), with the orange-red end of the spectrum indicating high importance and the green-blue end indicating low importance. As shown below, our trained models focus on the primary object of interest, whereas the baseline model tends to focus on background features that are simpler and spuriously correlated with the label.

Feature importance scoring using Grad-CAM on activity recognition (BAR) and animal categorization (NICO) generalization benchmarks. Our approach (last row) focuses on the relevant objects in the image, whereas the baseline (ERM; middle row) relies on background features that are spuriously correlated with the label.

Through this ability to learn better, generalizable features, we show substantial gains over a range of relevant baselines on real-world spurious feature benchmark datasets: BAR, CelebA Hair, NICO and ImagenetA, by margins up to 11% (see figure below). More details are available in our paper.

Our feature sieve method improves accuracy by significant margins relative to the nearest baseline for a range of feature generalization benchmark datasets.

Conclusion

We hope that our work on early readouts and their use in feature sieving for generalization will both spur the development of a new class of adversarial feature learning approaches and help improve the generalization capability and robustness of deep learning systems.

Acknowledgements

The work on applying early readouts to debiasing distillation was conducted in collaboration with our academic partners Durga Sivasubramanian, Anmol Reddy and Prof. Ganesh Ramakrishnan at IIT Bombay. We extend our sincere gratitude to Praneeth Netrapalli and Anshul Nasery for their feedback and recommendations. We are also grateful to Nishant Jain, Shreyas Havaldar, Rachit Bansal, Kartikeya Badola, Amandeep Kaur and the whole cohort of pre-doctoral researchers at Google Research India for taking part in research discussions. Special thanks to Tom Small for creating the animation used in this post.

Read More

Monitor embedding drift for LLMs deployed from Amazon SageMaker JumpStart

Monitor embedding drift for LLMs deployed from Amazon SageMaker JumpStart

One of the most useful application patterns for generative AI workloads is Retrieval Augmented Generation (RAG). In the RAG pattern, we find pieces of reference content related to an input prompt by performing similarity searches on embeddings. Embeddings capture the information content in bodies of text, allowing natural language processing (NLP) models to work with language in a numeric form. Embeddings are just vectors of floating point numbers, so we can analyze them to help answer three important questions: Is our reference data changing over time? Are the questions users are asking changing over time? And finally, how well is our reference data covering the questions being asked?

In this post, you’ll learn about some of the considerations for embedding vector analysis and detecting signals of embedding drift. Because embeddings are an important source of data for NLP models in general and generative AI solutions in particular, we need a way to measure whether our embeddings are changing over time (drifting). You’ll see an example of performing drift detection on embedding vectors using a clustering technique with large language models (LLMs) deployed from Amazon SageMaker JumpStart. You’ll also be able to explore these concepts through two provided examples: an end-to-end sample application or, optionally, a subset of the application.

Overview of RAG

The RAG pattern lets you retrieve knowledge from external sources, such as PDF documents, wiki articles, or call transcripts, and then use that knowledge to augment the instruction prompt sent to the LLM. This allows the LLM to reference more relevant information when generating a response. For example, if you ask an LLM how to make chocolate chip cookies, it can include information from your own recipe library. In this pattern, the recipe text is converted into embedding vectors using an embedding model, and stored in a vector database. Incoming questions are converted to embeddings, and then the vector database runs a similarity search to find related content. The question and the reference data then go into the prompt for the LLM.
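To make the flow concrete, here is a minimal, library-agnostic sketch of the retrieval step. The embed() callable stands in for a call to the deployed embedding model, and docs for (text, vector) pairs read back from the vector database; both are assumptions for illustration.

import numpy as np

def cosine_similarity(a, b):
    # Similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question, docs, embed, k=3):
    # docs: list of (text, vector) pairs from the vector database (assumed)
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine_similarity(q, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

The top-k passages returned here are then placed in the prompt alongside the question.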

Let’s take a closer look at the embedding vectors that get created and how to perform drift analysis on those vectors.

Analysis on embedding vectors

Embedding vectors are numeric representations of our data, so analysis of these vectors can provide insight into our reference data and can later be used to detect potential signals of drift. Embedding vectors represent an item in n-dimensional space, where n is often large. For example, the GPT-J 6B model, used in this post, creates vectors of size 4096. To measure drift, assume that our application captures embedding vectors for both reference data and incoming prompts.

We start by performing dimension reduction using Principal Component Analysis (PCA). PCA tries to reduce the number of dimensions while preserving most of the variance in the data. In this case, we try to find the number of dimensions that preserves 95% of the variance.

Then we use K-Means to identify a set of cluster centers. K-Means tries to group points together into clusters such that each cluster is relatively compact and the clusters are as distant from each other as possible.
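The following scikit-learn sketch shows both steps; the input file name and the cluster count are illustrative assumptions, not values from this post.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

embeddings = np.load("embeddings.npy")   # hypothetical archive of embedding vectors

# Keep the smallest number of components that preserves 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(embeddings)
print(f"Dimensions preserving 95% of variance: {pca.n_components_}")

# Group the reduced vectors into clusters; k=8 is an illustrative choice
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(reduced)
print("Centroid locations:", kmeans.cluster_centers_.shape)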

We calculate the following information based on the clustering output shown in the following figure:

  • The number of dimensions in PCA that explain 95% of the variance
  • The location of each cluster center, or centroid

Additionally, we look at the proportion of samples that falls in each cluster, as shown in the following figure.
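Continuing from the K-Means fit in the sketch above, these proportions can be computed in one line.

import numpy as np

# Fraction of samples assigned to each cluster
proportions = np.bincount(kmeans.labels_) / len(kmeans.labels_)
print(proportions)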

Finally, we use this analysis to calculate the following:

  • Inertia – Inertia is the sum of squared distances of samples to their nearest cluster centroid, which measures how well the data was clustered by K-Means.
  • Silhouette score – The silhouette score measures the consistency of the clustering and ranges from -1 to 1. A value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from points in other clusters. A visual representation of the silhouette score can be seen in the following figure. A short sketch of computing both metrics follows this list.
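Continuing from the same K-Means fit, both metrics are available directly from scikit-learn.

from sklearn.metrics import silhouette_score

inertia = kmeans.inertia_                               # sum of squared distances to centroids
silhouette = silhouette_score(reduced, kmeans.labels_)  # ranges from -1 to 1
print(f"Inertia: {inertia:.2f}, silhouette score: {silhouette:.3f}")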

We can periodically capture this information for snapshots of the embeddings for both the source reference data and the prompts. Capturing this data allows us to analyze potential signals of embedding drift.

Detecting embedding drift

Periodically, we can compare the clustering information through snapshots of the data, which includes the reference data embeddings and the prompt embeddings. First, we can compare the number of dimensions needed to explain 95% of the variation in the embedding data, the inertia, and the silhouette score from the clustering job. As you can see in the following table, compared to a baseline, the latest snapshot of embeddings requires 39 more dimensions to explain the variance, indicating that our data is more dispersed. The inertia has gone up, indicating that the samples are in aggregate farther away from their cluster centers. Additionally, the silhouette score has gone down, indicating that the clusters are not as well defined. For prompt data, that might indicate that the types of questions coming into the system are covering more topics.

Next, in the following figure, we can see how the proportion of samples in each cluster has changed over time. This can show us whether our newer reference data is broadly similar to the previous set, or covers new areas.

Finally, we can see if the cluster centers are moving, which would show drift in the information in the clusters, as shown in the following table.

Reference data coverage for incoming questions

We can also evaluate how well our reference data aligns with the incoming questions. To do this, we assign each prompt embedding to a reference data cluster. We then compute the distance from each prompt to its corresponding center and look at the mean, median, and standard deviation of those distances. We can store that information and see how it changes over time.
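A minimal sketch of this assignment and distance calculation follows, reusing the reference-data PCA and K-Means fit from earlier; the prompt archive file name is a placeholder assumption.

import numpy as np

prompt_embeddings = np.load("prompt_embeddings.npy")   # hypothetical prompt archive
prompt_reduced = pca.transform(prompt_embeddings)      # project with the reference PCA
nearest = kmeans.predict(prompt_reduced)               # assign each prompt to a reference cluster
dists = np.linalg.norm(prompt_reduced - kmeans.cluster_centers_[nearest], axis=1)
print(f"mean={dists.mean():.3f}, median={np.median(dists):.3f}, std={dists.std():.3f}")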

The following figure shows an example of analyzing the distance between the prompt embedding and reference data centers over time.

As you can see, the mean, median, and standard deviation distance statistics between prompt embeddings and reference data centers are decreasing between the initial baseline and the latest snapshot. Although the absolute value of the distance is difficult to interpret, we can use the trends to determine whether the semantic overlap between reference data and incoming questions is getting better or worse over time.

Sample application

In order to gather the experimental results discussed in the previous section, we built a sample application that implements the RAG pattern using embedding and generation models deployed through SageMaker JumpStart and hosted on Amazon SageMaker real-time endpoints.

The application has three core components:

  • The interactive flow includes a user interface for capturing prompts, combined with a RAG orchestration layer that uses LangChain.
  • The data processing flow extracts data from PDF documents and creates embeddings that get stored in Amazon OpenSearch Service. We also use these in the final embedding drift analysis component of the application.
  • The embeddings are captured in Amazon Simple Storage Service (Amazon S3) via Amazon Kinesis Data Firehose, and we run a combination of AWS Glue extract, transform, and load (ETL) jobs and Jupyter notebooks to perform the embedding analysis.

The following diagram illustrates the end-to-end architecture.

The full sample code is available on GitHub. The provided code is available in two different patterns:

  • Sample full-stack application with a Streamlit frontend – This provides an end-to-end application, including a user interface using Streamlit for capturing prompts, combined with the RAG orchestration layer, using LangChain running on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate
  • Backend application – For those who don’t want to deploy the full application stack, you can optionally choose to only deploy the backend AWS Cloud Development Kit (AWS CDK) stack, and then use the provided Jupyter notebook to perform RAG orchestration using LangChain

Each pattern has several prerequisites, detailed in the following sections, starting with deploying the generative and text embedding models and then moving on to the additional prerequisites.

Deploy models through SageMaker JumpStart

Both patterns assume the deployment of an embedding model and generative model. For this, you’ll deploy two models from SageMaker JumpStart. The first model, GPT-J 6B, is used as the embedding model and the second model, Falcon-40b, is used for text generation.

You can deploy each of these models through SageMaker JumpStart from the AWS Management Console, Amazon SageMaker Studio, or programmatically. For more information, refer to How to use JumpStart foundation models. To simplify the deployment, you can use the provided notebook derived from notebooks automatically created by SageMaker JumpStart. This notebook pulls the models from the SageMaker JumpStart ML hub and deploys them to two separate SageMaker real-time endpoints.

The sample notebook also has a cleanup section. Don’t run that section yet, because it will delete the endpoints just deployed. You will complete the cleanup at the end of the walkthrough.

After confirming successful deployment of the endpoints, you’re ready to deploy the full sample application. However, if you’re more interested in exploring only the backend and analysis notebooks, you can optionally deploy only that, which is covered in the next section.

Option 1: Deploy the backend application only

This pattern allows you to deploy the backend solution only and interact with the solution using a Jupyter notebook. Use this pattern if you don’t want to build out the full frontend interface.

Prerequisites

You should have the following prerequisites:

  • A SageMaker JumpStart model endpoint deployed – Deploy the models to SageMaker real-time endpoints using SageMaker JumpStart, as previously outlined
  • Deployment parameters – Record the following:
    • Text model endpoint name – The endpoint name of the text generation model deployed with SageMaker JumpStart
    • Embeddings model endpoint name – The endpoint name of the embedding model deployed with SageMaker JumpStart

Deploy the resources using the AWS CDK

Use the deployment parameters noted in the previous section to deploy the AWS CDK stack. For more information about AWS CDK installation, refer to Getting started with the AWS CDK.

Make sure that Docker is installed and running on the workstation that will be used for AWS CDK deployment. Refer to Get Docker for additional guidance.

$ cd pattern1-rag/cdk
$ cdk deploy BackendStack --exclusively \
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> \
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

Alternatively, you can enter the context values in a file called cdk.context.json in the pattern1-rag/cdk directory and run cdk deploy BackendStack --exclusively.
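For reference, a minimal cdk.context.json for the backend stack might look like the following; both values are placeholders for your own endpoint names.

{
  "textModelEndpointName": "<SageMaker endpoint name for the text generation model>",
  "embeddingsModelEndpointName": "<SageMaker endpoint name for the text embeddings model>"
}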

The deployment will print out outputs, some of which will be needed to run the notebook. Before you can start question and answering, embed the reference documents, as shown in the next section.

Embed reference documents

For this RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.

An Amazon Elastic Compute Cloud (Amazon EC2) instance has been created for the PDF document ingestion and an Amazon Elastic File System (Amazon EFS) file system is mounted on the EC2 instance to save the PDF documents. An AWS DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.

To ingest the reference documents, complete the following steps:

  1. Retrieve the sample EC2 instance ID that was created (see the AWS CDK output JumpHostId) and connect using Session Manager, a capability of AWS Systems Manager. For instructions, refer to Connect to your Linux instance with AWS Systems Manager Session Manager.
  2. Go to the directory /mnt/efs/fs1, which is where the EFS file system is mounted, and create a folder called ingest:
    $ cd /mnt/efs/fs1
    $ mkdir ingest && cd ingest

  3. Add your reference PDF documents to the ingest directory.

The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.

The DataSync task runs on an hourly schedule; you can optionally start the task manually to start the embedding process immediately for the PDF documents you added.

  4. To start the task, locate the task ID from the AWS CDK output DataSyncTaskID and start the task with defaults.

After the embeddings are created, you can start the RAG question and answering through a Jupyter notebook, as shown in the next section.

Question and answering using a Jupyter notebook

Complete the following steps:

  1. Retrieve the SageMaker notebook instance name from the AWS CDK output NotebookInstanceName and connect to JupyterLab from the SageMaker console.
  2. Go to the directory fmops/full-stack/pattern1-rag/notebooks/.
  3. Open and run the notebook query-llm.ipynb in the notebook instance to perform question and answering using RAG.

Make sure to use the conda_python3 kernel for the notebook.

This pattern is useful to explore the backend solution without needing to provision additional prerequisites that are required for the full-stack application. The next section covers the implementation of a full-stack application, including both the frontend and backend components, to provide a user interface for interacting with your generative AI application.

Option 2: Deploy the full-stack sample application with a Streamlit frontend

This pattern allows you to deploy the solution with a user frontend interface for question and answering.

Prerequisites

To deploy the sample application, you must have the following prerequisites:

  • SageMaker JumpStart model endpoint deployed – Deploy the models to your SageMaker real-time endpoints using SageMaker JumpStart, as outlined in the previous section, using the provided notebooks.
  • Amazon Route 53 hosted zone – Create an Amazon Route 53 public hosted zone to use for this solution. You can also use an existing Route 53 public hosted zone, such as example.com.
  • AWS Certificate Manager certificate – Provision an AWS Certificate Manager (ACM) TLS certificate for the Route 53 hosted zone domain name and its applicable subdomains, such as example.com and *.example.com for all subdomains. For instructions, refer to Requesting a public certificate. This certificate is used to configure HTTPS on Amazon CloudFront and the origin load balancer.
  • Deployment parameters – Record the following:
    • Frontend application custom domain name – A custom domain name used to access the frontend sample application. The domain name provided is used to create a Route 53 DNS record pointing to the frontend CloudFront distribution; for example, app.example.com.
    • Load balancer origin custom domain name – A custom domain name used for the CloudFront distribution load balancer origin. The domain name provided is used to create a Route 53 DNS record pointing to the origin load balancer; for example, app-lb.example.com.
    • Route 53 hosted zone ID – The Route 53 hosted zone ID to host the custom domain names provided; for example, ZXXXXXXXXYYYYYYYYY.
    • Route 53 hosted zone name – The name of the Route 53 hosted zone to host the custom domain names provided; for example, example.com.
    • ACM certificate ARN – The ARN of the ACM certificate to be used with the custom domain provided.
    • Text model endpoint name – The endpoint name of the text generation model deployed with SageMaker JumpStart.
    • Embeddings model endpoint name – The endpoint name of the embedding model deployed with SageMaker JumpStart.

Deploy the resources using the AWS CDK

Use the deployment parameters you noted in the prerequisites to deploy the AWS CDK stack. For more information, refer to Getting started with the AWS CDK.

Make sure Docker is installed and running on the workstation that will be used for the AWS CDK deployment.

$ cd pattern1-rag/cdk
$ cdk deploy --all -c appCustomDomainName=<Enter Custom Domain Name to be used for Frontend App> \
    -c loadBalancerOriginCustomDomainName=<Enter Custom Domain Name to be used for Load Balancer Origin> \
    -c customDomainRoute53HostedZoneID=<Enter Route53 Hosted Zone ID for the Custom Domain being used> \
    -c customDomainRoute53HostedZoneName=<Enter Route53 Hosted Zone Name> \
    -c customDomainCertificateArn=<Enter ACM Certificate ARN for Custom Domains provided> \
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> \
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

In the preceding code, -c represents a context value for a required prerequisite, provided as input. Alternatively, you can enter the context values in a file called cdk.context.json in the pattern1-rag/cdk directory and run cdk deploy --all.

Note that we specify the Region in the file bin/cdk.ts. Configuring ALB access logs requires a specified Region. You can change this Region before deployment.

The deployment will print out the URL to access the Streamlit application. Before you can start question and answering, you need to embed the reference documents, as shown in the next section.

Embed the reference documents

For a RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.

As we discussed in the first deployment option, an example EC2 instance has been created for the PDF document ingestion and an EFS file system is mounted on the EC2 instance to save the PDF documents. A DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.

To ingest the reference documents, complete the following steps:

  1. Retrieve the sample EC2 instance ID that was created (see the AWS CDK output JumpHostId) and connect using Session Manager.
  2. Go to the directory /mnt/efs/fs1, which is where the EFS file system is mounted, and create a folder called ingest:
    $ cd /mnt/efs/fs1
    $ mkdir ingest && cd ingest

  3. Add your reference PDF documents to the ingest directory.

The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.

The DataSync task runs on an hourly schedule. You can optionally start the task manually to start the embedding process immediately for the PDF documents you added.

  4. To start the task, locate the task ID from the AWS CDK output DataSyncTaskID and start the task with defaults.

Question and answering

After the reference documents have been embedded, you can start the RAG question and answering by visiting the URL to access the Streamlit application. The application uses an Amazon Cognito authentication layer, so first-time access requires creating a user account in the Amazon Cognito user pool deployed via the AWS CDK (see the AWS CDK output for the user pool name). For instructions on creating an Amazon Cognito user, refer to Creating a new user in the AWS Management Console.

Embed drift analysis

In this section, we show you how to perform drift analysis by first creating a baseline of the reference data embeddings and prompt embeddings, and then creating a snapshot of the embeddings over time. This allows you to compare the baseline embeddings to the snapshot embeddings.

Create an embedding baseline for the reference data and prompt

To create an embedding baseline of the reference data, open the AWS Glue console and select the ETL job embedding-drift-analysis. Set the parameters for the ETL job as follows and run the job (a programmatic alternative is sketched after this list):

  • Set --job_type to BASELINE.
  • Set --out_table to the Amazon DynamoDB table for reference embedding data. (See the AWS CDK output DriftTableReference for the table name.)
  • Set --centroid_table to the DynamoDB table for reference centroid data. (See the AWS CDK output CentroidTableReference for the table name.)
  • Set --data_path to the S3 bucket with the prefix; for example, s3://<REPLACE_WITH_BUCKET_NAME>/embeddingarchive/. (See the AWS CDK output BucketName for the bucket name.)
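If you prefer to start the job programmatically instead of from the AWS Glue console, a boto3 sketch follows; the job name and argument keys come from this post, while the bracketed values are placeholders for your own CDK outputs.

import boto3

glue = boto3.client("glue")

# Baseline run for the reference embedding data. For the prompt baseline, swap in
# the DriftTablePromptsName and CentroidTablePrompts outputs and the promptarchive/
# prefix; for later snapshots, change --job_type to SNAPSHOT.
glue.start_job_run(
    JobName="embedding-drift-analysis",
    Arguments={
        "--job_type": "BASELINE",
        "--out_table": "<DriftTableReference value>",
        "--centroid_table": "<CentroidTableReference value>",
        "--data_path": "s3://<REPLACE_WITH_BUCKET_NAME>/embeddingarchive/",
    },
)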

Similarly, using the ETL job embedding-drift-analysis, create an embedding baseline of the prompts. Set the parameters for the ETL job as follows and run the job:

  • Set --job_type to BASELINE.
  • Set --out_table to the DynamoDB table for prompt embedding data. (See the AWS CDK output DriftTablePromptsName for the table name.)
  • Set --centroid_table to the DynamoDB table for prompt centroid data. (See the AWS CDK output CentroidTablePrompts for the table name.)
  • Set --data_path to the S3 bucket with the prefix; for example, s3://<REPLACE_WITH_BUCKET_NAME>/promptarchive/. (See the AWS CDK output BucketName for the bucket name.)

Create an embedding snapshot for the reference data and prompt

After you ingest additional information into OpenSearch Service, run the ETL job embedding-drift-analysis again to snapshot the reference data embeddings. The parameters will be the same as the ETL job that you ran to create the embedding baseline of the reference data as shown in the previous section, with the exception of setting the --job_type parameter to SNAPSHOT.

Similarly, to snapshot the prompt embeddings, run the ETL job embedding-drift-analysis again. The parameters will be the same as the ETL job that you ran to create the embedding baseline for the prompts as shown in the previous section, with the exception of setting the --job_type parameter to SNAPSHOT.

Compare the baseline to the snapshot

To compare the embedding baseline and snapshot for reference data and prompts, use the provided notebook pattern1-rag/notebooks/drift-analysis.ipynb.

To look at embedding comparison for reference data or prompts, change the DynamoDB table name variables (tbl and c_tbl) in the notebook to the appropriate DynamoDB table for each run of the notebook.

The notebook variable tbl should be changed to the appropriate drift table name. The following is an example of where to configure the variable in the notebook.
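A hypothetical notebook cell, with a placeholder for the value from your AWS CDK outputs:

tbl = "<drift table name from the AWS CDK output>"   # e.g., the DriftTableReference value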

The table names can be retrieved as follows:

  • For the reference embedding data, retrieve the drift table name from the AWS CDK output DriftTableReference
  • For the prompt embedding data, retrieve the drift table name from the AWS CDK output DriftTablePromptsName

In addition, the notebook variable c_tbl should be changed to the appropriate centroid table name. The following is an example of where to configure the variable in the notebook.
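A hypothetical notebook cell, again with a placeholder for the value from your AWS CDK outputs:

c_tbl = "<centroid table name from the AWS CDK output>"   # e.g., the CentroidTableReference value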

The table names can be retrieved as follows:

  • For the reference embedding data, retrieve the centroid table name from the AWS CDK output CentroidTableReference
  • For the prompt embedding data, retrieve the centroid table name from the AWS CDK output CentroidTablePrompts

Analyze the prompt distance from the reference data

First, run the AWS Glue job embedding-distance-analysis. This job determines which cluster, from the K-Means evaluation of the reference data embeddings, each prompt belongs to. It then calculates the mean, median, and standard deviation of the distance from each prompt to the center of its corresponding cluster.

You can run the notebook pattern1-rag/notebooks/distance-analysis.ipynb to see the trends in the distance metrics over time. This will give you a sense of the overall trend in the distribution of the prompt embedding distances.

The notebook pattern1-rag/notebooks/prompt-distance-outliers.ipynb is an AWS Glue notebook that looks for outliers, which can help you identify whether you’re getting more prompts that are not related to the reference data.

Monitor similarity scores

All similarity scores from OpenSearch Service are logged in Amazon CloudWatch under the rag namespace. The dashboard RAG_Scores shows the average score and the total number of scores ingested.

Clean up

To avoid incurring future charges, delete all the resources that you created.

Delete the deployed SageMaker models

Refer to the cleanup section of the provided example notebook to delete the deployed SageMaker JumpStart models, or you can delete the models on the SageMaker console.

Delete the AWS CDK resources

If you entered your parameters in a cdk.context.json file, clean up as follows:

$ cd pattern1-rag/cdk
$ cdk destroy --all

If you entered your parameters on the command line and only deployed the backend application (the backend AWS CDK stack), clean up as follows:

$ cd pattern1-rag/cdk
$ cdk destroy --all \
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> \
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

If you entered your parameters on the command line and deployed the full solution (the frontend and backend AWS CDK stacks), clean up as follows:

$ cd pattern1-rag/cdk
$ cdk destroy --all -c appCustomDomainName=<Enter Custom Domain Name to be used for Frontend App> \
    -c loadBalancerOriginCustomDomainName=<Enter Custom Domain Name to be used for Load Balancer Origin> \
    -c customDomainRoute53HostedZoneID=<Enter Route53 Hosted Zone ID for the Custom Domain being used> \
    -c customDomainRoute53HostedZoneName=<Enter Route53 Hosted Zone Name> \
    -c customDomainCertificateArn=<Enter ACM Certificate ARN for Custom Domains provided> \
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> \
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

Conclusion

In this post, we provided a working example of an application that captures embedding vectors for both reference data and prompts in the RAG pattern for generative AI. We showed how to perform clustering analysis to determine whether reference or prompt data is drifting over time, and how well the reference data covers the types of questions users are asking. If you detect drift, it can provide a signal that the environment has changed and your model is getting new inputs that it may not be optimized to handle. This allows for proactive evaluation of the current model against changing inputs.


About the Authors

Abdullahi Olaoye is a Senior Solutions Architect at Amazon Web Services (AWS). Abdullahi holds an MSc in Computer Networking from Wichita State University and is a published author who has held roles across various technology domains such as DevOps, infrastructure modernization, and AI. He is currently focused on Generative AI and plays a key role in helping enterprises architect and build cutting-edge solutions powered by Generative AI. Beyond the realm of technology, he finds joy in the art of exploration. When not crafting AI solutions, he enjoys traveling with his family to explore new places.

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

Shelbee Eigenbrode is a Principal AI and Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.

Read More