Generative AI to quantify uncertainty in weather forecasting

Accurate weather forecasts can have a direct impact on people’s lives, from helping make routine decisions, like what to pack for a day’s activities, to informing urgent actions, for example, protecting people in the face of hazardous weather conditions. The importance of accurate and timely weather forecasts will only increase as the climate changes. Recognizing this, we at Google have been investing in weather and climate research to help ensure that the forecasting technology of tomorrow can meet the demand for reliable weather information. Some of our recent innovations include MetNet-3, Google’s high-resolution forecasts up to 24 hours into the future, and GraphCast, a weather model that can predict weather up to 10 days ahead.

Weather is inherently stochastic. To quantify the uncertainty, traditional methods rely on physics-based simulation to generate an ensemble of forecasts. However, generating an ensemble large enough to discern and accurately characterize rare and extreme weather events is computationally costly.

With that in mind, we are excited to announce our latest innovation designed to accelerate progress in weather forecasting, Scalable Ensemble Envelope Diffusion Sampler (SEEDS), recently published in Science Advances. SEEDS is a generative AI model that can efficiently generate ensembles of weather forecasts at scale at a small fraction of the cost of traditional physics-based forecasting models. This technology opens up novel opportunities for weather and climate science, and it represents one of the first applications of probabilistic diffusion models, a generative AI technology behind recent advances in media generation, to weather and climate forecasting.

The need for probabilistic forecasts: the butterfly effect

In December 1972, at the American Association for the Advancement of Science meeting in Washington, D.C., MIT meteorology professor Ed Lorenz gave a talk entitled, “Does the Flap of a Butterfly’s Wings in Brazil Set Off a Tornado in Texas?” which contributed to the term “butterfly effect”. He was building on his earlier, landmark 1963 paper where he examined the feasibility of “very-long-range weather prediction” and described how errors in initial conditions grow exponentially when integrated in time with numerical weather prediction models. This exponential error growth, known as chaos, results in a deterministic predictability limit that restricts the use of individual forecasts in decision making, because they do not quantify the inherent uncertainty of weather conditions. This is particularly problematic when forecasting extreme weather events, such as hurricanes, heatwaves, or floods.
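
In idealized terms (a standard textbook sketch, not a result specific to SEEDS), an initial-condition error δ(0) grows roughly as

\delta(t) \approx \delta(0)\, e^{\lambda t}, \qquad t_{\text{limit}} \approx \frac{1}{\lambda} \ln\!\left(\frac{\Delta_{\text{tol}}}{\delta(0)}\right),

where λ is the leading Lyapunov exponent of the flow and Δ_tol is the error level at which a forecast stops being useful. Beyond t_limit, a single deterministic forecast carries little information, which is why the uncertainty itself must be forecast.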

Recognizing the limitations of deterministic forecasts, weather agencies around the world issue probabilistic forecasts. Such forecasts are based on ensembles of deterministic forecasts, each of which is generated by including synthetic noise in the initial conditions and stochasticity in the physical processes. Leveraging the fast error growth rate in weather models, the forecasts in an ensemble are purposefully different: the initial uncertainties are tuned to generate runs that are as different as possible and the stochastic processes in the weather model introduce additional differences during the model run. The error growth is mitigated by averaging all the forecasts in the ensemble and the variability in the ensemble of forecasts quantifies the uncertainty of the weather conditions.
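
Operationally, the headline products are simple reductions over the ensemble dimension: the ensemble mean and the ensemble spread. A toy NumPy sketch (with a synthetic stand-in ensemble, not real model output) illustrates the idea:

import numpy as np

rng = np.random.default_rng(0)
# Stand-in ensemble: 50 members of a 2 m temperature field on a 181 x 360 grid.
ensemble = rng.normal(loc=288.0, scale=1.5, size=(50, 181, 360))

ens_mean = ensemble.mean(axis=0)    # averaging damps the chaotic error growth
ens_spread = ensemble.std(axis=0)   # member-to-member spread quantifies uncertainty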

While effective, generating these probabilistic forecasts is computationally costly. They require running highly complex numerical weather models on massive supercomputers multiple times. Consequently, many operational weather forecasts can only afford to generate ~10–50 ensemble members for each forecast cycle. This is a problem for users concerned with the likelihood of rare but high-impact weather events, which typically require much larger ensembles to assess beyond a few days. For instance, one would need a 10,000-member ensemble to forecast the likelihood of events with 1% probability of occurrence with a relative error less than 10%. Quantifying the probability of such extreme events could be useful, for example, for emergency management preparation or for energy traders.
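
To see where a number like 10,000 comes from, treat the ensemble as N independent draws and the event count as binomial. The relative standard error of the estimated probability is then approximately

\frac{\sigma_{\hat{p}}}{p} \approx \sqrt{\frac{1-p}{N\,p}} = \sqrt{\frac{0.99}{10{,}000 \times 0.01}} \approx 0.10,

i.e., about 10% relative error for a 1%-probability event with a 10,000-member ensemble, consistent with the figure quoted above.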

SEEDS: AI-enabled advances

In the aforementioned paper, we present the Scalable Ensemble Envelope Diffusion Sampler (SEEDS), a generative AI technology for weather forecast ensemble generation. SEEDS is based on denoising diffusion probabilistic models, a state-of-the-art generative AI method pioneered in part by Google Research.

SEEDS can generate a large ensemble conditioned on as few as one or two forecasts from an operational numerical weather prediction system. The generated ensembles not only yield plausible real-weather–like forecasts but also match or exceed physics-based ensembles in skill metrics such as the rank histogram, the root-mean-squared error (RMSE), and the continuous ranked probability score (CRPS). In particular, the generated ensembles assign more accurate likelihoods to the tail of the forecast distribution, such as ±2σ and ±3σ weather events. Most importantly, the computational cost of the model is negligible when compared to the hours of computational time needed by supercomputers to make a forecast. It has a throughput of 256 ensemble members (at 2° resolution) per 3 minutes on Google Cloud TPUv3-32 instances and can easily scale to higher throughput by deploying more accelerators.

SEEDS generates an order of magnitude more samples to in-fill distributions of weather patterns.

Generating plausible weather forecasts

Generative AI is known to generate very detailed images and videos. This property is especially useful for generating ensemble forecasts that are consistent with plausible weather patterns, which ultimately result in the most added value for downstream applications. As Lorenz points out, “The [weather forecast] maps which they produce should look like real weather maps.” The figure below contrasts the forecasts from SEEDS to those from the operational U.S. weather prediction system (Global Ensemble Forecast System, GEFS) for a particular date during the 2022 European heat waves. We also compare the results to the forecasts from a Gaussian model that predicts the univariate mean and standard deviation of each atmospheric field at each location, a common and computationally efficient but less sophisticated data-driven approach. This Gaussian model is meant to characterize the output of pointwise post-processing, which ignores correlations and treats each grid point as an independent random variable. In contrast, a real weather map would have detailed correlational structures.

Because SEEDS directly models the joint distribution of the atmospheric state, it realistically captures both the spatial covariance and the correlation between mid-tropospheric geopotential and mean sea level pressure, both of which are closely related and are commonly used by weather forecasters for evaluation and verification of forecasts. Gradients in the mean sea level pressure are what drive winds at the surface, while gradients in mid-tropospheric geopotential create upper-level winds that move large-scale weather patterns.

The generated samples from SEEDS shown in the figure below (frames Ca–Ch) display a geopotential trough west of Portugal with spatial structure similar to that found in the operational U.S. forecasts or the reanalysis based on observations. Although the Gaussian model predicts the marginal univariate distributions adequately, it fails to capture cross-field or spatial correlations. This hinders the assessment of the effects that these anomalies may have on hot air intrusions from North Africa, which can exacerbate heat waves over Europe.

Stamp maps over Europe on 2022/07/14 at 0:00 UTC. The contours are for the mean sea level pressure (dashed lines mark isobars below 1010 hPa) while the heatmap depicts the geopotential height at the 500 hPa pressure level. (A) The ERA5 reanalysis, a proxy for real observations. (Ba-Bb) 2 members from the 7-day U.S. operational forecasts used as seeds to our model. (Ca-Ch) 8 samples drawn from SEEDS. (Da-Dh) 8 non-seeding members from the 7-day U.S. operational ensemble forecast. (Ea-Ed) 4 samples from a pointwise Gaussian model parameterized by the mean and variance of the entire U.S. operational ensemble.

Covering extreme events more accurately

Below we show the joint distributions of temperature at 2 meters and total column water vapor near Lisbon during the extreme heat event on 2022/07/14, at 1:00 local time. We used the 7-day forecasts issued on 2022/07/07. For each plot, we generate 16,384-member ensembles with SEEDS. The observed weather event from ERA5 is denoted by the star. The operational ensemble is also shown, with squares denoting the forecasts used to seed the generated ensembles, and triangles denoting the rest of ensemble members.

SEEDS provides better statistical coverage of the 2022/07/14 European extreme heat event, denoted by the brown star. Each plot shows the values of the total column-integrated water vapor (TCWV) vs. temperature over a grid point near Lisbon, Portugal from 16,384 samples generated by our models, shown as green dots, conditioned on 2 seeds (blue squares) taken from the 7-day U.S. operational ensemble forecasts (denoted by the sparser brown triangles). The valid forecast time is 1:00 local time. The solid contour levels correspond to iso-proportions of the kernel density of SEEDS, with the outermost one encircling 95% of the mass and 11.875% between each level.

According to the U.S. operational ensemble, the observed event was so unlikely seven days prior that none of its 31 members predicted near-surface temperatures as warm as those observed. Indeed, the event probability computed from a Gaussian kernel density estimate is lower than 1%, which means that ensembles with less than 100 members are unlikely to contain forecasts as extreme as this event. In contrast, the SEEDS ensembles are able to extrapolate from the two seeding forecasts, providing an envelope of possible weather states with much better statistical coverage of the event. This allows both quantifying the probability of the event taking place and sampling weather regimes under which it would occur. Specifically, our highly scalable generative approach enables the creation of very large ensembles that can characterize very rare events by providing samples of weather states exceeding a given threshold for any user-defined diagnostic.
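
As an illustration of this kind of tail estimate, the sketch below fits a Gaussian kernel density estimate to a small joint ensemble of temperature and water vapor and evaluates the exceedance probability of an observed extreme by sampling from the fitted density. The data here are synthetic stand-ins, not the SEEDS or GEFS ensembles.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in ensemble of (2 m temperature [K], total column water vapor [kg/m^2])
# at a grid point; in practice these would come from the generated ensemble.
temps = rng.normal(308.0, 2.0, size=512)
tcwv = rng.normal(20.0, 3.0, size=512)

kde = gaussian_kde(np.vstack([temps, tcwv]))  # joint kernel density estimate
samples = kde.resample(100_000)               # Monte Carlo draws from the KDE

t_observed = 314.0                            # hypothetical observed extreme
p_event = float(np.mean(samples[0] >= t_observed))
print(f"Estimated probability of T2m >= {t_observed} K: {p_event:.4f}")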

Conclusion and future outlook

SEEDS leverages the power of generative AI to produce ensemble forecasts comparable to those from the operational U.S. forecast system, but at an accelerated pace. The results reported in this paper need only 2 seeding forecasts from the operational system, which generates 31 forecasts in its current version. This leads to a hybrid forecasting system where a few weather trajectories computed with a physics-based model are used to seed a diffusion model that can generate additional forecasts much more efficiently. This methodology provides an alternative to the current operational weather forecasting paradigm, where the computational resources saved by the statistical emulator could be allocated to increasing the resolution of the physics-based model or issuing forecasts more frequently.

We believe that SEEDS represents just one of the many ways that AI will accelerate progress in operational numerical weather prediction in coming years. We hope this demonstration of the utility of generative AI for weather forecast emulation and post-processing will spur its application in research areas such as climate risk assessment, where generating a large number of ensembles of climate projections is crucial to accurately quantifying the uncertainty about future climate.

Acknowledgements

All SEEDS authors, Lizao Li, Rob Carver, Ignacio Lopez-Gomez, Fei Sha and John Anderson, co-authored this blog post, with Carla Bromberg as Program Lead. We also thank Tom Small who designed the animation. Our colleagues at Google Research have provided invaluable advice to the SEEDS work. Among them, we thank Leonardo Zepeda-Núñez, Zhong Yi Wan, Stephan Rasp, Stephan Hoyer, and Tapio Schneider for their inputs and useful discussion. We thank Tyler Russell for additional technical program management, as well as Alex Merose for data coordination and support. We also thank Cenk Gazen, Shreya Agrawal, and Jason Hickey for discussions in the early stage of the SEEDS work.

Provide live agent assistance for your chatbot users with Amazon Lex and Talkdesk cloud contact center

Amazon Lex provides advanced conversational artificial intelligence (AI) capabilities to enable self-service support for your organization’s contact center. With Amazon Lex, you can implement an omnichannel strategy where customers engage via phone, websites, and messaging platforms. The bots can answer FAQs, provide self-service experiences, or triage customer requests before transferring to a human agent. Amazon Lex integrates with state-of-the-art contact centers including Amazon Connect, Genesys Cloud, and Amazon Chime SDK to facilitate a seamless omnichannel experience.

This is the second post of a two-part series. The integration of Amazon Lex with Talkdesk cloud contact center is inspired by WaFd Bank (WaFd)’s digital innovation journey to enhance customer experience. In our previous post, we described how Amazon Lex integrates with the Talkdesk cloud contact center for the voice channel. In this post, we are focusing on the chat channel to show how to use Amazon Lex and the Amazon Lex Web UI to enable live agents to interact with your customers in real time. For example, the following figure shows screenshots of a chatbot transitioning a customer to a live agent chat (courtesy of WaFd Bank).

Solution overview

The following diagram illustrates the solution architecture.

In the preceding architecture, the following sequence of steps takes place in a live customer/agent conversation:

  1. Using the Amazon Lex Web UI, a customer asks to be connected to an agent. The associated Amazon Lex chatbot is configured with an escalation intent to process the incoming agent assistance request.
  2. The Amazon Lex fulfillment AWS Lambda function retrieves the Talkdesk touchpoint ID and Talkdesk OAuth secrets from AWS Secrets Manager and initiates a request to Talkdesk Digital Connect using the Start a Conversation API. In the payload, the function includes information that may be useful to an agent, such as the customer sentiment or the history of previously traversed intents. (A sketch of this fulfillment function follows the list.)
  3. If the request to the Talkdesk API is successful, a Talkdesk conversation ID is returned to Amazon Lex.
  4. The Amazon Lex fulfillment Lambda function stores the conversation ID in Amazon Lex session attributes, thus making the conversation ID accessible to the Amazon Lex Web UI.
  5. The Amazon Lex Web UI opens a communication session with agents on the Talkdesk contact center through a WebSocket API in Amazon API Gateway.
  6. The Lambda associated with the WebSocket API first stores the Talkdesk conversation ID to WebSocket client ID mappings in Amazon DynamoDB. Then, through the Talkdesk Send a Message API, the Lambda function sends the customer’s message to the agent on Talkdesk contact center.
  7. Your agent responds to the customer with a message sent through the callback Rest API in API Gateway. The payload includes the conversation ID of the active conversation.
  8. The callback Rest API is configured to support the agents’ incoming messages as well as the agent’s closing of the conversation. In order to send the agent’s message to the customer, the supporting Lambda function reads the WebSocket client ID associated to the conversation ID from the DynamoDB table. This makes sure the agent’s message is delivered to the appropriate WebSocket client ID.
  9. The agent’s response is displayed through the Amazon Lex Web UI and the customer responds or closes the chat as appropriate. Steps 6–9 are repeated as long as the conversation remains active. If the agent ends the conversation, the customer is notified and the WebSocket connection is closed.
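
To make step 2 concrete, the following is a minimal, heavily simplified sketch of what such a fulfillment function might look like. The secret names come from this post; the Talkdesk endpoints, payload schema, environment variables (TD_AUTH_HOST, TD_API_HOST), and JSON key names are placeholders for illustration, not the actual Talkdesk API contract.

import json
import os

import boto3
import urllib3

http = urllib3.PoolManager()
secrets = boto3.client("secretsmanager")

def lambda_handler(event, context):
    # Retrieve the Talkdesk touchpoint ID and OAuth client credentials.
    # The key names inside each secret are assumptions for this sketch.
    touchpoint = json.loads(
        secrets.get_secret_value(SecretId="dev/talkdesk/touchpoint/ids")["SecretString"])
    creds = json.loads(
        secrets.get_secret_value(SecretId="dev/talkdesk/client/keys")["SecretString"])

    # Client-credentials flow against a placeholder Talkdesk auth endpoint.
    token_resp = http.request(
        "POST",
        f"{os.environ['TD_AUTH_HOST']}/oauth/token",
        fields={"grant_type": "client_credentials",
                "client_id": creds["client_id"],
                "client_secret": creds["client_secret"]},
        encode_multipart=False)  # send as application/x-www-form-urlencoded
    token = json.loads(token_resp.data)["access_token"]

    # Placeholder call standing in for the Talkdesk Digital Connect
    # "Start a Conversation" API; consult the Talkdesk documentation for
    # the real endpoint and payload schema.
    payload = json.dumps({
        "touchpoint_id": touchpoint["touchpoint_id"],
        "message": {"type": "text", "content": event.get("inputTranscript", "")}})
    conv_resp = http.request(
        "POST",
        f"{os.environ['TD_API_HOST']}/digital-connect/conversations",
        body=payload,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"})

    # The returned conversation ID is then written to Lex session attributes
    # (step 4) so the Web UI can open the WebSocket session.
    return json.loads(conv_resp.data).get("id")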

In the following sections, we walk you through the steps to build the solution architecture. Dependencies among each step are cross-referenced.

Prerequisites

To implement the solution presented in this post, you should first familiarize yourself with the following AWS services and features:

Additionally, you should be familiar with the following Talkdesk services:

Prepare your Talkdesk instance for the Amazon Lex Web UI chat with an agent

This section outlines the basic steps required to configure the Talkdesk chat with agent experience using the Talkdesk Digital Connect channel. Review Talkdesk APIs for further details for any additional tasks that may be required as part of your specific implementation.

Complete the following steps:

  1. Enable Talkdesk Digital Connect on your Talkdesk instance.
  2. Configure your agents’ accounts and assign them to the agents’ queues.
  3. Build a Talkdesk Studio flow.

This will be used to send chat users to an inbox for agents to assign. A sample is provided with this solution.

  4. To create an integration for your Amazon Lex Web UI instance, in the Talkdesk Builder navigation pane, select Integrations.
  5. On the Actions tab, configure three actions using the input and output schemas provided through the following links:

  6. Create a Talkdesk Digital Connect Touchpoint.
  7. Name the Touchpoint Lex Web UI Chat and record the Touchpoint ID.

This will be stored in Secrets Manager as dev/talkdesk/touchpoint/ids.

  8. In Talkdesk Builder, choose OAuth Clients in the navigation pane to set up OAuth credentials.
  9. For Grant type, select Client credentials, and set Scope to digital-connect:write.
  10. Record the client ID and secret key from the Keys tab.

These will be stored in Secrets Manager as dev/talkdesk/client/keys and used to authenticate and communicate with the Talkdesk API.


  11. In your AWS account, store the two secrets in Secrets Manager.

The following screenshot shows the details of the Touchpoint ID as a Secrets Manager secret.

The following screenshot shows the details of the client ID as a Secrets Manager secret.
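
If you prefer to create these secrets programmatically rather than through the console, a minimal boto3 sketch follows. The secret names are the ones used in this post; the JSON key names inside each secret are illustrative, so use whatever structure your Lambda functions expect to parse.

import json
import boto3

secrets = boto3.client("secretsmanager")

secrets.create_secret(
    Name="dev/talkdesk/touchpoint/ids",
    SecretString=json.dumps({"touchpoint_id": "<YOUR_TOUCHPOINT_ID>"}))

secrets.create_secret(
    Name="dev/talkdesk/client/keys",
    SecretString=json.dumps({
        "client_id": "<YOUR_CLIENT_ID>",
        "client_secret": "<YOUR_SECRET_KEY>"}))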

Deploy the Talkdesk Amazon Lex CloudFormation template

The following AWS CloudFormation template creates all the resources of the solution architecture. This includes all necessary IAM roles to invoke API operations, run associated Lambda functions, access secrets on Secrets Manager, and store and retrieve conversation ID and WebSocket client ID pairs from DynamoDB.

To facilitate monitoring and debugging, a CloudWatch log group is created for each of the resources.

The CloudFormation template provides additional details for each of the resources.

Complete the following steps to deploy the template:

  1. Sign in to the AWS Management Console.
  2. Choose Launch Stack for your AWS Region to begin the CloudFormation stack creation process.
    US East (N. Virginia)
    US West (Oregon)
    Asia Pacific (Singapore)
    Asia Pacific (Sydney)
    Asia Pacific (Tokyo)
    Europe (Frankfurt)
    Europe (Ireland)
    Europe (London)
  3. For Stack name, enter a name.
  4. For TDAUTHHOST, enter the URL of your Talkdesk instance.
  5. Leave the other parameters as default and choose Next.
  6. Select the acknowledgement check boxes and choose Create stack.
  7. After the CloudFormation template is complete, record the values for the following keys on the Outputs tab to use in later steps:
    • APIGatewayApiKey
    • BotAliasId
    • BotId
    • CallbackRestAPI
    • WebSocketAPIEndpoint

Update the Talkdesk instance

Log in to your Talkdesk instance and complete the following steps to update your instance:

  1. In Talkdesk Builder, select Integrations in the navigation pane.
  2. On the Settings tab, locate Base path and enter the callback Rest API URL you recorded earlier.
  3. Under Other settings, set x-api-key to the value of the API Gateway key.

Deploy the Amazon Lex Web UI

The solution outlined in this post uses the Amazon Lex Web UI, a full-featured web client to deploy your Amazon Lex chatbot on your website. With the Amazon Lex Web UI, you can quickly bring your chatbot-powered application to life while minimizing time-to-value.

  1. Choose Launch Stack for the Region in which you will use your chatbot:
    US East (N. Virginia)
    US West (Oregon)
    Asia Pacific (Singapore)
    Asia Pacific (Sydney)
    Asia Pacific (Tokyo)
    Europe (Frankfurt)
    Europe (Ireland)
    Europe (London)
  2. For LexV2BotId, enter the value for BotId.
  3. For LexV2BotAliasId, enter the value for BotAliasId.
  4. Launch the stack.
  5. When deployment is complete, locate the Amazon Simple Storage Service (Amazon S3) URL for WebAppBucket.
  6. Navigate to the S3 bucket on the Amazon S3 console and download the lex-web-ui-loader-config.json file.
  7. Open the file and modify or add the following parameters:
    1. In the connect configuration section, add the new parameter talkDeskWebsocketEndpoint and set its value to the WebSocket endpoint.
    2. In the UI configuration section, set enableLiveChat to true.

  8. Upload the modified lex-web-ui-loader-config.json file and overwrite the previous version of the file in the S3 bucket.
  9. Return to the CloudFormation stack Outputs tab and find the WebAppDomainName link.

This will redirect you to a full-page version of the Amazon Lex Web UI. From here, you can test the Talkdesk integration and confirm that the bot is able to connect to Talkdesk using the WebSocket connection.

Test the solution

Now you’re ready to try the Amazon Lex and Talkdesk chat interaction:

  1. Start your Banking Bot chat window using the WebAppUrl provided as output in the CloudFormation stack.
  2. Log in to your Talkdesk Digital Connect channel and navigate to Conversations.
  3. In the Banking Bot chat window, request to talk to an agent.
  4. Watch the customer’s message being delivered to the Talkdesk Conversations Inbox.
  5. The Talkdesk agent self-assigns the conversation and starts engaging with the customer.

The following video demonstrates the chat experience.

Clean up

To clean up your resources, complete the following steps:

  1. On the AWS CloudFormation console, select Stacks in the navigation pane.
  2. Select the LexTalkdesk stack (or the stack name you provided), and select Delete.
  3. Delete the stack resources by selecting Delete stack.

Conclusion

Amazon Lex brings the power of conversational self-service to your customer preferred channels, such as phone, web chat, and messaging applications. In this post, we demonstrated a solution that provides live agent assistance on your website with Amazon Lex, Amazon Lex Web UI, and Talkdesk cloud contact center. We provided a CloudFormation stack that includes DynamoDB and Lambda resources, and a Rest API and WebSocket API in API Gateway to maintain a communication session with agents in the Talkdesk contact center.

This solution is meant to be a reference architecture or a quick implementation guide that can be tailored to suit your organization’s requirements. If you need help setting up this solution, AWS Professional Services and Talkdesk are available to help you and your team through the process of selecting the right technologies for your cloud contact center.


About the authors

Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specialises in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family time.

Austin Johnson is a Solutions Architect, helping to maintain the Lex Web UI open source library.

Chris Brown is a Principal Natural Language AI consultant at AWS focused on digital customer experiences – including mobile apps, websites, marketing campaigns, and most recently conversational AI applications. Chris is an award-winning strategist and product manager – working with the Fortune 100 to deliver the best experiences for their customers. In his free time, Chris enjoys traveling, music, art, and experiencing new cultures.

Bruno Mateus is a Principal Engineer at Talkdesk. With over 20 years of experience in the software industry, he specialises in large-scale distributed systems. When not working, he enjoys spending time outside with his family, trekking, mountain bike riding, and motorcycle riding.

Jonathan Diedrich is a Principal Solutions Consultant at Talkdesk. He works on enterprise and strategic projects to ensure technical execution and adoption. Outside of work, he enjoys ice hockey and games with the family.

Crispim Tribuna is a Senior Software Engineer at Talkdesk currently focusing on the AI-based virtual agent project. He has over 17 years of experience in computer science, with a focus on telecommunications, IPTV, and fraud prevention. In his free time, he enjoys spending time with his family, running (he has completed three marathons), and riding motorcycles.

Towards a World-English Language Model

Neural Network Language Models (NNLMs) of Virtual Assistants (VAs) are generally language-, region-, and in some cases, device-dependent, which increases the effort to scale and maintain them. Combining NNLMs for one or more of the categories could be one way to improve scalability. In this work, we combine regional variants of English by building a “World English” NNLM. We examine three data sampling techniques and we experiment with adding adapter bottlenecks to the existing production NNLMs to model dialect-specific characteristics and investigate different strategies to train adapters. We… (Apple Machine Learning Research)

AutoBNN: Probabilistic time series forecasting with compositional Bayesian neural networks

Time series problems are ubiquitous, from forecasting weather and traffic patterns to understanding economic trends. Bayesian approaches start with an assumption about the data’s patterns (a prior probability), collect evidence (e.g., new time series data), and continuously update that assumption to form a posterior probability distribution. Traditional Bayesian approaches like Gaussian processes (GPs) and Structural Time Series are extensively used for modeling time series data, e.g., the commonly used Mauna Loa CO2 dataset. However, they often rely on domain experts to painstakingly select appropriate model components and may be computationally expensive. Alternatives such as neural networks lack interpretability, making it difficult to understand how they generate forecasts, and don’t produce reliable confidence intervals.

To that end, we introduce AutoBNN, a new open-source package written in JAX. AutoBNN automates the discovery of interpretable time series forecasting models, provides high-quality uncertainty estimates, and scales effectively for use on large datasets. We describe how AutoBNN combines the interpretability of traditional probabilistic approaches with the scalability and flexibility of neural networks.

AutoBNN

AutoBNN is based on a line of research that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learned kernel structures. The kernel function of a GP encodes assumptions about the function being modeled, such as the presence of trends, periodicity or noise. With learned GP kernels, the kernel function is defined compositionally: it is either a base kernel (such as Linear, Quadratic, Periodic, Matérn or ExponentiatedQuadratic) or a composite that combines two or more kernel functions using operators such as Addition, Multiplication, or ChangePoint. This compositional kernel structure serves two related purposes. First, it is simple enough that a user who is an expert about their data, but not necessarily about GPs, can construct a reasonable prior for their time series. Second, techniques like Sequential Monte Carlo can be used for discrete searches over small structures and can output interpretable results.

AutoBNN improves upon these ideas, replacing the GP with Bayesian neural networks (BNNs) while retaining the compositional kernel structure. A BNN is a neural network with a probability distribution over weights rather than a fixed set of weights. This induces a distribution over outputs, capturing uncertainty in the predictions. BNNs bring the following advantages over GPs: First, training large GPs is computationally expensive, and traditional training algorithms scale as the cube of the number of data points in the time series. In contrast, for a fixed width, training a BNN will often be approximately linear in the number of data points. Second, BNNs lend themselves better to GPU and TPU hardware acceleration than GP training operations. Third, compositional BNNs can be easily combined with traditional deep BNNs, which have the ability to do feature discovery. One could imagine “hybrid” architectures, in which users specify a top-level structure of Add(Linear, Periodic, Deep), and the deep BNN is left to learn the contributions from potentially high-dimensional covariate information.

How, then, might one translate a GP with compositional kernels into a BNN? A single-layer neural network will typically converge to a GP as the number of neurons (or “width”) goes to infinity. More recently, researchers have discovered a correspondence in the other direction — many popular GP kernels (such as Matern, ExponentiatedQuadratic, Polynomial or Periodic) can be obtained as infinite-width BNNs with appropriately chosen activation functions and weight distributions. Furthermore, these BNNs remain close to the corresponding GP even when the width is far from infinite. For example, the figures below show the difference in the covariance between pairs of observations, and regression results of the true GPs and their corresponding width-10 neural network versions.

Comparison of Gram matrices between true GP kernels (top row) and their width 10 neural network approximations (bottom row).
Comparison of regression results between true GP kernels (top row) and their width 10 neural network approximations (bottom row).

Finally, the translation is completed with BNN analogues of the Addition and Multiplication operators over GPs, and input warping to produce periodic kernels. BNN addition is straightforwardly given by adding the outputs of the component BNNs. BNN multiplication is achieved by multiplying the activations of the hidden layers of the BNNs and then applying a shared dense layer. We are therefore limited to only multiplying BNNs with the same hidden width.
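
The multiplication construction is simple enough to sketch directly. The following toy JAX example (not the AutoBNN implementation; shapes and initialization are arbitrary) multiplies the hidden activations of two single-hidden-layer networks elementwise and then applies a shared dense output layer, which is why the two components must share the same hidden width.

import jax
import jax.numpy as jnp

def relu_layer(params, x):
    # One hidden layer of a component network. In a real BNN the weights
    # would be sampled from a distribution rather than fixed.
    w, b = params
    return jax.nn.relu(x @ w + b)

def multiply_bnns(params_a, params_b, shared_dense, x):
    # "Multiplication" as described above: elementwise product of the two
    # hidden activations, followed by a shared dense output layer.
    h = relu_layer(params_a, x) * relu_layer(params_b, x)
    w_out, b_out = shared_dense
    return h @ w_out + b_out

key = jax.random.PRNGKey(0)
k1, k2, k3, kx = jax.random.split(key, 4)
params_a = (jax.random.normal(k1, (1, 10)), jnp.zeros(10))   # hidden width 10
params_b = (jax.random.normal(k2, (1, 10)), jnp.zeros(10))   # must match width
shared = (jax.random.normal(k3, (10, 1)), jnp.zeros(1))
x = jax.random.normal(kx, (32, 1))                           # 32 time points
y = multiply_bnns(params_a, params_b, shared, x)             # shape (32, 1)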

Using AutoBNN

The AutoBNN package is available within Tensorflow Probability. It is implemented in JAX and uses the flax.linen neural network library. It implements all of the base kernels and operators discussed so far (Linear, Quadratic, Matern, ExponentiatedQuadratic, Periodic, Addition, Multiplication) plus one new kernel and three new operators:

  • a OneLayer kernel, a single hidden layer ReLU BNN,
  • a ChangePoint operator that allows smoothly switching between two kernels,
  • a LearnableChangePoint operator which is the same as ChangePoint except position and slope are given prior distributions and can be learnt from the data, and
  • a WeightedSum operator.

WeightedSum combines two or more BNNs with learnable mixing weights, where the learnable weights follow a Dirichlet prior. By default, a flat Dirichlet distribution with concentration 1.0 is used.

WeightedSums allow a “soft” version of structure discovery, i.e., training a linear combination of many possible models at once. In contrast to structure discovery with discrete structures, such as in AutoGP, this allows us to use standard gradient methods to learn structures, rather than using expensive discrete optimization. Instead of evaluating potential combinatorial structures in series, WeightedSum allows us to evaluate them in parallel.

To easily enable exploration, AutoBNN defines a number of model structures that contain either top-level or internal WeightedSums. The names of these models can be used as the first parameter in any of the estimator constructors, and include things like sum_of_stumps (the WeightedSum over all the base kernels) and sum_of_shallow (which adds all possible combinations of base kernels with all operators).

Illustration of the sum_of_stumps model. The bars in the top row show the amount by which each base kernel contributes, and the bottom row shows the function represented by the base kernel. The resulting weighted sum is shown on the right.
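
To try one of these predefined structures, a minimal sketch might look like the following; it mirrors the estimator call shown later in this post, and the training-data variables are placeholders.

import jax
import autobnn as ab

estimator = ab.estimators.AutoBnnMapEstimator(
    'sum_of_stumps',                      # predefined WeightedSum over all base kernels
    'normal_likelihood_logistic_noise',
    jax.random.PRNGKey(0),
    periods=[12])

estimator.fit(my_training_data_xs, my_training_data_ys)
low, mid, high = estimator.predict_quantiles(my_training_data_xs)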

The figure below demonstrates the technique of structure discovery on the N374 (a time series of yearly financial data starting from 1949) from the M3 dataset. The six base structures were ExponentiatedQuadratic (which is the same as the Radial Basis Function kernel, or RBF for short), Matern, Linear, Quadratic, OneLayer and Periodic kernels. The figure shows the MAP estimates of their weights over an ensemble of 32 particles. All of the high likelihood particles gave a large weight to the Periodic component, low weights to Linear, Quadratic and OneLayer, and a large weight to either RBF or Matern.

Parallel coordinates plot of the MAP estimates of the base kernel weights over 32 particles. The sum_of_stumps model was trained on the N374 series from the M3 dataset (inset in blue). Darker lines correspond to particles with higher likelihoods.

By using WeightedSums as the inputs to other operators, it is possible to express rich combinatorial structures, while keeping models compact and the number of learnable weights small. As an example, we include the sum_of_products model (illustrated in the figure below) which first creates a pairwise product of two WeightedSums, and then a sum of the two products. By setting some of the weights to zero, we can create many different discrete structures. The total number of possible structures in this model is 2^16 (65,536), since there are 16 base kernels that can be turned on or off. All these structures are explored implicitly by training just this one model.

Illustration of the “sum_of_products” model. Each of the four WeightedSums have the same structure as the “sum_of_stumps” model.

We have found, however, that certain combinations of kernels (e.g., the product of Periodic and either the Matern or ExponentiatedQuadratic) lead to overfitting on many datasets. To prevent this, we have defined model classes like sum_of_safe_shallow that exclude such products when performing structure discovery with WeightedSums.

For training, AutoBNN provides AutoBnnMapEstimator and AutoBnnMCMCEstimator to perform MAP and MCMC inference, respectively. Either estimator can be combined with any of the six likelihood functions, including four based on normal distributions with different noise characteristics for continuous data and two based on the negative binomial distribution for count data.

Result from running AutoBNN on the Mauna Loa CO2 dataset in our example colab. The model captures the trend and seasonal component in the data. Extrapolating into the future, the mean prediction slightly underestimates the actual trend, while the 95% confidence interval gradually increases.

To fit a model like in the figure above, all it takes is the following 10 lines of code, using the scikit-learn–inspired estimator interface:

import jax
import autobnn as ab

model = ab.operators.Add(
    bnns=(ab.kernels.PeriodicBNN(width=50),
          ab.kernels.LinearBNN(width=50),
          ab.kernels.MaternBNN(width=50)))

estimator = ab.estimators.AutoBnnMapEstimator(
    model, 'normal_likelihood_logistic_noise', jax.random.PRNGKey(42),
    periods=[12])

estimator.fit(my_training_data_xs, my_training_data_ys)
low, mid, high = estimator.predict_quantiles(my_training_data_xs)

Conclusion

AutoBNN provides a powerful and flexible framework for building sophisticated time series prediction models. By combining the strengths of BNNs and GPs with compositional kernels, AutoBNN opens a world of possibilities for understanding and forecasting complex data. We invite the community to try the colab, and leverage this library to innovate and solve real-world challenges.

Acknowledgements

AutoBNN was written by Colin Carroll, Thomas Colthurst, Urs Köster and Srinivas Vasudevan. We would like to thank Kevin Murphy, Brian Patton and Feras Saad for their advice and feedback.

Advanced RAG patterns on Amazon SageMaker

Today, customers across industries—whether financial services, healthcare and life sciences, travel and hospitality, media and entertainment, telecommunications, software as a service (SaaS), or even proprietary model providers—are using large language models (LLMs) to build applications like question and answering (QnA) chatbots, search engines, and knowledge bases. These generative AI applications are not only used to automate existing business processes, but also have the ability to transform the experience for customers using these applications. With the advancements being made in LLMs like Mixtral-8x7B Instruct, a derivative of architectures such as the mixture of experts (MoE), customers are continuously looking for ways to improve the performance and accuracy of generative AI applications while allowing them to effectively use a wider range of closed and open source models.

A number of techniques are typically used to improve the accuracy and performance of an LLM’s output, such as fine-tuning with parameter efficient fine-tuning (PEFT), reinforcement learning from human feedback (RLHF), and performing knowledge distillation. However, when building generative AI applications, you can use an alternative solution that allows for the dynamic incorporation of external knowledge and allows you to control the information used for generation without the need to fine-tune your existing foundational model. This is where Retrieval Augmented Generation (RAG) comes in, specifically for generative AI applications as opposed to the more expensive and robust fine-tuning alternatives we’ve discussed. If you’re implementing complex RAG applications into your daily tasks, you may encounter common challenges with your RAG systems such as inaccurate retrieval, increasing size and complexity of documents, and overflow of context, which can significantly impact the quality and reliability of generated answers.

This post discusses RAG patterns to improve response accuracy using LangChain and tools such as the parent document retriever in addition to techniques like contextual compression in order to enable developers to improve existing generative AI applications.

Solution overview

In this post, we demonstrate the use of Mixtral-8x7B Instruct text generation combined with the BGE Large En embedding model to efficiently construct a RAG QnA system on an Amazon SageMaker notebook using the parent document retriever tool and contextual compression technique. The following diagram illustrates the architecture of this solution.

You can deploy this solution with just a few clicks using Amazon SageMaker JumpStart, a fully managed platform that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly and with ease, accelerating the development and deployment of machine learning (ML) applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as the Mixtral-8x7B, for a variety of tasks.

Mixtral-8x7B uses an MoE architecture. This architecture allows different parts of a neural network to specialize in different tasks, effectively dividing the workload among multiple experts. This approach enables the efficient training and deployment of larger models compared to traditional architectures.

One of the main advantages of the MoE architecture is its scalability. By distributing the workload across multiple experts, MoE models can be trained on larger datasets and achieve better performance than traditional models of the same size. Additionally, MoE models can be more efficient during inference because only a subset of experts needs to be activated for a given input.
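
The following toy NumPy sketch shows the sparse routing idea in isolation: a gate scores all experts per token, only the top two run, and their outputs are combined with renormalized weights. It is purely illustrative and not Mixtral’s actual routing implementation; the single-matrix “experts” and dimensions are arbitrary.

import numpy as np

def moe_layer(x, gate_w, expert_ws, top_k=2):
    # Score every expert per token, keep only the top-k, and combine their
    # outputs with softmax-renormalized weights.
    scores = x @ gate_w                                  # (tokens, n_experts)
    top = np.argsort(scores, axis=-1)[:, -top_k:]        # chosen expert indices
    out = np.zeros((x.shape[0], expert_ws[0].shape[1]))
    for t in range(x.shape[0]):
        w = np.exp(scores[t, top[t]])
        w /= w.sum()                                     # weights over selected experts
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ expert_ws[e])     # only k experts run per token
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                              # 4 tokens, hidden size 8
gate_w = rng.normal(size=(8, 8))                         # router over 8 experts
expert_ws = [rng.normal(size=(8, 16)) for _ in range(8)] # toy expert projections
y = moe_layer(x, gate_w, expert_ws)                      # shape (4, 16)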

For more information on Mixtral-8x7B Instruct on AWS, refer to Mixtral-8x7B is now available in Amazon SageMaker JumpStart. The Mixtral-8x7B model is made available under the permissive Apache 2.0 license, for use without restrictions.

In this post, we discuss how you can use LangChain to create effective and more efficient RAG applications. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.

We walk through constructing a RAG pipeline on SageMaker with Mixtral-8x7B. We use the Mixtral-8x7B Instruct text generation model with the BGE Large En embedding model to create an efficient QnA system using RAG on a SageMaker notebook. We use an ml.t3.medium instance to demonstrate deploying LLMs via SageMaker JumpStart, which can be accessed through a SageMaker-generated API endpoint. This setup allows for the exploration, experimentation, and optimization of advanced RAG techniques with LangChain. We also illustrate the integration of the FAISS Embedding store into the RAG workflow, highlighting its role in storing and retrieving embeddings to enhance the system’s performance.

We perform a brief walkthrough of the SageMaker notebook. For more detailed and step-by-step instructions, refer to the Advanced RAG Patterns with Mixtral on SageMaker Jumpstart GitHub repo.

The need for advanced RAG patterns

Advanced RAG patterns are essential to improve upon the current capabilities of LLMs in processing, understanding, and generating human-like text. As the size and complexity of documents increase, representing multiple facets of the document in a single embedding can lead to a loss of specificity. Although it’s essential to capture the general essence of a document, it’s equally crucial to recognize and represent the varied sub-contexts within. This is a challenge you are often faced with when working with larger documents. Another challenge with RAG is that with retrieval, you aren’t aware of the specific queries that your document storage system will deal with upon ingestion. This could lead to information most relevant to a query being buried under text (context overflow). To mitigate failure and improve upon the existing RAG architecture, you can use advanced RAG patterns (parent document retriever and contextual compression) to reduce retrieval errors, enhance answer quality, and enable complex question handling.

With the techniques discussed in this post, you can address key challenges associated with external knowledge retrieval and integration, enabling your application to deliver more precise and contextually aware responses.

In the following sections, we explore how parent document retrievers and contextual compression can help you deal with some of the problems we’ve discussed.

Parent document retriever

In the previous section, we highlighted challenges that RAG applications encounter when dealing with extensive documents. To address these challenges, parent document retrievers categorize and designate incoming documents as parent documents. These documents are recognized for their comprehensive nature but aren’t directly utilized in their original form for embeddings. Rather than compressing an entire document into a single embedding, parent document retrievers dissect these parent documents into child documents. Each child document captures distinct aspects or topics from the broader parent document. Following the identification of these child segments, individual embeddings are assigned to each, capturing their specific thematic essence (see the following diagram). During retrieval, the parent document is invoked. This technique provides targeted yet broad-ranging search capabilities, furnishing the LLM with a wider perspective. Parent document retrievers provide LLMs with a twofold advantage: the specificity of child document embeddings for precise and relevant information retrieval, coupled with the invocation of parent documents for response generation, which enriches the LLM’s outputs with a layered and thorough context.

Contextual compression

To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.

Prerequisites

If you’re new to SageMaker, refer to the Amazon SageMaker Development Guide.

Before you get started with the solution, create an AWS account. When you create an AWS account, you get a single sign-on (SSO) identity that has complete access to all the AWS services and resources in the account. This identity is called the AWS account root user.

Signing in to the AWS Management Console using the email address and password that you used to create the account gives you complete access to all the AWS resources in your account. We strongly recommend that you do not use the root user for everyday tasks, even the administrative ones.

Instead, adhere to the security best practices in AWS Identity and Access Management (IAM), and create an administrative user and group. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.

The Mixtral-8x7b model requires an ml.g5.48xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. In order to launch an endpoint to host Mixtral-8x7B from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint usage. You can request service quota increases through the console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.

Set up a SageMaker notebook instance and install dependencies

To get started, create a SageMaker notebook instance and install the required dependencies. Refer to the GitHub repo to ensure a successful setup. After you set up the notebook instance, you can deploy the model.

You can also run the notebook locally on your preferred integrated development environment (IDE). Make sure that you have JupyterLab installed.

Deploy the model

Deploy the Mixtral-8X7B Instruct LLM model on SageMaker JumpStart:

# Import the JumpStartModel class from the SageMaker JumpStart library
from sagemaker.jumpstart.model import JumpStartModel

# Specify the model ID for the HuggingFace Mixtral 8x7b Instruct LLM model
model_id = "huggingface-llm-mixtral-8x7b-instruct"
model = JumpStartModel(model_id=model_id)
llm_predictor = model.deploy()

Deploy the BGE Large En embedding model on SageMaker JumpStart:

# Specify the model ID for the HuggingFace BGE Large EN Embedding model
model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()

Set up LangChain

After importing all the necessary libraries and deploying the Mixtral-8x7B model and BGE Large En embeddings model, you can now set up LangChain. For step-by-step instructions, refer to the GitHub repo.

Data preparation

In this post, we use several years of Amazon’s Letters to Shareholders as a text corpus to perform QnA on. For more detailed steps to prepare the data, refer to the GitHub repo.

Question answering

Once the data is prepared, you can use the wrapper provided by LangChain, which wraps around the vector store and takes input for the LLM. This wrapper performs the following steps:

  1. Take the input question.
  2. Create a question embedding.
  3. Fetch relevant documents.
  4. Incorporate the documents and the question into a prompt.
  5. Invoke the model with the prompt and generate the answer in a readable manner.

Now that the vector store is in place, you can start asking questions:

prompt_template = """<s>[INST]
{query}
[/INST]"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
query = "How has AWS evolved?"
answer = wrapper_store_faiss.query(question=PROMPT.format(query=query), llm=llm)
print(answer)
AWS, or Amazon Web Services, has evolved significantly since its initial launch in 2006. It started as a feature-poor service, offering only one instance size, in one data center, in one region of the world, with Linux operating system instances only. There was no monitoring, load balancing, auto-scaling, or persistent storage at the time. However, AWS had a successful launch and has since grown into a multi-billion-dollar service.

Over the years, AWS has added numerous features and services, with over 3,300 new ones launched in 2022 alone. They have expanded their offerings to include Windows, monitoring, load balancing, auto-scaling, and persistent storage. AWS has also made significant investments in long-term inventions that have changed what's possible in technology infrastructure.

One example of this is their investment in chip development. AWS has also seen a robust new customer pipeline and active migrations, with many companies opting to move to AWS for the agility, innovation, cost-efficiency, and security benefits it offers. AWS has transformed how customers, from start-ups to multinational companies to public sector organizations, manage their technology infrastructure.

Regular retriever chain

In the preceding scenario, we explored the quick and straightforward way to get a context-aware answer to your question. Now let’s look at a more customizable option with the help of RetrievalQA, where you can customize how the documents fetched should be added to the prompt using the chain_type parameter. Also, in order to control how many relevant documents should be retrieved, you can change the k parameter in the following code to see different outputs. In many scenarios, you might want to know which source documents the LLM used to generate the answer. You can get those documents in the output using return_source_documents, which returns the documents that are added to the context of the LLM prompt. RetrievalQA also allows you to provide a custom prompt template that can be specific to the model.

from langchain.chains import RetrievalQA

prompt_template = """<s>[INST]
Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

[/INST]"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS (Amazon Web Services) evolved from an initially unprofitable investment to an $85B annual revenue run rate business with strong profitability, offering a wide range of services and features, and becoming a significant part of Amazon's portfolio. Despite facing skepticism and short-term headwinds, AWS continued to innovate, attract new customers, and migrate active customers, offering benefits such as agility, innovation, cost-efficiency, and security. AWS also expanded its long-term investments, including chip development, to provide new capabilities and change what's possible for its customers.

Parent document retriever chain

Let’s look at a more advanced RAG option with the help of ParentDocumentRetriever. When working with document retrieval, you may encounter a trade-off between storing small chunks of a document for accurate embeddings and larger documents to preserve more context. The parent document retriever strikes that balance by splitting and storing small chunks of data.

We use a parent_splitter to divide the original documents into larger chunks called parent documents and a child_splitter to create smaller child documents from the original documents:

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore_faiss = FAISS.from_documents(
    child_splitter.split_documents(documents),
    sagemaker_embeddings,
)

The child documents are then indexed in a vector store using embeddings. This enables efficient retrieval of relevant child documents based on similarity. To retrieve relevant information, the parent document retriever first fetches the child documents from the vector store. It then looks up the parent IDs for those child documents and returns the corresponding larger parent documents.
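
The retriever used in the following chain can be constructed with LangChain’s ParentDocumentRetriever. The sketch below assumes the standard LangChain API for this class; note that in this construction the vector store typically starts empty, because add_documents performs the child splitting and indexing while the in-memory docstore keeps the parent chunks.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# The docstore holds the larger parent chunks; the vector store indexes the
# child chunks. The splitters and vector store are the ones defined above.
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore_faiss,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)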

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS (Amazon Web Services) started with a feature-poor initial launch of the Elastic Compute Cloud (EC2) service in 2006, providing only one instance size, in one data center, in one region of the world, with Linux operating system instances only, and without many key features like monitoring, load balancing, auto-scaling, or persistent storage. However, AWS's success allowed them to quickly iterate and add the missing capabilities, eventually expanding to offer various flavors, sizes, and optimizations of compute, storage, and networking, as well as developing their own chips (Graviton) to push price and performance further. AWS's iterative innovation process required significant investments in financial and people resources over 20 years, often well in advance of when it would pay out, to meet customer needs and improve long-term customer experiences, loyalty, and returns for shareholders.

Contextual compression chain

Let’s look at another advanced RAG option called contextual compression. One challenge with retrieval is that you usually don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

The contextual compression retriever addresses the challenge of retrieving relevant information from a document storage system, where the pertinent data may be buried within documents containing a lot of text. By compressing and filtering the retrieved documents based on the given query context, only the most relevant information is returned.

To use the contextual compression retriever, you’ll need:

  • A base retriever – This is the initial retriever that fetches documents from the storage system based on the query
  • A document compressor – This component takes the initially retrieved documents and shortens them by reducing the contents of individual documents or dropping irrelevant documents altogether, using the query context to determine relevance

Adding contextual compression with an LLM chain extractor

First, wrap your base retriever with a ContextualCompressionRetriever. You’ll add an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(
    docs,
    sagemaker_embeddings,
).as_retriever()

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "How was Amazon impacted by COVID-19?"
)

Initialize the chain using the ContextualCompressionRetriever with an LLMChainExtractor and pass the prompt in via the chain_type_kwargs argument.

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS evolved by starting as a small project inside Amazon, requiring significant capital investment and facing skepticism from both inside and outside the company. However, AWS had a head start on potential competitors and believed in the value it could bring to customers and Amazon. AWS made a long-term commitment to continue investing, resulting in over 3,300 new features and services launched in 2022. AWS has transformed how customers manage their technology infrastructure and has become an $85B annual revenue run rate business with strong profitability. AWS has also continuously improved its offerings, such as enhancing EC2 with additional features and services after its initial launch.

Filter documents with an LLM chain filter

The LLMChainFilter is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents:

from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "How was Amazon impacted by COVID-19?"
)
print(compressed_docs)

Initialize the chain using the ContextualCompressionRetriever with an LLMChainFilter and pass the prompt in via the chain_type_kwargs argument.

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS (Amazon Web Services) evolved by initially launching feature-poor but iterating quickly based on customer feedback to add necessary capabilities. This approach allowed AWS to launch EC2 in 2006 with limited features and then continuously add new functionalities, such as additional instance sizes, data centers, regions, operating system options, monitoring tools, load balancing, auto-scaling, and persistent storage. Over time, AWS transformed from a feature-poor service to a multi-billion-dollar business by focusing on customer needs, agility, innovation, cost-efficiency, and security. AWS now has an $85B annual revenue run rate and offers over 3,300 new features and services each year, catering to a wide range of customers from start-ups to multinational companies and public sector organizations.

Compare results

The following comparison shows the results of two queries, “How did AWS evolve?” and “Why is Amazon successful?”, for each technique.

Regular Retriever Chain Output

Query 1 (How did AWS evolve?): AWS (Amazon Web Services) evolved from an initially unprofitable investment to an $85B annual revenue run rate business with strong profitability, offering a wide range of services and features, and becoming a significant part of Amazon’s portfolio. Despite facing skepticism and short-term headwinds, AWS continued to innovate, attract new customers, and migrate active customers, offering benefits such as agility, innovation, cost-efficiency, and security. AWS also expanded its long-term investments, including chip development, to provide new capabilities and change what’s possible for its customers.

Query 2 (Why is Amazon successful?): Amazon is successful due to its continuous innovation and expansion into new areas such as technology infrastructure services, digital reading devices, voice-driven personal assistants, and new business models like the third-party marketplace. Its ability to scale operations quickly, as seen in the rapid expansion of its fulfillment and transportation networks, also contributes to its success. Additionally, Amazon’s focus on optimization and efficiency gains in its processes has resulted in productivity improvements and cost reductions. The example of Amazon Business highlights the company’s capability to leverage its e-commerce and logistics strengths in different sectors.

Comparison: Based on the responses from the regular retriever chain, we notice that although it provides long answers, it suffers from context overflow and fails to mention any significant details from the corpus in regard to the query provided. The regular retrieval chain is not able to capture the nuances with depth or contextual insight, potentially missing critical aspects of the document.

Parent Document Retriever Output

Query 1 (How did AWS evolve?): AWS (Amazon Web Services) started with a feature-poor initial launch of the Elastic Compute Cloud (EC2) service in 2006, providing only one instance size, in one data center, in one region of the world, with Linux operating system instances only, and without many key features like monitoring, load balancing, auto-scaling, or persistent storage. However, AWS’s success allowed them to quickly iterate and add the missing capabilities, eventually expanding to offer various flavors, sizes, and optimizations of compute, storage, and networking, as well as developing their own chips (Graviton) to push price and performance further. AWS’s iterative innovation process required significant investments in financial and people resources over 20 years, often well in advance of when it would pay out, to meet customer needs and improve long-term customer experiences, loyalty, and returns for shareholders.

Query 2 (Why is Amazon successful?): Amazon is successful due to its ability to constantly innovate, adapt to changing market conditions, and meet customer needs in various market segments. This is evident in the success of Amazon Business, which has grown to drive roughly $35B in annualized gross sales by delivering selection, value, and convenience to business customers. Amazon’s investments in ecommerce and logistics capabilities have also enabled the creation of services like Buy with Prime, which helps merchants with direct-to-consumer websites drive conversion from views to purchases.

Comparison: The parent document retriever delves deeper into the specifics of AWS’s growth strategy, including the iterative process of adding new features based on customer feedback and the detailed journey from a feature-poor initial launch to a dominant market position, while providing a context-rich response. Responses cover a wide range of aspects, from technical innovations and market strategy to organizational efficiency and customer focus, providing a holistic view of the factors contributing to success along with examples. This can be attributed to the parent document retriever’s targeted yet broad-ranging search capabilities.

LLM Chain Extractor: Contextual Compression Output

Query 1 (How did AWS evolve?): AWS evolved by starting as a small project inside Amazon, requiring significant capital investment and facing skepticism from both inside and outside the company. However, AWS had a head start on potential competitors and believed in the value it could bring to customers and Amazon. AWS made a long-term commitment to continue investing, resulting in over 3,300 new features and services launched in 2022. AWS has transformed how customers manage their technology infrastructure and has become an $85B annual revenue run rate business with strong profitability. AWS has also continuously improved its offerings, such as enhancing EC2 with additional features and services after its initial launch.

Query 2 (Why is Amazon successful?): Based on the provided context, Amazon’s success can be attributed to its strategic expansion from a book-selling platform to a global marketplace with a vibrant third-party seller ecosystem, early investment in AWS, innovation in introducing the Kindle and Alexa, and substantial growth in annual revenue from 2019 to 2022. This growth led to the expansion of the fulfillment center footprint, creation of a last-mile transportation network, and building a new sortation center network, which were optimized for productivity and cost reductions.

Comparison: The LLM chain extractor maintains a balance between covering key points comprehensively and avoiding unnecessary depth. It dynamically adjusts to the query’s context, so the output is directly relevant and comprehensive.

LLM Chain Filter: Contextual Compression Output

Query 1 (How did AWS evolve?): AWS (Amazon Web Services) evolved by initially launching feature-poor but iterating quickly based on customer feedback to add necessary capabilities. This approach allowed AWS to launch EC2 in 2006 with limited features and then continuously add new functionalities, such as additional instance sizes, data centers, regions, operating system options, monitoring tools, load balancing, auto-scaling, and persistent storage. Over time, AWS transformed from a feature-poor service to a multi-billion-dollar business by focusing on customer needs, agility, innovation, cost-efficiency, and security. AWS now has an $85B annual revenue run rate and offers over 3,300 new features and services each year, catering to a wide range of customers from start-ups to multinational companies and public sector organizations.

Query 2 (Why is Amazon successful?): Amazon is successful due to its innovative business models, continuous technological advancements, and strategic organizational changes. The company has consistently disrupted traditional industries by introducing new ideas, such as an ecommerce platform for various products and services, a third-party marketplace, cloud infrastructure services (AWS), the Kindle e-reader, and the Alexa voice-driven personal assistant. Additionally, Amazon has made structural changes to improve its efficiency, such as reorganizing its US fulfillment network to decrease costs and delivery times, further contributing to its success.

Comparison: Similar to the LLM chain extractor, the LLM chain filter makes sure that although the key points are covered, the output is efficient for customers looking for concise and contextual answers.

Upon comparing these different techniques, we can see that in contexts like detailing AWS’s transition from a simple service to a complex, multi-billion-dollar entity, or explaining Amazon’s strategic successes, the regular retriever chain lacks the precision the more sophisticated techniques offer, leading to less targeted information. Although only minor differences are visible between the advanced techniques discussed, they are far more informative than the regular retriever chain.

For customers in industries such as healthcare, telecommunications, and financial services who are looking to implement RAG in their applications, the regular retriever chain’s limitations in precision, redundancy avoidance, and information compression make it less suited to these needs than the more advanced parent document retriever and contextual compression techniques. These techniques can distill vast amounts of information into the concentrated, impactful insights you need, while helping improve price-performance.

Clean up

When you’re done running the notebook, delete the resources you created in order to avoid accrual of charges for the resources in use:

# Delete resources
llm_predictor.delete_model()
llm_predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

Conclusion

In this post, we presented a solution that allows you to implement the parent document retriever and contextual compression chain techniques to enhance the ability of LLMs to process and generate information. We tested out these advanced RAG techniques with the Mixtral-8x7B Instruct and BGE Large En models available with SageMaker JumpStart. We also explored using persistent storage for embeddings and document chunks and integration with enterprise data stores.

The techniques we demonstrated not only refine the way LLMs access and incorporate external knowledge, but also significantly improve the quality, relevance, and efficiency of their outputs. By combining retrieval from large text corpora with language generation capabilities, these advanced RAG techniques enable LLMs to produce more factual, coherent, and context-appropriate responses, enhancing their performance across various natural language processing tasks.

SageMaker JumpStart is at the center of this solution. With SageMaker JumpStart, you gain access to an extensive assortment of open and closed source models, streamlining the process of getting started with ML and enabling rapid experimentation and deployment. To get started deploying this solution, navigate to the notebook in the GitHub repo.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and Data Analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions and conducting research to help customers hyper-scale on AWS. Marco is a digital native cloud advisor with experience in the FinTech, Healthcare & Life Sciences, Software-as-a-service, and most recently, in Telecommunications industries. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA and enjoys writing, reading, exercising, and building applications in his free time.

AJ Dhimine is a Solutions Architect at AWS. He specializes in generative AI, serverless computing and data analytics. He is an active member/mentor in Machine Learning Technical Field Community and has published several scientific papers on various AI/ML topics. He works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. He is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges. Outside of work, AJ enjoys traveling, and is currently at 53 countries with a goal of visiting every country in the world.

Read More

Efficient continual pre-training LLMs for financial domains

Efficient continual pre-training LLMs for financial domains

Large language models (LLMs) are generally trained on large publicly available datasets that are domain agnostic. For example, Meta’s Llama models are trained on datasets such as CommonCrawl, C4, Wikipedia, and ArXiv. These datasets encompass a broad range of topics and domains. Although the resulting models yield amazingly good results for general tasks, such as text generation and entity recognition, there is evidence that models trained with domain-specific datasets can further improve LLM performance. For example, the training data used for BloombergGPT is 51% domain-specific documents, including financial news, filings, and other financial materials. The resulting LLM outperforms LLMs trained on non-domain-specific datasets when tested on finance-specific tasks. The authors of BloombergGPT concluded that their model outperforms all other models tested for four of the five financial tasks. The model provided even better performance when tested for Bloomberg’s internal financial tasks by a wide margin—as much as 60 points better (out of 100). Although you can learn more about the comprehensive evaluation results in the paper, the following sample captured from the BloombergGPT paper can give you a glimpse of the benefit of training LLMs using financial domain-specific data. As shown in the example, the BloombergGPT model provided correct answers while other non-domain-specific models struggled:

Input: CEO of Assicurazioni Generali?
BloombergGPT: Philippe Donnet
GPT-NeoX: Antonio De Lorenzo, Simone Gambarini, Enrico Zanetti
FLAN-T5-XXL: John M Forsyth, Christopher K Peters, {empty string}

Input: CEO of Silicon Valley Bank?
BloombergGPT: Greg Becker
GPT-NeoX: John Hodor, Mike Fridgen, Peter R Briger
FLAN-T5-XXL: Christian Klein, John M Forsyth, John M Forsyth

Input: CEO of IHS Markit Ltd?
BloombergGPT: Lance Uggla
GPT-NeoX: John F Lechleiter, David Steward, Niall O Broin
FLAN-T5-XXL: William M Hawkins, John M Forsyth, John M Forsyth

Input: CEO of Walgreens Boots Alliance?
BloombergGPT: Rosalind Brewer
GPT-NeoX: Mark V Dei, Stefano Pessina, Stefano Pessina
FLAN-T5-XXL: Christian Klein, John M Forsyth, John M Forsyth

Input: CEO of Citigroup Inc?
BloombergGPT: Jane Fraser
GPT-NeoX: Michael L Corbat, Michael L Corbat, Michael L Corbat
FLAN-T5-XXL: Christian Sewing, John M Forsyth, John M Forsyth

This post provides a guide to training LLMs specifically for the financial domain. We cover the following key areas:

  • Data collection and preparation – Guidance on sourcing and curating relevant financial data for effective model training
  • Continual pre-training vs. fine-tuning – When to use each technique to optimize your LLM’s performance
  • Efficient continual pre-training – Strategies to streamline the continual pre-training process, saving time and resources

This post brings together the expertise of the applied science research team within Amazon Finance Technology and the AWS Worldwide Specialist team for the Global Financial Industry. Some of the content is based on the paper Efficient Continual Pre-training for Building Domain Specific Large Language Models.

Collecting and preparing finance data

Domain continual pre-training necessitates a large-scale, high-quality, domain-specific dataset. The following are the main steps for domain dataset curation:

  • Identify data sources – Potential data sources for domain corpus include open web, Wikipedia, books, social media, and internal documents.
  • Domain data filters – Because the ultimate goal is to curate a domain corpus, you might need to apply additional steps to filter out samples that are irrelevant to the target domain. This removes unhelpful text from the corpus used for continual pre-training and reduces training cost.
  • Preprocessing – You might consider a series of preprocessing steps to improve data quality and training efficiency. For example, certain data sources can contain a fair number of noisy tokens; deduplication is considered a useful step to improve data quality and reduce training cost.

To develop financial LLMs, you can use two important data sources: News CommonCrawl and SEC filings. An SEC filing is a financial statement or other formal document submitted to the US Securities and Exchange Commission (SEC). Publicly listed companies are required to file various documents regularly. This creates a large number of documents over the years. News CommonCrawl is a dataset released by CommonCrawl in 2016. It contains news articles from news sites all over the world.

News CommonCrawl is available on Amazon Simple Storage Service (Amazon S3) in the commoncrawl bucket at crawl-data/CC-NEWS/. You can get the listings of files using the AWS Command Line Interface (AWS CLI) and the following command:

aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

In Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors use a URL and keyword-based approach to filter financial news articles from generic news. Specifically, the authors maintain a list of important financial news outlets and a set of keywords related to financial news. An article is identified as financial news if it either comes from one of the financial news outlets or any of the keywords show up in the URL. This simple yet effective approach enables you to identify financial news from not only financial news outlets but also finance sections of generic news outlets.
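
As a rough illustration of this filtering step, the following sketch checks each article URL against a list of outlets and keywords. The outlet domains and keywords shown here are placeholders for illustration, not the list used by the authors:

# Hypothetical outlet domains and keywords, for illustration only
FINANCIAL_OUTLETS = {"bloomberg.com", "reuters.com", "ft.com"}
FINANCIAL_KEYWORDS = {"finance", "market", "stock", "economy", "earnings"}

def is_financial_news(url: str) -> bool:
    """Label an article as financial news if it comes from a financial
    outlet or if any financial keyword appears in the URL."""
    url_lower = url.lower()
    from_outlet = any(outlet in url_lower for outlet in FINANCIAL_OUTLETS)
    has_keyword = any(keyword in url_lower for keyword in FINANCIAL_KEYWORDS)
    return from_outlet or has_keyword

# Example: keep only financial articles from a list of (url, text) pairs
# articles = [(url, text) for url, text in articles if is_financial_news(url)]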

SEC filings are available online through the SEC’s EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database, which provides open data access. You can scrape the filings from EDGAR directly, or use APIs in Amazon SageMaker with a few lines of code, for any period of time and for a large number of tickers (i.e., the SEC assigned identifier). To learn more, refer to SEC Filing Retrieval.

The following table summarizes the key details of both data sources.

Dataset     News CommonCrawl      SEC Filing
Coverage    2016-2022             1993-2022
Size        25.8 billion words    5.1 billion words

The authors go through a few extra preprocessing steps before the data is fed into a training algorithm. First, because SEC filings contain noisy text due to the removal of tables and figures, they remove short sentences that are deemed to be table or figure labels. Second, they apply a locality sensitive hashing algorithm to deduplicate the news articles and filings. For SEC filings, they deduplicate at the section level instead of the document level. Lastly, they concatenate documents into a long string, tokenize it, and chunk the tokenization into pieces of the maximum input length supported by the model to be trained. This improves the throughput of continual pre-training and reduces the training cost.
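
The concatenate-tokenize-chunk step can be sketched as follows. The tokenizer checkpoint and maximum length here are placeholders; in practice, they should match the model that will be continually pre-trained:

from transformers import AutoTokenizer

# Placeholder tokenizer; use the tokenizer of the model to be trained
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
max_length = 2048  # maximum input length supported by the model

def pack_documents(documents, eos_token="<|endoftext|>"):
    """Concatenate documents, tokenize the result, and split the token
    stream into fixed-length chunks for continual pre-training."""
    text = eos_token.join(documents)
    token_ids = tokenizer(text)["input_ids"]
    return [
        token_ids[i : i + max_length]
        for i in range(0, len(token_ids), max_length)
    ]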

Continual pre-training vs. fine-tuning

Most available LLMs are general purpose and lack domain-specific abilities. Domain-specific LLMs have shown strong performance in the medical, finance, and scientific domains. For an LLM to acquire domain-specific knowledge, there are four methods: training from scratch, continual pre-training, instruction fine-tuning on domain tasks, and Retrieval Augmented Generation (RAG).

In traditional models, fine-tuning is usually used to create task-specific models for a domain. This means maintaining multiple models for multiple tasks like entity extraction, intent classification, sentiment analysis, or question answering. With the advent of LLMs, the need to maintain separate models has been eliminated by techniques like in-context learning and prompting. This saves the effort required to maintain a stack of models for related but distinct tasks.

Intuitively, you can train LLMs from scratch with domain-specific data. Although most of the work to create domain LLMs has focused on training from scratch, it is prohibitively expensive. For example, the GPT-4 model costs over $100 million to train. Such domain LLMs are typically trained on a mix of open domain data and domain data. Continual pre-training can help models acquire domain-specific knowledge without incurring the cost of pre-training from scratch, because you pre-train an existing open domain LLM on only the domain data.
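
To make the idea concrete, here is a minimal sketch of continual pre-training using the Hugging Face Trainer; the checkpoint, data file, and hyperparameters are placeholders rather than the configuration used by the authors:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-1b"  # existing open domain LLM (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# domain_corpus.txt is a placeholder for the curated domain text
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continual-pretraining",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    # Causal language modeling objective (mlm=False), same as pre-training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()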

With instruction fine-tuning on a task, you can’t make the model acquire domain knowledge because the LLM only acquires the domain information contained in the instruction fine-tuning dataset. Unless a very large dataset for instruction fine-tuning is used, it is not enough to acquire domain knowledge. Sourcing high-quality instruction datasets is usually challenging and is the reason to use LLMs in the first place. Also, instruction fine-tuning on one task can affect performance on other tasks (as seen in this paper). However, instruction fine-tuning is more cost-effective than either of the pre-training alternatives.

The following figure compares traditional task-specific fine-tuning vs. the in-context learning paradigm with LLMs.

RAG is the most effective way of guiding an LLM to generate responses grounded in a domain. Although it can guide a model to generate responses by providing facts from the domain as auxiliary information, it doesn’t acquire the domain-specific language because the LLM is still relying on non-domain language style to generate the responses.

Continual pre-training is a middle ground between pre-training and instruction fine-tuning in terms of cost while being a strong alternative for gaining domain-specific knowledge and style. It can provide a general model over which further instruction fine-tuning on limited instruction data can be performed. Continual pre-training can be a cost-effective strategy for specialized domains where the set of downstream tasks is large or unknown and labeled instruction tuning data is limited. In other scenarios, instruction fine-tuning or RAG might be more suitable.

To learn more about fine-tuning, RAG, and model training, refer to Fine-tune a foundation model, Retrieval Augmented Generation (RAG), and Train a Model with Amazon SageMaker, respectively. For this post, we focus on efficient continual pre-training.

Methodology of efficient continual pre-training

Continual pre-training encompasses the following approaches:

  • Domain-Adaptive Continual Pre-training (DACP) – In the paper Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors continually pre-train the Pythia language model suite on the financial corpus to adapt it to the finance domain. The objective is to create financial LLMs by feeding data from the whole financial domain into an open-sourced model. Because the training corpus contains all the curated datasets in the domain, the resultant model should acquire finance-specific knowledge, thereby becoming a versatile model for various financial tasks. This results in FinPythia models.
  • Task-Adaptive Continual Pre-training (TACP) – The authors pre-train the models further on labeled and unlabeled task data to tailor them for specific tasks. In certain circumstances, developers may prefer models delivering better performance on a group of in-domain tasks rather than a domain-generic model. TACP is designed as continual pre-training aiming to enhance performance on targeted tasks, without requirements for labeled data. Specifically, the authors continually pre-train the open sourced models on the task tokens (without labels). The primary limitation of TACP lies in constructing task-specific LLMs instead of foundation LLMs, owing to the sole use of unlabeled task data for training. Although DACP uses a much larger corpus, it is prohibitively expensive. To balance these limitations, the authors propose two approaches that aim to build domain-specific foundation LLMs while preserving superior performance on target tasks:
  • Efficient Task-Similar DACP (ETS-DACP) – The authors propose selecting a subset of the financial corpus that is highly similar to the task data using embedding similarity. This subset is used for continual pre-training to make it more efficient. Specifically, the authors continually pre-train the open sourced LLM on a small corpus extracted from the financial corpus that is close to the target tasks in distribution. This can help improve task performance because the model is adapted to the distribution of the task tokens even though labeled data is not required.
  • Efficient Task-Agnostic DACP (ETA-DACP) – The authors propose using metrics like perplexity and token type entropy that don’t require task data to select samples from financial corpus for efficient continual pre-training. This approach is designed to deal with scenarios where task data is unavailable or more versatile domain models for the broader domain are preferred. The authors adopt two dimensions to select data samples that are important for obtaining domain information from a subset of pre-training domain data: novelty and diversity. Novelty, measured by the perplexity recorded by the target model, refers to the information that was unseen by the LLM before. Data with high novelty indicates novel knowledge for the LLM, and such data is viewed as more difficult to learn. This updates generic LLMs with intensive domain knowledge during continual pre-training. Diversity, on the other hand, captures the diversity of distributions of token types in the domain corpus, which has been documented as a useful feature in the research of curriculum learning on language modeling.

The following figure compares an example of ETS-DACP (left) vs. ETA-DACP (right).

The authors adopt two sampling schemes to actively select data points from the curated financial corpus: hard sampling and soft sampling. The former is done by first ranking the financial corpus by the corresponding metrics and then selecting the top-k samples, where k is predetermined according to the training budget. For the latter, the authors assign a sampling weight to each data point according to its metric value, and then randomly sample k data points to meet the training budget.
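
The following sketch illustrates both schemes with NumPy; the metric array and budget are made up for illustration:

import numpy as np

rng = np.random.default_rng(seed=0)
metric = rng.random(100_000)  # placeholder per-sample scores (e.g., perplexity)
k = 10_000                    # training budget, e.g., 10% of the corpus

# Hard sampling: rank the corpus by the metric and keep the top-k samples
hard_indices = np.argsort(metric)[::-1][:k]

# Soft sampling: convert metric values to sampling weights and draw k
# samples at random without replacement
weights = metric / metric.sum()
soft_indices = rng.choice(len(metric), size=k, replace=False, p=weights)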

Result and analysis

The authors evaluate the resulting financial LLMs on an array of financial tasks to investigate the efficacy of continual pre-training:

  • Financial Phrase Bank – A sentiment classification task on financial news.
  • FiQA SA – An aspect-based sentiment classification task based on financial news and headlines.
  • Headline – A binary classification task on whether a headline on a financial entity contains certain information.
  • NER – A financial named entity extraction task based on the credit risk assessment sections of SEC reports. Words in this task are annotated with PER, LOC, ORG, and MISC.

Because the financial LLMs are not instruction fine-tuned, the authors evaluate the models in a 5-shot setting for each task for the sake of robustness. On average, FinPythia 6.9B outperforms Pythia 6.9B by 10% across the four tasks, which demonstrates the efficacy of domain-specific continual pre-training. For the 1B model, the improvement is less pronounced, but performance still improves by 2% on average.

The following figure illustrates the performance difference before and after DACP on both models.

The following figure showcases two qualitative examples generated by Pythia 6.9B and FinPythia 6.9B. For two finance-related questions regarding an investor manager and a financial term, Pythia 6.9B doesn’t understand the term or recognize the name, whereas FinPythia 6.9B generates detailed answers correctly. The qualitative examples demonstrate that continual pre-training enables the LLMs to acquire domain knowledge during the process.

The following table compares various efficient continual pre-training approaches. ETA-DACP-ppl is ETA-DACP based on perplexity (novelty), and ETA-DACP-ent is based on entropy (diversity). ETS-DACP-com is similar to DACP with data selection by averaging all three metrics. The following are a few takeaways from the results:

  • Data selection methods are efficient – They surpass standard continual pre-training with just 10% of the training data. Efficient continual pre-training, including Task-Similar DACP (ETS-DACP), Task-Agnostic DACP based on entropy (ETA-DACP-ent), and Task-Similar DACP based on all three metrics (ETS-DACP-com), outperforms standard DACP on average despite being trained on only 10% of the financial corpus.
  • Task-aware data selection works best, in line with research on small language models – ETS-DACP records the best average performance among all the methods, and the variant based on all three metrics (ETS-DACP-com) records the second-best task performance. This suggests that using unlabeled task data is still an effective approach to boost task performance in the case of LLMs.
  • Task-agnostic data selection is a close second – ETA-DACP-ent follows the performance of the task-aware data selection approach, implying that we could still boost task performance by actively selecting high-quality samples not tied to specific tasks. This paves the way to build financial LLMs for the whole domain while achieving superior task performance.

One critical question regarding continual pre-training is whether it negatively affects performance on non-domain tasks. The authors also evaluate the continually pre-trained model on four widely used generic benchmarks: ARC, MMLU, TruthfulQA, and HellaSwag, which measure question answering, reasoning, and completion abilities. The authors find that continual pre-training does not adversely affect non-domain performance. For more details, refer to Efficient Continual Pre-training for Building Domain Specific Large Language Models.

Conclusion

This post offered insights into data collection and continual pre-training strategies for training LLMs for the financial domain. You can start training your own LLMs for financial tasks using Amazon SageMaker Training or Amazon Bedrock today.


About the Authors

Yong Xie is an applied scientist in Amazon FinTech. He focuses on developing large language models and Generative AI applications for finance.

Karan Aggarwal is a Senior Applied Scientist with Amazon FinTech with a focus on Generative AI for finance use-cases. Karan has extensive experience in time-series analysis and NLP, with particular interest in learning from limited labeled data.

Aitzaz Ahmad is an Applied Science Manager at Amazon where he leads a team of scientists building various applications of Machine Learning and Generative AI in Finance. His research interests are in NLP, Generative AI, and LLM Agents. He received his PhD in Electrical Engineering from Texas A&M University.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in financial service build machine learning solutions on AWS.

Raghvender Arni leads the Customer Acceleration Team (CAT) within AWS Industries. The CAT is a global cross-functional team of customer-facing cloud architects, software engineers, data scientists, and AI/ML experts and designers that drives innovation via advanced prototyping and drives cloud operational excellence via specialized technical expertise.

Read More

Greater Scope: Doctors Get Inside Look at Gut Health With AI-Powered Endoscopy

Greater Scope: Doctors Get Inside Look at Gut Health With AI-Powered Endoscopy

From humble beginnings as a university spinoff to an acquisition by the leading global medtech company in its field, Odin Vision has been on an accelerated journey since its founding less than five years ago.

An alum of the NVIDIA Inception program for cutting-edge startups, Odin Vision builds cloud-connected AI software that helps clinicians detect and characterize areas of concern during endoscopy, a procedure where a tiny camera mounted on a tube is inserted into the gastrointestinal tract.

Network-connected devices in the endoscopy room capture and stream real-time video data to the cloud, where powerful NVIDIA GPUs run AI inference. The models’ results are then streamed back to the endoscopy room so that clinicians can see the AI insights overlaid on the live video feed with minimal latency.

The startup was acquired in 2022 by Japanese medtech leader Olympus, which has a 70% global market share in gastrointestinal endoscopic equipment.

“We believe the acquisition brings us much closer to achieving our vision to revolutionize endoscopy through AI and cloud technology,” said Daniel Toth, cofounder and chief technology officer of Odin Vision. “Our software can reach Olympus’ global customer base, enabling us to bring our solutions to as many patients as possible.”

Olympus is also collaborating with NVIDIA on Olympus Office Hours, an advisory program that connects Inception startups with the medical device company’s experts, who will offer deep industry expertise and guidance to help the startups build AI solutions in key areas including gastroenterology, urology and surgery.

Eight leading AI startups have joined the inaugural cohort of the program — which is part of the NVIDIA Inception Alliance for Healthcare, an initiative that brings together medical AI startups with NVIDIA and its healthcare industry partners — to help accelerate their product development and go-to-market goals.

An Extra Set of AIs for Clinicians

Around a quarter of precancerous polyps are missed during colonoscopies, a kind of endoscopy procedure that examines the lower digestive tract.

While some are missed because the endoscope doesn’t capture video footage of every angle, others remain undetected by clinicians. That’s where AI can help provide a second set of eyes to support clinical decision-making.

Seamless AI integration into the video feeds that medical professionals view during an endoscopy provides an extra data source that can help doctors detect and remove polyps sooner, helping prevent cancer development.

“Polyps develop slowly, and can take five or 10 years to appear as cancer,” Toth said. “If a clinician can detect and remove them in time, it can help save lives.”

CADDIE, the company’s AI software for detecting and classifying polyps, has received the CE Mark of regulatory approval in Europe and is deployed across hospitals in the U.K., Spain, Germany, Poland and Italy — with plans for use in the U.S. as well.

Odin Vision also has AI software that has received the CE Mark to assist gastroscopy, where doctors inspect the esophagus for signs of throat cancer.

Accelerated Inference for Real-Time Insights

Odin Vision began as a research project by two professors and a Ph.D. student at University College London who were developing AI techniques for polyp detection. In 2019, they teamed with Toth and Odin’s CEO, Peter Mountney, both from Siemens Healthineers, to commercialize their work.

“NVIDIA GPUs were part of our work from the start — they’ve been essential to train our AI models and were part of our first product prototypes for inference, too,” Toth said. “Since moving to a cloud-based deployment, we’ve begun using the NVIDIA Triton Inference Server for dynamic processing in the cloud.”

The team uses NVIDIA Tensor Core GPUs for accelerated inference — most recently transitioning to NVIDIA L4 GPUs. Adopting NVIDIA Triton Inference Server software and the NVIDIA TensorRT software development kit enabled them to meet the low-latency threshold needed for real-time video-processing AI applications.

In addition to supporting doctors during specific procedures, Odin Vision plans to develop generative AI models that can automate a first draft of the clinical notes doctors prepare afterward — as well as models that can aggregate data across procedures. These would allow endoscopy teams to review analytics and assess how well a procedure is performed compared to clinical guidelines.

“Once you get to a point where there are dozens of AI models tracking different elements of these procedures, we can see if a healthcare professional is inspecting a particular area of the digestive tract for only three minutes, when it’s supposed to take six minutes,” Toth said. “The system can provide a nudge to remind the clinician to follow the guidelines.”

Cloud-Connected Cancer Screening

Membership in NVIDIA Inception provided the Odin Vision team access to technical expertise from NVIDIA and cloud credits through leading cloud service providers.

“Cloud credits helped us massively speed up our technology development and deployment, enabling us to release our products to market months earlier than initially planned,” Toth said. “NVIDIA experts also validated our product concept from a technology perspective and provided consultation about GPU and accelerated software optimizations.”

The team found that a cloud-based solution made it easier to push software updates over the air to deployments across hospital customers.

“Some AI companies are sending boxes that need to sit in clinical sites and require regular maintenance, which can prevent normal clinical workflows from running smoothly,” Toth said. “With network-connected devices, we can instead update a single server and the changes reach all end users at the same time.”

Learn more about NVIDIA Inception and subscribe to NVIDIA healthcare news.

Read More

Get Cozy With ‘Palia’ on GeForce NOW

Get Cozy With ‘Palia’ on GeForce NOW

Ease into spring with the warm, cozy vibes of Palia, coming to the cloud this GFN Thursday.

It’s part of six new titles joining the GeForce NOW library of over 1,800 games.

Welcome Home

Palia on GeForce NOW
Better together.

Escape to a cozy world with Palia, a free-to-play massively multiplayer online game from Singularity 6 Corporation. The game, which has made its way onto more than 200,000 wishlists on Steam, has launched in the cloud this week.

Farm, fish, craft and explore with friendly villagers across a stunning variety of different biomes — from sprawling flower fields to hilly forests and rocky beaches — in the world of Palia. Inhabit the land, furnish a dream home, unravel ancient mysteries and interact with a vibrant online community.

Get ready for a captivating adventure across devices by streaming Palia from the cloud. GeForce NOW Ultimate and Priority members get faster access to servers and longer gaming sessions than Free members.

Time to Play

Millennia on GeForce NOW
10,000 years of history, all in the cloud.

Shape the course of history across 10,000 years in Millennia from Paradox Interactive. GeForce NOW members can customize their own nations, explore unique combinations of traits and adapt to alternative histories in this captivating journey.

In addition, members can look for the following:

  • Palia (New release on Steam, March 25)
  • Bulwark: Falconeer Chronicles (New release on Steam, March 26)
  • Millennia (New release on Steam, March 26)
  • Outpost: Infinity Siege (New release on Steam, March 26)
  • SOUTH PARK: SNOW DAY! (New release on Steam, March 26)
  • Tchia (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad

AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad

photo of Ida Momennejad for the AI Frontiers Microsoft Research Podcast series

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come. 

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity. 

This episode features Principal Researcher Ida Momennejad. Momennejad is applying her expertise in cognitive neuroscience and computer science to better understand—and extend—AI capabilities, particularly when it comes to multistep reasoning and short- and long-term planning. Llorens and Momennejad discuss the notion of general intelligence in both humans and machines; how Momennejad and colleagues leveraged prior research into the cognition of people and rats to create prompts for evaluating large language models; and the case for the development of a “prefrontal cortex” for AI.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. In this podcast series, I share conversations with fellow researchers about the latest developments in AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Ida Momennejad. Ida works at Microsoft Research in New York City at the intersection of machine learning and human cognition and behavior. Her current work focuses on building and evaluating multi-agent AI architectures, drawing from her background in both computer science and cognitive neuroscience. Over the past decade, she has focused on studying how humans and AI agents build and use models of their environment.


[MUSIC FADES]

Let’s dive right in. We are undergoing a paradigm shift where AI models and systems are starting to exhibit characteristics that I and, of course, many others have described as more general intelligence. When I say general in this context, I think I mean systems with abilities like reasoning and problem-solving that can be applied to many different tasks, even tasks they were not explicitly trained to perform. Despite all of this, I think it’s also important to admit that we—and by we here, I mean humanity—are not very good at measuring general intelligence, especially in machines. So I’m excited to dig further into this topic with you today, especially given your background and insights into both human and machine intelligence. And so I just want to start here: for you, Ida, what is general intelligence?

IDA MOMENNEJAD: Thank you for asking that. We could look at general intelligence from the perspective of history of cognitive science and neuroscience. And in doing so, I’d like to mention its discontents, as well. There was a time where general intelligence was introduced as the idea of a kind of intelligence that was separate from what you knew or the knowledge that you had on a particular topic. It was this general capacity to acquire different types of knowledge and reason over different things. And this was at some point known as g, and it’s still known as g. There have been many different kinds of critiques of this concept because some people said that it’s very much focused on the idea of logic and a particular kind of reasoning. Some people made cultural critiques of it. They said it’s very Western oriented. Others said it’s very individualistic. It doesn’t consider collective or interpersonal intelligence or physical intelligence. There are many critiques of it. But at the core of it, there might be something useful and helpful. And I think the useful part is that there could be some general ability in humans, at least the way that g was intended initially, where they can learn many different things and reason over many different domains, and they can transfer ability to reason over a particular domain to another. And then in the AGI, or artificial general intelligence, notion of it, people took this idea of many different abilities or skills for cognitive and reasoning and logic problem-solving at once. There have been different iterations of what this means in different times. In principle, the concept in itself does not provide the criteria on its own. Different people at different times provide different criteria for what would be the artificial general intelligence notion. Some people say that they have achieved it. Some people say we are on the brink of achieving it. Some people say we will never achieve it. However, there is this idea, if you look at it from an evolutionary and neuroscience and cognitive neuroscience lens, that in evolution, intelligence has evolved multiple times in a way that is adaptive to the environment. So there were organisms that needed to be adaptive to the environment where they were, that intelligence has evolved in multiple different species, so there’s not one solution to it, and it depends on the ecological niche that that particular species needed to adapt to and survive in. And it’s very much related to the idea of being adaptive of certain kinds of, different kinds of problem-solving that are specific to that particular ecology. There is also this other idea that there is no free lunch and the no-free-lunch theorem, that you cannot have one particular machine learning solution that can solve everything. So the idea of general artificial intelligence in terms of an approach that can solve everything and there is one end-to-end training that can be useful to solve every possible problem that it has never seen before seems a little bit untenable to me, at least at this point. 
What does seem tenable to me in terms of general intelligence is if we understand and study, the same way that we can do it in nature, the foundational components of reasoning, of intelligence, of different particular types of intelligence, of different particular skills—whether it has to do with cultural accumulation of written reasoning and intelligence skills, whether it has to do with logic, whether it has to do with planning—and then working on the particular types of artificial agents that are capable of putting these particular foundational building blocks together in order to solve problems they’ve never seen before. A little bit like putting Lego pieces together. So to wrap it up, to sum up what I just said, the idea of general intelligence had a more limited meaning in cognitive science, referring to human ability to have multiple different types of skills for problem-solving and reasoning. Later on, it was also, of course, criticized in terms of the specificity of it and ignoring different kinds of intelligence. In AI, this notion has been having many different kinds of meanings. If we just mean it’s a kind of a toolbox of general kinds of intelligence for something that can be akin to an assistant to a human, that could make sense. But if we go too far and use it in the kind of absolute notion of general intelligence, as it has to encompass all kinds of intelligence possible, that might be untenable. And also perhaps we shouldn’t think about it in terms of a lump of one end-to-end system that can get all of it down. Perhaps we can think about it in terms of understanding the different components that we have also seen emerge in evolution in different species. Some of them are robust across many different species. Some of them are more specific to some species with a specific ecological niche or specific problems to solve. But I think perhaps it could be more helpful to find those cognitive and other interpersonal, cultural, different notions of intelligence; break them down into their foundational building blocks; and then see how a particular artificial intelligence agent can bring together different skills from this kind of a library of intelligence skills in order to solve problems it’s never seen before.

LLORENS: There are two concepts that jump out at me based on what you said. One is artificial general intelligence and the other is humanlike intelligence or human-level intelligence. And you’ve referenced the fact that, you know, oftentimes, we equate the two or at least it’s not clear sometimes how the two relate to each other. Certainly, human intelligence has been an important inspiration for what we’ve done—a lot of what we’ve done—in AI and, in many cases, a kind of evaluation target in terms of how we measure progress or performance. But I wonder if we could just back up a minute. Artificial general intelligence and humanlike, human-level intelligence—how do these two concepts relate to you?

MOMENNEJAD: Great question. I like that you asked to me because I think it would be different for different people. I’ve written about this, in fact. I think humanlike intelligence or human-level intelligence would require performance that is similar to humans, at least behaviorally, not just in terms of what the agent gets right, but also in terms of the kinds of mistakes and biases that the agent might have. It should look like human intelligence. For instance, humans show primacy bias, recency bias, variety of biases. And this seems like it’s unhelpful in a lot of situations. But in some situations, it helps to come with fast and frugal solutions on the go. It helps to summarize certain things or make inferences really fast that can help in human intelligence, for instance. There is analogical reasoning. That is, there are different types of intelligence that humans do. Now, if you look at what are tasks that are difficult and what are tasks that are easier for humans and compare that to a, for instance, let’s say just a large language model like GPT-4, you will see whether they find similar things simple and similar things difficult or not. When they don’t find similar things easy or difficult, I think that we should not say that this is humanlike per se, unless we mean for a specific task. Perhaps on specific sets of tasks, an agent can be, can have human-level or humanlike intelligent behavior; however, if we look overall, as long as there are particular skills that are more or less difficult for one or the other, it might be not reasonable to compare them. That being said, there are many things that some AI agent and even a [programming] language would be better [than] humans at. Does that mean that they are generally more intelligent? No, it doesn’t because there are also many things that humans are far better than AI at. The second component of this is the mechanisms by which humans do the intelligent things that we do. We are very energy efficient. With very little amount of energy consumption, we can solve very complicated problems. If you put some of us next to each other or at least give a pen and paper to one of us, this can be even a lot more effective; however, the amount of energy consumption that it takes in order for any machine to solve similar problems is a lot higher. So another difference between humanlike intelligence or biologically inspired intelligence and the kind of intelligence that is in silico is efficiency, energy efficiency in general. And finally, the amount of data that goes into current state-of-[the-art] AI versus perhaps the amount of data that a human might need to learn new tasks or acquire new skills seem to be also different. So it seems like there are a number of different approaches to comparing human and machine intelligence and deriving what are the criteria for a machine intelligence to be more humanlike. But other than the conceptual aspect of it, it’s not clear that we necessarily want something that’s entirely humanlike. Perhaps we want in some tasks and in some particular use cases for the agent to be humanlike but not in everything.

LLORENS: You mentioned some of the ways in which human intelligence is inferior or has weaknesses. You mentioned some of the weaknesses of human intelligence, like recency bias. What are some of the weaknesses of artificial intelligence, especially frontier systems today? You’ve recently published some works that have gotten into new paradigms for evaluation, and you’ve explored some of these weaknesses. And so can you tell us more about that work and about your view on this?

MOMENNEJAD: Certainly. So inspired by a very long-standing tradition of evaluating cognitive capacities in humans and animals—those Lego pieces that bring together intelligence that I was mentioning—I have conducted a number of experiments, first in humans, and built reinforcement learning models over more than a decade on the idea of multistep reasoning and planning. It is in the general domain of reasoning, planning, and decision making. And I particularly focused on what kind of memory representations allow brains, and reinforcement learning models inspired by human brain and behavior, to predict the future, plan the future, and reason over the past and the future seamlessly using the same representations. Inspired by the same research, which goes back in tradition to Edward Tolman’s idea of cognitive maps and latent learning in the early 20th century, culminating in his very influential 1948 paper, “Cognitive maps in rats and men,” I sat down with a couple of colleagues last year—exactly this time, probably—and we worked on figuring out whether we could devise similar experiments to test cognitive maps and planning and multistep reasoning abilities in large language models. So I took some of the experiments that I had conducted in humans and some of the experiments that Edward Tolman had done on the topic in rodents and turned them into prompts for ChatGPT. That’s where I started, with GPT-4. The reason I did that was that I wanted to make sure that I would create some prompts that had not been in the training set. My experiments, although the papers have been published, the stimuli of the experiments were not linguistic. They were visual sequences that the human would see, and they would have to do some reinforcement learning and learn from the sequences to make inferences about relationships between different states and find the path that would give them optimal rewards. Very simple human reinforcement learning paradigms, however, with different kinds of structures. The inspiration I had drawn from the cognitive maps work by Edward Tolman and others was the idea that in order for a creature, whether it’s a rodent, a human, or a machine, to be able to reason in [multiple] steps, plan, and have cognitive maps, which are simply representations of the relational structure of the environment, the creature needs to be sensitive and adaptive to local changes in the environment. So I designed the, sort of, initial prompts and recruited a number of very smart and generous-with-their-time colleagues, and we sat together and created these prompts in different domains. For instance, we also created social prompts. We also created the same kind of graph structures but for reasoning over social structures. For instance, I say, Ashley’s friends with Matt. Matt is friends with Michael. If I want to pass a message to Michael, what is the path that I can choose? Which would be, I have to tell Ashley. Ashley will tell Matt. Matt will tell Michael. This is very similar to another paradigm that was more like a maze, which would be similar to saying, there is a castle; it has 16 rooms. You enter Room 1. You open the door. It opens to Room 2. In Room 2, you open the door, and so on and so forth.
So you describe, using language, the structure of a social environment or the structure of a spatial environment, and then you ask the LLM certain questions that have to do with getting from A to B in this social or spatial environment. Or you say, oh, you know, Matt and Michael don’t talk to each other anymore. So now in order to pass a message, what should I do? So I need to find a detour. Or, for instance, I say, you know, Ashley has become close to Michael now. So now I have a shortcut, so I can directly give the message to Ashley, and Ashley can directly give the message to Michael. My path to Michael is shorter now. So finding things like detours, shortcuts, or a changed reward location, these are the kinds of changes that, inspired by my own past work and by the work of Tolman and others, we implemented in all of our experiments. This led to 15 different tasks for every single graph, and we have six graphs total of different complexity levels with different graph theoretic features, and [for] each of them, we had three domains. We had a spatial domain with rooms that had orders like Room 1, Room 2, Room 3; a spatial domain where there was no number, no ordinal order to the rooms; and a social environment with the names of different people, so the reasoning was over social, sort of, spaces. So you can see this is a very large number of tasks. It’s 6 times 15 times 3, and each of the prompts we ran 30 times at different temperatures. Three temperatures: 0, 0.5, and 1. And for those who are not familiar with this, the temperature of a large language model determines how random it will be, or how much it will stick to the first or the best option that comes to it at the last layer. And so for some problems, where maybe the first obvious answer it finds is not good, increasing the temperature could help, whereas for a problem that needs precision, increasing the temperature would make it worse. So based on these ideas, we also tried different temperatures. And we tested eight different language models like this in order to systematically evaluate their ability for this multistep reasoning and planning, and the framework that we use—we call it CogEval—is not just for reasoning and multistep planning. Other tasks can be tested in this framework, as well. The first step of it is always to operationalize the cognitive capacity in terms of many different tasks, like I just mentioned. Then the second step is designing the specific experiments with different domains, like spatial and social; with different structures, like the graphs that I told you about; and with different kinds of repetitions and different tasks, like the detour, shortcut, reward revaluation, transition revaluation, and just traversal, all the different tasks that I mentioned. And then the third step is to generate many prompts and then test them with many repetitions using different temperatures. Why is that? I think something that Sam Altman had said is relevant here, which is that sometimes with some problems, you ask GPT-4 a hundred times, and one out of those hundred times, it would give the correct answer. Sometimes 30 out of a hundred. You obviously want it to give the correct answer a hundred out of a hundred times. But we didn’t want to rely on just one try and miss the opportunity to see whether it could give the answer if you probed it again[1].
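
To make that evaluation sweep and scoring concrete, here is a minimal sketch in Python, assuming illustrative graph, task, and domain names rather than the actual CogEval materials. It enumerates a placeholder prompt grid and scores a proposed path against a ground-truth graph with a breadth-first search, separating optimal answers, valid-but-longer answers, and invalid ones (for example, answers that use a hallucinated edge).

```python
from collections import deque
from itertools import product

# Ground-truth adjacency for one illustrative "social" graph:
# Ashley -- Matt -- Michael (the message-passing example from the interview).
GRAPH = {
    "Ashley": {"Matt"},
    "Matt": {"Ashley", "Michael"},
    "Michael": {"Matt"},
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path or None if unreachable."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - visited:
            visited.add(nxt)
            frontier.append(path + [nxt])
    return None

def is_valid_path(graph, path):
    """Every step must use an edge that actually exists (no hallucinated doors)."""
    return all(b in graph[a] for a, b in zip(path, path[1:]))

def score(graph, proposed, start, goal):
    """Separate invalid answers from valid-but-longer and optimal ones."""
    optimal = shortest_path(graph, start, goal)
    if (not proposed or proposed[0] != start or proposed[-1] != goal
            or not is_valid_path(graph, proposed)):
        return "invalid"      # hallucinated edge, wrong endpoints, or gibberish
    return "optimal" if len(proposed) == len(optimal) else "satisficing"

# Placeholder prompt grid: 6 graphs x 15 tasks x 3 domains, each prompt run
# 30 times at temperatures 0, 0.5, and 1 (only the shape is sketched here).
graphs = [f"graph_{i}" for i in range(1, 7)]
tasks = [f"task_{i}" for i in range(1, 16)]   # e.g., traversal, detour, shortcut, ...
domains = ["spatial_ordered", "spatial_unordered", "social"]
temperatures = [0.0, 0.5, 1.0]
conditions = list(product(graphs, tasks, domains, temperatures))

print(score(GRAPH, ["Ashley", "Matt", "Michael"], "Ashley", "Michael"))  # -> optimal
print(len(conditions) * 30)  # total prompt calls per model in this sketch
```
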
And in all of the eight large language models, we saw that none of them was robust to the graph structure. Meaning, performance got much worse depending on the structure: a tree with only six or seven nodes was much more difficult to solve than a graph that had 15 nodes but a simpler structure that was just two lines. We noted that sometimes, counterintuitively, graph structures that you would think should be easy to solve were more difficult for them. They were also not robust to the task set. The specific task that we tried, whether it was detour, shortcut, reward revaluation, or traversal, mattered. For instance, shortcut and detour were very difficult for all of them. Another thing that we noticed was that all of them, including GPT-4, hallucinated paths that didn’t exist. For instance, there was no door between Room 12 and Room 16. They would hallucinate that there is a door, and they would give a response that includes that door. Another kind of failure mode that we observed was that they would fail to even find a one-step path. Let’s say between Room 7 and Room 8, there is a direct door. We would ask, what is the path from 7 to 8? And they would take a longer path. And a final mode that we observed was that they would sometimes fall into loops. Even though we would directly ask them to find the shortest path, they would sometimes fall into a loop on the way to their destination, which obviously you shouldn’t do if you are trying to find the shortest path. That said, there are two differing notions of accuracy here. You can have satisficing, which means you get there; you just take a longer path. And there is the notion that you cannot get there, because you used some imaginary path or did something that didn’t make sense and you, sort of, gave a nonsensical response. We had both of those kinds of issues, so we had a lot of issues with giving nonsensical answers, repeating the question that we were asking, producing gibberish. So there were numerous kinds of challenges. What we did observe was that GPT-4 was far better than the other LLMs in this regard, at least at the time that we tested it; however, this is obviously on the basis of the particular kinds of tasks that we tried. In another study, we tried Tower of Hanoi, which is also a classic cognitive science approach to [testing] planning abilities and hierarchical planning abilities. And we found that GPT-4 scores between zero and 10 percent on the three-disk problem and zero percent on the four-disk problem. And that is when we started to think about having more brain-inspired solutions to improve that approach. But I’m going to leave that for next.

LLORENS: So it sounds like a very extensive set of experiments across many different tasks and with many different leading AI models, and you’ve uncovered a lack of robustness across some of these different tasks. One curiosity that I have here is how would you assess the relative difficulty of these particular tasks for human beings? Would all of these be relatively easy for a person to do or not so much?

MOMENNEJAD: Great question. So I have conducted some of these experiments already and have published them before. Humans do not perform symmetrically on all these tasks, for sure; however, Tower of Hanoi, for instance, is a problem that we know humans can solve. People might have seen this. It’s three little rods that are … usually, it’s a wooden structure, so you have a physical version of it, or you can have a virtual version of it, and there are different disks with different colors and sizes. There are some rules. You cannot put certain disks on top of others. So there is a particular order in which you can stack the disks. Usually what happens is that all the disks start on one side—and when I say a three-disk problem, it means you have three total disks. And there is usually a target solution that you are shown, and you’re told to get there in a particular number of moves or in a minimum number of moves without violating the rules. So in this case, the rules would be that you wouldn’t put certain disks on top of others. And based on that, you’re expected to solve the problem. And the performance of GPT-4 on Tower of Hanoi three disks is between 0 and 10 percent, and on Tower of Hanoi four disks it is zero percent—zero shot. With help, it can get better. With some support, it gets better. So in this regard, it seems like Tower of Hanoi is extremely difficult for GPT-4. It doesn’t seem as difficult for humans as it is for GPT-4. For some reason, it couldn’t even improve itself when we explained the problem even further to it and explained what it did wrong. Sometimes—if people want to try it out, they should—sometimes, it would argue back and say, “No, you’re wrong. I did this right.” Which was a very interesting moment for us with ChatGPT. That was the experience that we had trying it out first without giving it, sort of, more support than that, but I can tell you what we did next; I want to make sure, though, that we cover your other questions. But just to wrap this part up: inspired by tasks that have been used for the evaluation of cognitive capacities such as multistep reasoning and planning in humans, it is possible to evaluate cognitive capacities and skills such as multistep reasoning and planning in large language models, too. And I think that’s the takeaway from this particular study and from this general cognitive science–inspired approach. And I would like to say also that it is not just human tasks that are useful. Tolman’s tasks were done in rodents. A lot of people have done experiments in fruit flies, in C. elegans and other worms, in various kinds of other species that are very relevant to testing, as well. So I think there is a general possibility of testing and evaluating particular intelligence skills, inspired by experiments and evaluation methods for humans and other biological species.
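
For readers who want to see exactly what an optimal Tower of Hanoi solution looks like, here is a small, self-contained Python sketch (not from the paper) that generates the minimum-move solution for n disks and checks a proposed move list against the rules; for three disks the optimum is 7 moves, which is the kind of ground truth a model’s answer can be scored against.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

def is_legal(moves, n):
    """Check a proposed move list: only top disks move, and a larger disk
    never lands on a smaller one; all disks must end on the target peg."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n is the largest
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))

moves = hanoi_moves(3)
print(len(moves), is_legal(moves, 3))  # -> 7 True
```
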

LLORENS: Let’s explore the way forward for AI from your perspective. You know, as you’ve described your recent works, it’s clear that your work is deeply informed by insights from cognitive science and neuroscience, and your recent works have called, for example, for the development of a prefrontal cortex for AI, which I understand to be the part of the brain that facilitates executive function. How does this relate to extending the capabilities of AI, a prefrontal cortex for AI?

MOMENNEJAD: Thank you for that question. So let me start by reiterating something I said earlier, which is that the brain didn’t evolve in a lump. There were different components of brains and nervous systems and neurons that evolved at different evolutionary scales. There are some parts of the brain that appear in many different species, so they’re robust across many species. And there are some parts of the brain that appear in species that had some particular needs, some particular problems they were facing, or some ecological niche. What is in common in many of them, however, is that there seems to be some kind of a modular or multicomponent aspect to what we call higher cognitive function or executive function. And so the kinds of animals that we ascribe some form of executive function to seem to have brains with parts or modules that do different things. It doesn’t mean that they only do that; it’s not a very extreme Fodorian view of modularity. But it is the view that, broadly speaking, when, for instance, we observe patients who have damage to a particular part of their prefrontal cortex, it could be that they perform the same on an IQ test, but they have problems holding onto their relationships or their jobs. So there are different parts of the brain where selective damage, because of accidents or coma or such, seems to impair specific cognitive capacities. This is what very much inspired me. I have been investigating the prefrontal cortex for, I guess, 17 years now, [LAUGHS] which is a scary number to say. Basically since I started my PhD, and even during my master’s thesis, I have been focused on the role of the prefrontal cortex in our ability for long-term reasoning and planning, not just in this moment: long-term, open-ended reasoning and planning. Inspired by this work, I thought, OK, if I want to improve GPT-4’s performance on, let’s say, Tower of Hanoi, can we get inspired by the multiple roles that different parts of the brain play in executive function, specifically different parts of the neocortex, and specifically different parts of the prefrontal cortex, part of the neocortex, in humans? Can we get inspired by some of these main roles that I have studied before and ask GPT-4 to play the role of those different parts, solving different parts of the multistep planning and reasoning problem using these roles and particular rules for how to iterate over them? For instance, there is a part of the brain called the anterior cingulate cortex. Among other things, it seems to be involved in monitoring for errors and signaling when there is a need to exercise more control or move from what people like to call a faster way of thinking to a slower way of thinking to solve a particular problem. So let’s call this the cognitive function of this part; let’s call it the monitor. This is a part of the brain that monitors for when there is a need to exercise more control or change something because there is an error, maybe. There is another part of the brain in the frontal lobe, for instance, the dorsolateral prefrontal cortex; that one is involved in working memory and coming up with, like, simpler plans to execute. Then there is the ventromedial prefrontal cortex, which is involved in the value of states and predicting what the next state is and integrating that with information from other parts of the brain to figure out the value.
So you put all of these things together, and you can basically write different algorithms that have these different components talking to each other. And in that paper we have also written, in a pseudocode style, the different algorithms, which are basically akin to a tree search, in fact. So these roles are part of the multicomponent or multi-agent realization of a prefrontal cortex–like GPT-4 solution. One part of it would propose a plan. The monitor would say, thanks for that; let me pass it on to the part that is evaluating what the outcome of this is and what its value is, and get back to you. It evaluates there and comes back and says, you know, this is not a good plan; give me another one. Sometimes it takes 10 iterations; sometimes it takes 20. This kind of council of different roles comes up with a solution that solves the Tower of Hanoi problem. And we managed to bring the performance from 0 to 10 [percent] in GPT-4 to, I think, about 70 percent in Tower of Hanoi three disks, and in terms of OOD, or out-of-distribution, generalization, without giving any examples of a four-disk problem, it could generalize to above 20 percent on four-disk problems. Another impressive thing that happened here—and we tested it on CogEval and the planning tasks from the other experiment, too—was that it brought all of the, sort of, hallucinations from about 20 to 30 percent—in some cases, much higher percentages—to zero percent. So we had slow thinking; we had 30 iterations, so it took a lot longer. And if this is, you know, fast and slow thinking, this is very slow thinking. However, we had no hallucinations anymore. And a hallucination in Tower of Hanoi would be making a move that is impossible, for instance, putting a disk on top of another when you cannot, because it violates a rule, or taking out a middle disk that you cannot actually pull out. So those would be the kinds of hallucinations in Tower of Hanoi. All of those also went to zero. And so that is one thing that we have done already, which I have been very excited about.
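
As a rough illustration of that iteration loop, and not the actual algorithm from the paper, here is a Python sketch in which `call_gpt4` is a hypothetical wrapper around a chat-completion API and the role prompts are placeholder text. The part taken from the interview is the structure: an actor proposes a step, a predictor/evaluator simulates and values it, a monitor accepts or rejects it, and the whole loop is capped at a fixed number of iterations.

```python
# Hypothetical sketch of the prefrontal cortex-inspired loop described above.
# `call_gpt4(role_prompt, message)` is an assumed wrapper around a chat-completion
# API that returns the model's text reply; it is not a real library call.

ACTOR_PROMPT = "You are given a problem and its rules. Propose the next single move."
EVALUATOR_PROMPT = "Given the current state and a proposed move, predict the next state and its value."
MONITOR_PROMPT = "Given a proposed move and its predicted outcome and value, reply ACCEPT or REJECT."

def solve(problem_state, call_gpt4, max_iterations=30):
    """Iterate actor -> predictor/evaluator -> monitor until the cap is reached."""
    plan = []
    for _ in range(max_iterations):
        move = call_gpt4(ACTOR_PROMPT, f"State: {problem_state}\nPlan so far: {plan}")
        prediction = call_gpt4(EVALUATOR_PROMPT, f"State: {problem_state}\nMove: {move}")
        verdict = call_gpt4(MONITOR_PROMPT, f"Move: {move}\nPrediction: {prediction}")
        if "ACCEPT" in verdict.upper():
            plan.append(move)
            problem_state = prediction        # adopt the predicted next state
            if "GOAL" in prediction.upper():  # assumed goal marker in the evaluator's reply
                break
        # on REJECT, loop again: the actor sees the same state and is asked for another move
    return plan
```
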

LLORENS: So you painted a pretty interesting—fascinating, really—picture of a multi-agent framework where different instances of an advanced model like GPT-4 would be prompted to play the roles of different parts of the brain and, kind of, work together. And so my question is a pragmatic one. How do you prompt GPT-4 to play the role of a specific part of the human brain? What does that prompt look like?

MOMENNEJAD: Great question. Well, we have all of that at the end of our paper, so I can even read some of them if that’s of interest. But the quick response is that you can basically describe the function that you want the LLM—in this case GPT-4—to play. You can write that in simple language. You don’t have to tell it that this is inspired by the brain. It is completely sufficient to just provide certain sets of rules for it to be able to do that.[2] For instance, after you provide the problem, sort of, description … let me see if I can actually read some part of this for you. For instance, you give it a problem, and you say, consider this problem. Rule 1: you can only move a number if it’s at this and that. You clarify the rules. Here are examples. Here are proposed moves. And then you say, for instance, your role is to find whether this particular number generated as a solution is accurate. In order to do that, you can call on this other function, which is the predictor and evaluator, which says, OK, if I do this, what state do I end up in, and what is the value of that state? And you get that information, and then based on that information, you decide whether the proposed move for this problem is a good move or not. If it is, then you pass a message that says, all right, give me the next step of the plan. If it’s not, then you say, OK, this is not a good plan; propose another plan. And then there is the part that plays the role of: hey, here is the problem, here are the rules; find the subgoal and propose the first step towards it, or propose the next step. And that one receives this feedback from the monitor. And the monitor has asked the predictor and evaluator, hey, what happens if I do these things, and what would be the value of that, in order to say, hey, this is not a great idea. So in a way this becomes a very simple prefrontal cortex–inspired multi-agent system. All of them are different calls to GPT-4 but the same instance. Because we were calling it in code, it’s just called multiple times, each time with this kind of very simple in-context learning text that describes: hey, here’s the kind of problem you’re going to see. Here’s the role I want you to play. And here are the other kinds of roles you need to call in order to play your role here. And then it’s up to the LLM to decide how many times it’s going to call which components in order to solve the problem. We don’t decide. We can only decide, hey, cap it at 10 times, for instance, or cap it at 30 iterations, and then see how it performs.
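
To give a flavor of what such a role description can look like in practice, here is an illustrative monitor prompt written as a plain Python string; the wording is hypothetical and paraphrases the structure described above rather than quoting the paper’s appendix.

```python
# Illustrative monitor role prompt; the wording is hypothetical and only mirrors
# the structure described in the interview, not the paper's actual appendix text.
monitor_prompt = """Consider the following problem and its rules.
Rule 1: you may only move a disk that is on top of its stack.
Rule 2: you may never place a larger disk on top of a smaller one.

Your role: given a proposed move, decide whether it is valid and useful.
To decide, you may call the Predictor/Evaluator, which takes the current state
and the proposed move and returns the resulting state and its value.
If the move is good, reply: ACCEPT, and ask for the next step of the plan.
If it is not, reply: REJECT, and ask the Actor to propose a different plan."""
print(monitor_prompt)
```
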

LLORENS: So, Ida, what’s next for you and your research?

MOMENNEJAD: Thank you for that. I have always been interested in understanding minds and making minds, and this has been something that I’ve wanted to do since I was a teenager. And I think that my approaches in cognitive neuroscience have really helped me to understand minds to the extent that is possible. And my understanding of how to make minds comes from the work that I’ve done in AI and computer science since my undergrad. I have learned over the years that you cannot think about the mind in general when you are trying to isolate some components and build them, and my interest is very much in reasoning and multistep planning, especially in complex problems and very long-term problems, and how they relate to memory, how the past and the future relate to one another. And so something that I would be very interested in is making more efficient types of multi-agent brain-inspired AI but also training smaller large language models, perhaps using the process of reasoning in order to improve their reasoning abilities. Because it’s one thing to train on outcomes, and outcomes can be inputs and outputs, and that’s most of the training data that LLMs receive. But it’s an entirely different approach to teach the process and probe them on different parts of the process, as opposed to just the input and output. So I wonder whether, with that kind of an approach, which would require generating a lot of synthetic data that relates to different types of reasoning skills, it’s possible to teach LLMs reasoning skills. And by reasoning skills, I mean very clearly operationalized, very well-researched, specific cognitive constructs that have construct validity, operationalized, similar to the CogEval approach, in terms of many tasks. And something that’s important to me, a part of intelligence that maybe I didn’t highlight enough in the first part, is being able to transfer to tasks that they have never seen before, piecing together different intelligence skills or reasoning skills in order to solve them. Another thing that I have done and will continue to do is collective intelligence. So we talked about multi-agent systems where the agents play the roles of different parts inside one brain. But I’ve also done experiments with multiple humans and how different structures of human communication lead to better memory or problem-solving. Humans, also, we invent things; we innovate things in cultural accumulation, which requires [building] on a lot of … some people do something, I take that outcome, take another outcome, put them together, make something. Someone takes my approach and adds something to it; makes something else. So this kind of cultural accumulation: we have done some work on that with deep reinforcement learning models that share their replay buffer as a way of sharing skill with each other; however, as humans become a lot more accustomed to using LLMs and other generative AI, generative AI will basically start participating in this kind of cultural accumulation. So the notion of collective cognition, collective intelligence, and collective memory will now have to incorporate the idea of generative AI being a part of it.
And so I’m also interested in different approaches to modeling that, understanding that, optimizing that, identifying in what ways it’s better.[3] We have found, both in humans and in deep reinforcement learning agents, for instance, that particular structures of communication that are actually not the most energy-consuming ones, not all-to-all communication but particular partially connected structures, are better for innovation than others. And some other structures might be better for memory, or for collective memory converging.[4] So I think it would be very interesting, the same way that we are looking at what kinds of components talk to each other in one brain to solve certain problems, to think about what kinds of structures or roles can interact with each other, in what shape and at what frequency of communication, in order to solve larger, sort of, cultural accumulation problems.

[MUSIC PLAYS]

LLORENS: Well, that’s a compelling vision. I really look forward to seeing how far you and the team can take it. And thanks for a fascinating discussion.

MOMENNEJAD: Thank you so much.

[MUSIC FADES]


[1] Momennejad notes that repetitive probing allowed her and her colleagues to report the mean and standard deviation of the accuracy over all the responses with corresponding statistics rather than merely reporting the first or the best response.

[2] Momennejad notes that a “convenient and interesting fact about these modules or components or roles is that they’re very similar to some components in reinforcement learning, like actor and critic and tree search. And people have made prefrontal cortex–inspired models in deep learning in the past. This affinity to RL makes it easier to extend this framework to realize various RL algorithms and the sorts of problems one could solve with them using LLMs. Another feature is that they don’t all solve the big problem. There’s an orchestrator that assigns subgoals and passes them on; then the actor’s input and output, or the monitor or evaluator’s input and output, are parts of the problem, not all of it. This makes the many calls to GPT-4 efficient and is comparable to the local view or access of heterogeneous agents, echoing the classic features of a multi-agent framework.”

[3] Momennejad notes that one task she and her colleagues have used is similar to the game Little Alchemy: the players need to find elements, combine them, and create new components. There are multiple levels of hierarchy of innovation that are possible in the game; some of them combine components from different trajectories.

[4] Momennejad notes that this relates to some work she and her colleagues have done building and evaluating AI agents in multi-agent Xbox games like Bleeding Edge, as well.

The post AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad appeared first on Microsoft Research.
