Build financial search applications using the Amazon Bedrock Cohere multilingual embedding model

Enterprises have access to massive amounts of data, much of which is difficult to discover because the data is unstructured. Conventional approaches to analyzing unstructured data use keyword or synonym matching. They don’t capture the full context of a document, making them less effective in dealing with unstructured data.

In contrast, text embeddings use machine learning (ML) capabilities to capture the meaning of unstructured data. Embeddings are generated by representational language models that translate text into numerical vectors and encode contextual information in a document. This enables applications such as semantic search, Retrieval Augmented Generation (RAG), topic modeling, and text classification.

For example, in the financial services industry, applications include extracting insights from earnings reports, searching for information from financial statements, and analyzing sentiment about stocks and markets found in financial news. Text embeddings enable industry professionals to extract insights from documents, minimize errors, and increase their performance.

In this post, we showcase an application that can search and query across financial news in different languages using Cohere’s Embed and Rerank models with Amazon Bedrock.

Cohere’s multilingual embedding model

Cohere is a leading enterprise AI platform that builds world-class large language models (LLMs) and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. They provide ease of use and strong security and privacy controls.

Cohere’s multilingual embedding model generates vector representations of documents for over 100 languages and is available on Amazon Bedrock. This allows AWS customers to access it as an API, which eliminates the need to manage the underlying infrastructure and ensures that sensitive information remains securely managed and protected.

The multilingual model groups text with similar meanings by assigning them positions that are close to each other in a semantic vector space. With a multilingual embedding model, developers can process text in multiple languages without the need to switch between different models, as illustrated in the following figure. This makes processing more efficient and improves performance for multilingual applications.
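
To make this concrete, the following minimal sketch (assuming Amazon Bedrock model access and the cohere_aws client used in the walkthrough later in this post; the example sentences are illustrative) embeds an English sentence and its Spanish equivalent and compares them with cosine similarity. Because the model maps equivalent meanings to nearby points in the shared vector space, the score should be high even though the languages differ.

import numpy as np
import cohere_aws

# Establish the same Bedrock-backed Cohere client used later in this post
co = cohere_aws.Client(mode=cohere_aws.Mode.BEDROCK)
model_id = "cohere.embed-multilingual-v3"

texts = [
    "Quarterly revenue grew faster than analysts expected.",
    "Los ingresos trimestrales crecieron más de lo esperado por los analistas.",
]
embs = co.embed(texts=texts, model_id=model_id, input_type="search_document").embeddings

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically equivalent sentences in different languages land close together,
# so this similarity score should be high
print(cosine(embs[0], embs[1]))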

The following are some of the highlights of Cohere’s embedding model:

  • Focus on document quality – Typical embedding models are trained to measure similarity between documents, but Cohere’s model also measures document quality
  • Better retrieval for RAG applications – RAG applications require a good retrieval system, which Cohere’s embedding model excels at
  • Cost-efficient data compression – Cohere uses a special, compression-aware training method, resulting in substantial cost savings for your vector database

Use cases for text embedding

Text embeddings turn unstructured data into a structured form. This allows you to objectively compare, dissect, and derive insights from all of these documents. The following are example use cases that Cohere’s embedding model enables:

  • Semantic search – Enables powerful search applications when coupled with a vector database, with excellent relevance based on search phrase meaning
  • Search engine for a larger system – Finds and retrieves the most relevant information from connected enterprise data sources for RAG systems
  • Text classification – Supports intent recognition, sentiment analysis, and advanced document analysis
  • Topic modeling – Turns a collection of documents into distinct clusters to uncover emerging topics and themes

Enhanced search systems with Rerank

In enterprises where conventional keyword search systems are already present, how do you introduce modern semantic search capabilities? For such systems that have been part of a company’s information architecture for a long time, a complete migration to an embeddings-based approach is, in many cases, just not feasible.

Cohere’s Rerank endpoint is designed to bridge this gap. It acts as the second stage of a search flow to provide a ranking of relevant documents per a user’s query. Enterprises can retain an existing keyword (or even semantic) system for the first-stage retrieval and boost the quality of search results with the Rerank endpoint in the second-stage reranking.

Rerank provides a fast and straightforward option for improving search results by introducing semantic search technology into a user’s stack with a single line of code. The endpoint also comes with multilingual support. The following figure illustrates the retrieval and reranking workflow.

Solution overview

Financial analysts need to digest a lot of content, such as financial publications and news media, in order to stay informed. According to the Association for Financial Professionals (AFP), financial analysts spend 75% of their time gathering data or administering the process instead of added-value analysis. Finding the answer to a question across a variety of sources and documents is time-intensive and tedious work. The Cohere embedding model helps analysts quickly search across numerous article titles in multiple languages to find and rank the articles that are most relevant to a particular query, saving an enormous amount of time and effort.

In the following use case example, we showcase how Cohere’s Embed model searches and queries across financial news in different languages in one unique pipeline. Then we demonstrate how adding Rerank to your embeddings retrieval (or adding it to a legacy lexical search) can further improve results.

The supporting notebook is available on GitHub.

The following diagram illustrates the workflow of the application.

Enable model access through Amazon Bedrock

Amazon Bedrock users need to request access to models to make them available for use. To request access to additional models, choose Model access in the navigation pane on the Amazon Bedrock console. For more information, see Model access. For this walkthrough, you need to request access to the Cohere Embed Multilingual model.

Install packages and import modules

First, we install the necessary packages and import the modules we’ll use in this example:

!pip install --upgrade cohere-aws hnswlib translate

import pandas as pd
import cohere_aws
import hnswlib
import os
import re
import boto3

Import documents

We use a dataset (MultiFIN) containing a list of real-world article headlines covering 15 languages (English, Turkish, Danish, Spanish, Polish, Greek, Finnish, Hebrew, Japanese, Hungarian, Norwegian, Russian, Italian, Icelandic, and Swedish). This is an open source dataset curated for financial natural language processing (NLP) and is available on a GitHub repository.

In our case, we’ve created a CSV file with MultiFIN’s data as well as a column with translations. We don’t use this column to feed the model; we use it to help us follow along when we print the results for those who don’t speak Danish or Spanish. We point to that CSV to create our dataframe:

url = "https://raw.githubusercontent.com/cohere-ai/cohere-aws/main/notebooks/bedrock/multiFIN_train.csv"
df = pd.read_csv(url)

# Inspect dataset
df.head(5)

Select a list of documents to query

MultiFIN has over 6,000 records in 15 different languages. For our example use case, we focus on three languages: English, Spanish, and Danish. We also sort the headers by length and pick the longest ones.

Because we’re picking the longest articles, we need to make sure their length isn’t due to repeated sequences. The following code shows an example where that is the case; we clean it up next.

df['text'].iloc[2215]

'El 86% de las empresas españolas comprometidas con los Objetivos de Desarrollo 
Sostenible comprometidas con los Objetivos de Desarrollo Sostenible comprometidas 
con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de 
Desarrollo Sostenible'

# Ensure there is no duplicated text in the headers
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

df['text'] = df['text'].apply(remove_duplicates)

# Keep only selected languages
languages = ['English', 'Spanish', 'Danish']
df = df.loc[df['lang'].isin(languages)]

# Pick the top 80 longest articles
df['text_length'] = df['text'].str.len()
df.sort_values(by=['text_length'], ascending=False, inplace=True)
top_80_df = df[:80]

# Language distribution
top_80_df['lang'].value_counts()

Our list of documents is nicely distributed across the three languages:

lang
Spanish    33
English    29
Danish     18
Name: count, dtype: int64

The following is the longest article header in our dataset:

top_80_df['text'].iloc[0]

"CFOdirect: Resultater fra PwC's Employee Engagement Landscape Survey, herunder hvordan 
man skaber mere engagement blandt medarbejdere. Læs desuden om de regnskabsmæssige 
konsekvenser for indkomstskat ifbm. Brexit"

Embed and index documents

Now, we want to embed our documents and store the embeddings. The embeddings are very large vectors that encapsulate the semantic meaning of our document. In particular, we use Cohere’s embed-multilingual-v3.0 model, which creates embeddings with 1,024 dimensions.

When a query is passed, we also embed the query and use the hnswlib library to find the closest neighbors.

It only takes a few lines of code to establish a Cohere client, embed the documents, and create the search index. We also keep track of the language and translation of the document to enrich the display of the results.

# Establish Cohere client
co = cohere_aws.Client(mode=cohere_aws.Mode.BEDROCK)
model_id = "cohere.embed-multilingual-v3"

# Embed documents
docs = top_80_df['text'].to_list()
docs_lang = top_80_df['lang'].to_list()
translated_docs = top_80_df['translated_text'].to_list() #for reference when returning non-English results
doc_embs = co.embed(texts=docs, model_id=model_id, input_type='search_document').embeddings

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=len(doc_embs), ef_construction=512, M=64)
index.add_items(doc_embs, list(range(len(doc_embs))))

Build a retrieval system

Next, we build a function that takes a query as input, embeds it, and finds the three headers most closely related to it:

# Retrieval of 4 closest docs to query
def retrieval(query):
    # Embed query and retrieve results
    query_emb = co.embed(texts=[query], model_id=model_id, input_type="search_query").embeddings
    doc_ids = index.knn_query(query_emb, k=3)[0][0] # we will retrieve the 3 closest neighbors
    
    # Print and append results
    print(f"QUERY: {query.upper()} n")
    retrieved_docs, translated_retrieved_docs = [], []
    
    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(docs[doc_id])
        translated_retrieved_docs.append(translated_docs[doc_id])
    
        # Print results
        print(f"ORIGINAL ({docs_lang[doc_id]}): {docs[doc_id]}")
        if docs_lang[doc_id] != "English":
            print(f"TRANSLATION: {translated_docs[doc_id]} n----")
        else:
            print("----")
    print("END OF RESULTS nn")
    return retrieved_docs, translated_retrieved_docs

Query the retrieval system

Let’s explore what our system does with a couple of different queries. We start with English:

queries = [
    "Are businessess meeting sustainability goals?",
    "Can data science help meet sustainability goals?"
]

for query in queries:
    retrieval(query)

The results are as follows:

QUERY: ARE BUSINESSES MEETING SUSTAINABILITY GOALS? 

ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals 
improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but 
businesses remain on starting blocks for integration and progress
----
ORIGINAL (Spanish): Integrar los criterios ESG y el propósito en la estrategia 
principal reto de los Consejos de las empresas españolas en el mundo post-COVID 

TRANSLATION: Integrate ESG criteria and purpose into the main challenge strategy 
of the Boards of Spanish companies in the post-COVID world 
----
END OF RESULTS 

QUERY: CAN DATA SCIENCE HELP MEET SUSTAINABILITY GOALS? 

ORIGINAL (English): Using AI to better manage the environment could reduce greenhouse 
gas emissions, boost global GDP by up to 38m jobs by 2030
----
ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals 
improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but 
businesses remain on starting blocks for integration and progress
----
END OF RESULTS 

Notice the following:

  • We’re asking related, but slightly different questions, and the model is nuanced enough to present the most relevant results at the top.
  • Our model does not perform keyword-based search, but semantic search. Even if we’re using a term like “data science” instead of “AI,” our model is able to understand what’s being asked and return the most relevant result at the top.

How about a query in Danish? Let’s look at the following query:

query = "Hvor kan jeg finde den seneste danske boligplan?" # "Where can I find the latest Danish property plan?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

QUERY: HVOR KAN JEG FINDE DEN SENESTE DANSKE BOLIGPLAN? 

ORIGINAL (Danish): Nyt fra CFOdirect: Ny PP&E-guide, FAQs om den nye leasingstandard, 
podcast om udfordringerne ved implementering af leasingstandarden og meget mere

TRANSLATION: New from CFOdirect: New PP&E guide, FAQs on the new leasing standard, 
podcast on the challenges of implementing the leasing standard and much more 
----
ORIGINAL (Danish): Lovforslag fremlagt om rentefri lån, udskudt frist for 
lønsumsafgift, førtidig udbetaling af skattekredit og loft på indestående på 
skattekontoen

TRANSLATION: Legislative proposal presented on interest-free loans, deferred payroll 
tax deadline, early payment of tax credit and ceiling on deposits in the tax account 
----
ORIGINAL (Danish): Nyt fra CFOdirect: Shareholder-spørgsmål til ledelsen, SEC 
cybersikkerhedsguide, den amerikanske skattereform og meget mere

TRANSLATION: New from CFOdirect: Shareholder questions for management, the SEC 
cybersecurity guide, US tax reform and more 
----
END OF RESULTS

In the preceding example, the English acronym “PP&E” stands for “property, plant, and equipment,” and our model was able to connect it to our query.

In this case, all returned results are in Danish, but the model can return a document in a language other than the query if its semantic meaning is closer. We have complete flexibility, and with a few lines of code, we can specify whether the model should only look at documents in the language of the query, or whether it should look at all documents.
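
For example, one way to restrict retrieval to the query’s language is to over-fetch neighbors and filter on the language metadata we already track. The helper below is a hypothetical sketch, not part of the notebook:

# Hypothetical helper (not part of the notebook): restrict results to a given language
# by over-fetching neighbors and filtering on the language metadata we already track
index.set_ef(50)  # allow retrieving more neighbors than the default ef setting permits

def retrieval_in_language(query, language, k=3):
    query_emb = co.embed(texts=[query], model_id=model_id, input_type="search_query").embeddings
    candidate_ids = index.knn_query(query_emb, k=10)[0][0]
    kept = [i for i in candidate_ids if docs_lang[i] == language][:k]
    return [docs[i] for i in kept]

# Example: only consider Danish documents for a Danish query
retrieval_in_language("Hvor kan jeg finde den seneste danske boligplan?", "Danish")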

Improve results with Cohere Rerank

Embeddings are very powerful. However, we’re now going to look at how to refine our results even further with Cohere’s Rerank endpoint, which has been trained to score the relevancy of documents against a query.

Another advantage of Rerank is that it can work on top of a legacy keyword search engine. You don’t have to change to a vector database or make drastic changes to your infrastructure, and it only takes a few lines of code. Rerank is available in Amazon SageMaker.

Let’s try a new query. We use SageMaker this time:

query = "Are companies ready for the next down market?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

QUERY: ARE COMPANIES READY FOR THE NEXT DOWN MARKET? 

ORIGINAL (Spanish): El valor en bolsa de las 100 mayores empresas cotizadas cae un 15% 
entre enero y marzo pero aguanta el embate del COVID-19 

TRANSLATION: The stock market value of the 100 largest listed companies falls 15% 
between January and March but withstands the onslaught of COVID-19 
----
ORIGINAL (English): 69% of business leaders have experienced a corporate crisis in the 
last five years yet 29% of companies have no staff dedicated to crisis preparedness
----
ORIGINAL (English): As work sites slowly start to reopen, CFOs are concerned about the 
global economy and a potential new COVID-19 wave - PwC survey
----
END OF RESULTS

In this case, a semantic search was able to retrieve our answer and display it in the results, but it’s not at the top. However, when we pass the query again to our Rerank endpoint with the list of docs retrieved, Rerank is able to surface the most relevant document at the top.

First, we create the client and the Rerank endpoint:

# map model package arn
import boto3
cohere_package = "cohere-rerank-multilingual-v2--8b26a507962f3adb98ea9ac44cb70be1" # replace this with your info

model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name
if region not in model_package_map.keys():
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

co = cohere_aws.Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual", instance_type="ml.g4dn.xlarge", n_instances=1)

When we pass the documents to Rerank, the model is able to pick the most relevant one accurately:

results = co.rerank(query=query, documents=retrieved_docs, top_n=1)

for hit in results:
    print(hit.document['text'])

69% of business leaders have experienced a corporate crisis in the last five years yet 
29% of companies have no staff dedicated to crisis preparedness

Conclusion

This post presented a walkthrough of using Cohere’s multilingual embedding model in Amazon Bedrock in the financial services domain. In particular, we demonstrated an example of a multilingual financial articles search application. We saw how the embedding model enables efficient and accurate discovery of information, thereby boosting the productivity and output quality of an analyst.

Cohere’s multilingual embedding model supports over 100 languages. It removes the complexity of building applications that require working with a corpus of documents in different languages. The Cohere Embed model is trained to deliver results in real-world applications. It handles noisy data as inputs, adapts to complex RAG systems, and delivers cost-efficiency from its compression-aware training method.

Start building with Cohere’s multilingual embedding model in Amazon Bedrock today.


About the Authors

James Yi is a Senior AI/ML Partner Solutions Architect in the Technology Partners COE Tech team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy, and scale AI/ML applications to derive business value. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Gonzalo Betegon is a Solutions Architect at Cohere, a provider of cutting-edge natural language processing technology. He helps organizations address their business needs through the deployment of large language models.

Meor Amer is a Developer Advocate at Cohere, a provider of cutting-edge natural language processing (NLP) technology. He helps developers build cutting-edge applications with Cohere’s Large Language Models (LLMs).

Can large language models identify and correct their mistakes?

LLMs are increasingly popular for reasoning tasks, such as multi-turn QA, task completion, code generation, or mathematics. Yet much like people, they do not always solve problems correctly on the first try, especially on tasks for which they were not trained. Therefore, for such systems to be most useful, they should be able to 1) identify where their reasoning went wrong and 2) backtrack to find another solution.

This has led to a surge in methods related to self-correction, where an LLM is used to identify problems in its own output, and then produce improved results based on the feedback. Self-correction is generally thought of as a single process, but we decided to break it down into two components, mistake finding and output correction.

In “LLMs cannot find reasoning errors, but can correct them!”, we test state-of-the-art LLMs on mistake finding and output correction separately. We present BIG-Bench Mistake, an evaluation benchmark dataset for mistake identification, which we use to address the following questions:

  1. Can LLMs find logical mistakes in Chain-of-Thought (CoT) style reasoning?
  2. Can mistake-finding be used as a proxy for correctness?
  3. Knowing where the mistake is, can LLMs then be prompted to backtrack and arrive at the correct answer?
  4. Can mistake finding as a skill generalize to tasks the LLMs have never seen?

About our dataset

Mistake finding is an underexplored problem in natural language processing, with a particular lack of evaluation tasks in this domain. To best assess the ability of LLMs to find mistakes, evaluation tasks should exhibit mistakes that are non-ambiguous. To our knowledge, most current mistake-finding datasets do not go beyond the realm of mathematics for this reason.

To assess the ability of LLMs to reason about mistakes outside of the math domain, we produce a new dataset for use by the research community, called BIG-Bench Mistake. This dataset consists of Chain-of-Thought traces generated using PaLM 2 on five tasks in BIG-Bench. Each trace is annotated with the location of the first logical mistake.

To maximize the number of mistakes in our dataset, we sample 255 traces where the answer is incorrect (so we know there is definitely a mistake), and 45 traces where the answer is correct (so there may or may not be a mistake). We then ask human labelers to go through each trace and identify the first mistake step. Each trace has been annotated by at least three labelers, whose answers had inter-rater reliability levels of >0.98 (using Krippendorff’s α). Human labeling was done for all tasks except the Dyck Languages task, which involves predicting the sequence of closing parentheses for a given input sequence; we labeled that task algorithmically.

The logical errors in this dataset are simple and unambiguous, providing a good benchmark for testing an LLM’s ability to find its own mistakes before moving on to harder, more ambiguous tasks.

Core questions about mistake identification

1. Can LLMs find logical mistakes in Chain-of-Thought style reasoning?

First, we want to find out if LLMs can identify mistakes independently of their ability to correct them. We attempt multiple prompting methods to test GPT series models for their ability to locate mistakes (prompts here) under the assumption that they are generally representative of modern LLM performance.

Generally, we found these state-of-the-art models perform poorly, with the best model achieving 52.9% accuracy overall. Hence, there is a need to improve LLMs’ ability in this area of reasoning.

In our experiments, we try three different prompting methods: direct (trace), direct (step) and CoT (step). In direct (trace), we provide the LLM with the trace and ask for the location step of the mistake or no mistake. In direct (step), we prompt the LLM to ask itself this question for each step it takes. In CoT (step), we prompt the LLM to give its reasoning for whether each step is a mistake or not a mistake.

A diagram showing the three prompting methods direct (trace), direct (step) and CoT (step).
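
To make the distinction concrete, the templates below are only a schematic illustration of the three prompting styles; the actual prompts used in the paper are linked above.

# Schematic illustration only; the actual prompts are linked in the paper
direct_trace = (
    "Here is a chain-of-thought trace:\n{trace}\n"
    "Which step contains the first logical mistake? "
    "Answer with the step number, or 'No mistake'."
)

direct_step = (
    "Here is the reasoning so far:\n{steps_so_far}\n"
    "Is the last step a mistake? Answer 'Yes' or 'No'."
)

cot_step = (
    "Here is the reasoning so far:\n{steps_so_far}\n"
    "Think through whether the last step is logically valid, explain your reasoning, "
    "then answer 'Mistake' or 'Not a mistake'."
)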

Our finding is in line with, and builds upon, prior results, but goes further in showing that LLMs struggle with even simple and unambiguous mistakes (for comparison, our human raters without prior expertise solve the problem with a high degree of agreement). We hypothesize that this is a major reason why LLMs are unable to self-correct reasoning errors. See the paper for the full results.

2. Can mistake-finding be used as a proxy for correctness of the answer?

When people are confronted with a problem where they are unsure of the answer, they can work through their solutions step by step. If no error is found, they can assume that they did the right thing.

While we hypothesized that this would work similarly for LLMs, we discovered that this is a poor strategy. On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78.
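
A quick back-of-the-envelope check, using only the 85%/15% split stated above, shows where that number comes from:

# Back-of-the-envelope check of the naïve baseline on an 85%/15% split:
# always predicting "incorrect" gives the "incorrect" class precision 0.85 and recall 1.0,
# and the "correct" class an F1 of 0 because it is never predicted.
f1_incorrect = 2 * 0.85 * 1.0 / (0.85 + 1.0)   # ≈ 0.92
f1_correct = 0.0
weighted_f1 = 0.85 * f1_incorrect + 0.15 * f1_correct
print(round(weighted_f1 * 100))                # ≈ 78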

A diagram showing how well mistake-finding with LLMs can be used as a proxy for correctness of the answer on each dataset.

3. Can LLMs backtrack knowing where the error is?

Since we’ve shown that LLMs exhibit poor performance in finding reasoning errors in CoT traces, we want to know whether LLMs can even correct errors at all, even if they know where the error is.

Note that knowing the mistake location is different from knowing the right answer: CoT traces can contain logical mistakes even if the final answer is correct, or vice versa. In most real-world situations, we won’t know what the right answer is, but we might be able to identify logical errors in intermediate steps.

We propose the following backtracking method:

  1. Generate CoT traces as usual, at temperature = 0. (Temperature is a parameter that controls the randomness of generated responses, with higher values producing more diverse and creative outputs, usually at the expense of quality.)
  2. Identify the location of the first logical mistake (for example with a classifier, or here we just use labels from our dataset).
  3. Re-generate the mistake step at temperature = 1 and produce a set of eight outputs. Since the original output is known to lead to incorrect results, the goal is to find an alternative generation at this step that is significantly different from the original.
  4. From these eight outputs, select one that is different from the original mistake step. (We just use exact matching here, but in the future this can be something more sophisticated.)
  5. Using the new step, generate the rest of the trace as normal at temperature = 0.

It’s a very simple method that does not require any additional prompt crafting and avoids having to re-generate the entire trace. We test it using the mistake location data from BIG-Bench Mistake, and we find that it can correct CoT errors.
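
The following is a rough sketch of the procedure above; the generate function is a hypothetical stand-in for an LLM sampling call, not a real API.

# Minimal sketch of the backtracking procedure described above.
# `generate` is a hypothetical stand-in for an LLM sampling call; it is not a real API.
def backtrack(question, trace, mistake_step, generate, n_candidates=8, max_steps=20):
    prefix = trace[:mistake_step]          # steps before the first mistake
    original = trace[mistake_step]

    # Steps 3-4: re-sample the mistaken step at temperature 1 and keep a different candidate
    candidates = [generate(question, prefix, temperature=1.0) for _ in range(n_candidates)]
    replacement = next((c for c in candidates if c != original), original)

    # Step 5: continue the trace greedily (temperature 0) from the new step
    new_trace = prefix + [replacement]
    for _ in range(max_steps):
        step = generate(question, new_trace, temperature=0.0)
        if step is None:                   # generation signals the trace is complete
            break
        new_trace.append(step)
    return new_trace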

Recent work showed that self-correction methods, like Reflexion and RCI, cause deterioration in accuracy scores because there are more correct answers becoming incorrect than vice versa. Our method, on the other hand, produces more gains (by correcting wrong answers) than losses (by changing right answers to wrong answers).

We also compare our method with a random baseline, where we randomly assume a step to be a mistake. Our results show that this random baseline does produce some gains, but not as much as backtracking with the correct mistake location, and with more losses.

A diagram showing the gains and losses in accuracy for our method as well as a random baseline on each dataset.

4. Can mistake finding generalize to tasks the LLMs have never seen?

To answer this question, we fine-tuned a small model on four of the BIG-Bench tasks and tested it on the fifth, held-out task. We do this for every task, producing five fine-tuned models in total. Then we compare the results with just zero-shot prompting PaLM 2-L-Unicorn, a much larger model.

Bar chart showing the accuracy improvement of the fine-tuned small model compared to zero-shot prompting with PaLM 2-L-Unicorn.

Our results show that the much smaller fine-tuned reward model generally performs better than zero-shot prompting a large model, even though the reward model has never seen data from the task in the test set. The only exception is logical deduction, where it performs on par with zero-shot prompting.

This is a very promising result as we can potentially just use a small fine-tuned reward model to perform backtracking and improve accuracy on any task, even if we don’t have the data for it. This smaller reward model is completely independent of the generator LLM, and can be updated and further fine-tuned for individual use cases.

An illustration showing how our backtracking method works.

Conclusion

In this work, we created an evaluation benchmark dataset that the wider academic community can use to evaluate future LLMs. We further showed that LLMs currently struggle to find logical errors. However, if they could, we show the effectiveness of backtracking as a strategy that can provide gains on tasks. Finally, a smaller reward model can be trained on general mistake-finding tasks and be used to improve out-of-domain mistake finding, showing that mistake-finding can generalize.

Acknowledgements

Thank you to Peter Chen, Tony Mak, Hassan Mansoor and Victor Cărbune for contributing ideas and helping with the experiments and data collection. We would also like to thank Sian Gooding and Vicky Zayats for their comments and suggestions on the paper.

Ball position tracking in the cloud with the PGA TOUR

The PGA TOUR continues to enhance the golf experience with real-time data that brings fans closer to the game. To deliver even richer experiences, they are pursuing the development of a next-generation ball position tracking system that automatically tracks the position of the ball on the green.

The TOUR currently uses ShotLink powered by CDW, a premier scoring system that uses a complex camera system with on-site compute, to closely track the start and end position of every shot. The TOUR wanted to explore computer vision and machine learning (ML) techniques to develop a next-generation cloud-based pipeline to locate golf balls on the putting green.

The Amazon Generative AI Innovation Center (GAIIC) demonstrated the effectiveness of these techniques in an example dataset from a recent PGA TOUR event. The GAIIC designed a modular pipeline cascading a series of deep convolutional neural networks that successfully localizes players within a camera’s field of view, determines which player is putting, and tracks the ball as it moves toward the cup.

In this post, we describe the development of this pipeline, the raw data, the design of the convolutional neural networks comprising the pipeline, and an evaluation of its performance.

Data

The TOUR provided 3 days of continuous video from a recent tournament, captured by three 4K cameras positioned around the green on one hole. The following figure shows a frame from one camera cropped and zoomed so that the player putting is easily visible. Note that despite the high resolution of the cameras, because of the distance from the green, the ball appears small (usually 3×3, 4×4, or 5×5 pixels), and targets of this size can be difficult to localize accurately.

In addition to the camera feeds, the TOUR provided the GAIIC with annotated scoring data on each shot, including the world location of its resting position and the timestamp. This allowed for visualizations of every putt on the green, as well as the ability to pull all of the video clips of players putting, which could be manually labeled and used to train the detection models that make up the pipeline. The following figure shows the three camera views with approximate putt path overlays, counterclockwise from top left. The pin is moved each day, where day 1 corresponds to blue, day 2 to red, and day 3 to orange.

Pipeline overview

The overall system consists of both a training pipeline and an inference pipeline. The following diagram illustrates the architecture of the training pipeline. The starting point is ingestion of video data, either from a streaming module like Amazon Kinesis for live video or by placing it directly into Amazon Simple Storage Service (Amazon S3) for historical video. The training pipeline requires video preprocessing and hand labeling of images with Amazon SageMaker Ground Truth. Models can be trained with Amazon SageMaker and their artifacts stored in Amazon S3.

The inference pipeline, shown in the following diagram, consists of a number of modules that successively extract information from the raw video and ultimately predict the world coordinates of the ball at rest. Initially, the green is cropped from the larger field of view from each camera, in order to cut down on the pixel area in which the models must search for players and balls. Next, a deep convolutional neural network (CNN) is used to find the locations of people in the field of view. Another CNN is used to predict which type of person has been found in order to determine whether anyone is about to putt. After a likely putter has been localized in the field of view, the same network is used to predict the location of the ball near the putter. A third CNN tracks the ball during its motion, and lastly, a transformation function from camera pixel position to GPS coordinates is applied.
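
The following sketch illustrates that modular structure only; every function name is a hypothetical placeholder for one of the fine-tuned CNNs or the coordinate transform, not the GAIIC implementation.

# Hypothetical placeholders illustrating the modular structure described above;
# each stage would wrap one of the fine-tuned CNNs or the pixel-to-world transform.
def locate_ball(frames, crop_green, detect_people, classify_person,
                detect_ball, track_ball, pixels_to_world):
    first = crop_green(frames[0])                    # restrict the search area to the green
    people = detect_people(first)                    # person-detection CNN
    putter = next(p for p in people
                  if classify_person(first, p) == "player-putting")
    ball_box = detect_ball(first, around=putter)     # search only near the putting player
    pixel_path = track_ball(frames, start=ball_box)  # motion-tracking CNN over the clip
    return [pixels_to_world(xy) for xy in pixel_path]  # camera pixels -> world coordinates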

Player detection

Although it would be possible to run a CNN for ball detection over an entire 4K frame at a set interval, given the angular size of the ball at these camera distances, any small white object triggers a detection, resulting in many false alarms. To avoid searching the entire image frame for the ball, it’s possible to take advantage of correlations between player pose and ball location. A ball that is about to be putted must be next to a player, so finding the players in the field of view will greatly restrict the pixel area in which the detector must search for the ball.

We were able to use a CNN that was pre-trained to predict bounding boxes around all the people in a scene, as shown in the following figure. Unfortunately, there is frequently more than one ball on the green, so further logic is required beyond simply finding all people and searching for a ball. This requires another CNN to find the player who is currently putting.

Player classification and ball detection

To further narrow down where the ball could be, we fine-tuned a pre-trained object-detection CNN (YOLO v7) to classify all the people on the green. An important component of this process was manually labeling a set of images using SageMaker Ground Truth. The labels allowed the CNN to classify the player putting with high accuracy. In the labeling process, the ball was also outlined along with the player putting, so this CNN was able to perform ball detection as well, drawing an initial bounding box around the ball before a putt and feeding the position information into the downstream ball tracking CNN.

We use four different labels to annotate the objects in the images:

  • player-putting – The player holding a club and in the putting position
  • player-not-putting – The player not in the putting position (may also be holding a club)
  • other-person – Any other person who is not a player
  • golf-ball – The golf ball

The following figure shows how a CNN fine-tuned using labels from SageMaker Ground Truth classifies each person in the field of view. This is difficult because of the wide range of visual appearances of players, caddies, and fans. After a player was classified as putting, a CNN fine-tuned for ball detection was applied to the small area immediately around that player.

Ball path tracking

A third CNN, a ResNet architecture pre-trained for motion tracking, was used for tracking the ball after it was putted. Motion tracking is a thoroughly researched problem, so this network performed well when integrated into the pipeline without further fine-tuning.

Pipeline output

The cascade of CNNs places bounding boxes around people, classifies people on the green, detects the initial ball position, and tracks the ball once it begins moving. The following figure shows the labeled video output of the pipeline. The pixel positions of the ball as it moves are tracked and recorded. Note that people on the green are being tracked and outlined by bounding boxes; the putter at the bottom is labeled correctly as “player putting,” and the moving ball is being tracked and outlined by a small blue bounding box.

Performance

To assess performance of components of the pipeline, it’s necessary to have labeled data. Although we were provided with the ground truth world position of the ball, we didn’t have intermediate points for ground truth, like the final pixel position of the ball or the pixel location of the player putting. With the labeling job that we carried out, we developed ground truth data for these intermediate outputs of the pipeline that allow us to measure performance.

Player classification and ball detection accuracy

For detection of the player putting and the initial ball location, we labeled a dataset and fine-tuned a YOLO v7 CNN model as described earlier. The model classified the output from the previous person detection module into four classes: a player putting, a player not putting, other people, and the golf ball, as shown in the following figure.

The performance of this module is assessed with a confusion matrix, shown in the following figure. The values in the diagonal boxes show how often the predicted class matched the actual class from the ground truth labels. The model has 89% recall or better for each person class, and 79% recall for golf balls (which is to be expected because the model is pre-trained on examples with people but not on examples with golf balls; this could be improved with more labeled golf balls in the training set).

The next step is to trigger the ball tracker. Because the ball detection output is a confidence probability, it’s also possible to set the threshold for “detected ball” and observe how that changes the results, summarized in the following figure. There is a trade-off in this method because a higher threshold will necessarily have fewer false alarms but also miss some of the less certain examples of balls. We tested thresholds of 20% and 50% confidence, and found ball detection rates of 78% and 61%, respectively. By this measure, the 20% threshold is better. The trade-off is apparent in that for the 20% confidence threshold, 80% of total detections were actually balls (20% false positives), whereas for the 50% confidence threshold, 90% were balls (10% false positives). For fewer false positives, the 50% confidence threshold is better. Both of these measures could be improved with more labeled data for a larger training set.
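
As an illustration of this thresholding trade-off (using hypothetical detections, not the tournament data), a simple sweep looks like the following:

# Illustrative only: sweep a confidence threshold over a list of hypothetical detections,
# where each detection is a (confidence, is_actually_ball) pair
def sweep_thresholds(detections, thresholds=(0.2, 0.5)):
    for t in thresholds:
        kept = [d for d in detections if d[0] >= t]
        true_balls = sum(1 for _, is_ball in kept if is_ball)
        precision = true_balls / len(kept) if kept else 0.0
        print(f"threshold={t:.0%}: {len(kept)} detections kept, "
              f"{precision:.0%} are actually balls")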

The detection pipeline throughput is on the order of 10 frames per second, so in its current form, a single instance is not fast enough to run continuously on the input at 50 frames per second. Achieving the 7-second mark for output after the ball stops would require further optimization for latency, perhaps by running multiple versions of the pipeline in parallel and compressing the CNN models via quantization (for example).

Ball path tracking accuracy

The pre-trained CNN model from MMTracking works well, but there are interesting failure cases. The following figure shows a case where the tracker starts on the ball, expands its bounding box to include both the putter head and ball, and then unfortunately tracks the putter head and forgets the ball. In this case, the putter head appears white (possibly due to specular reflection), so the confusion is understandable; labeled data for tracking and fine-tuning of the tracking CNN could help improve this in the future.

Conclusion

In this post, we discussed the development of a modular pipeline that localizes players within a camera’s field of view, determines which player is putting, and tracks the ball as it moves toward the cup.

For more information about AWS collaboration with the PGA TOUR, refer to PGA TOUR tees up with AWS to reimagine the fan experience.


About the Authors

James Golden is an applied scientist at Amazon Bedrock with a background in machine learning and neuroscience.

Henry Wang is an applied scientist at Amazon Generative AI Innovation Center, where he researches and builds generative AI solutions for AWS customers. He focuses on sports and media & entertainment industries, and has worked with various sports leagues, teams and broadcasters in the past. During his spare time, he likes to play tennis and golf.

Tryambak Gangopadhyay is an Applied Scientist at the AWS Generative AI Innovation Center, where he collaborates with organizations across a diverse spectrum of industries. His role involves conducting research and developing Generative AI solutions to address crucial business challenges and accelerate AI adoption.

TaskWeaver: A code-first agent framework for efficient data analytics and domain adaptation

The advent of large language models (LLMs) has revolutionized human-machine interactions, particularly in natural language understanding and generation applications. These AI- or LLM-backed virtual assistants hold the promise of serving as intelligent agents capable of autonomously reasoning, observing, and performing tasks articulated in natural language. However, it is still challenging for most agent frameworks to efficiently handle complex data structures (e.g., DataFrame), which are prevalent in data analytics tasks and domain-specific scenarios. 

To address these challenges, we introduce TaskWeaver, a code-first agent framework that converts natural language user requests into executable code, with additional support for rich data structures, dynamic plugin selection, and a domain-adapted planning process. Now publicly available as an open-source framework, TaskWeaver leverages LLMs’ coding capability to implement complex logic and incorporates domain-specific knowledge through customizable examples and plugins. TaskWeaver empowers users to easily build their own virtual assistants that understand diverse domain questions, follow examples, and execute customizable algorithms on complex data structures efficiently.

Motivating example – Anomaly detection on time-series data

Scenario

Amy is a business analyst who wants to identify anomalies in a time series of sales data stored in a SQL database. She would like help from an AI assistant for this task through natural language interactions. Moreover, Amy would like to apply her own definition and interpretation of anomalies in the context of sales data, including a customized anomaly detection algorithm. Figure 1 shows the desired conversation between the user and the AI assistant: the assistant should be able to first pull the data from the target database, then apply the desired algorithms, and return the visualized results.

Figure 1. Sample chat between the user and the AI assistant: the user asks to pull data from a database and apply anomaly detection; the assistant accomplishes the task through multiple rounds of communication and finally plots the detected anomalies.

Requirements for an agent framework

To accomplish the task above, we identify several key requirements that current agent frameworks may lack:

  • Plugin: The agent needs to first query and collect data from a database, and then detect anomalies using a specialized anomaly detection algorithm. Both require the capability to define and invoke custom plugins, e.g., the query_database plugin and the anomaly_detection plugin. 
  • Rich data structures: The agent should be capable of handling data in complex but common structures, such as arrays, matrices, and tabular data (e.g., pandas DataFrame), to perform advanced data processing actions. Many existing frameworks transform intermediate outputs into strings in the prompt or save them as local files before reading them again. This practice is error-prone and can easily exceed the prompt token limit. Additionally, rich data structures should transfer easily from one plugin to another.
  • Stateful execution: The agent engages in iterative interactions with the user, processing user inputs and executing tasks accordingly. The execution states should be preserved throughout the entire conversation session across multiple chat rounds. 
  • Reasoning and acting (ReAct): The agent should have the capability to first observe and then act. The database might contain data of various schemas in the real world, leading to different arguments for anomaly detection. Therefore, the agent must first inspect the data schema, understand which columns are appropriate (and ask users to confirm), then feed the corresponding column names into the anomaly detection algorithm.
  • Arbitrary code generation: The agent should be able to generate code to accommodate ad-hoc user demands, which are not covered by the pre-defined plugins. In the example provided, the agent generates code to visualize the detected anomalies without using any plugins.  
  • Incorporating domain knowledge: The agent should provide a systematic way to incorporate domain-specific knowledge. It would help LLMs deliver better planning and accurate tool calling, which in turn produces reliable results, particularly in domain-specific scenarios.
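
For instance, the anomaly_detection plugin mentioned in the first requirement might wrap logic like the following framework-agnostic sketch; this is not TaskWeaver’s plugin interface, and the 3-sigma rule is only an example choice.

import pandas as pd

# Framework-agnostic sketch of the kind of logic an anomaly_detection plugin might wrap;
# this is not TaskWeaver's plugin interface, and the 3-sigma rule is only an example choice
def detect_anomalies(df: pd.DataFrame, time_col: str, value_col: str) -> pd.DataFrame:
    z_scores = (df[value_col] - df[value_col].mean()) / df[value_col].std()
    result = df[[time_col, value_col]].copy()
    result["is_anomaly"] = z_scores.abs() > 3
    # Returning a DataFrame lets the output flow directly to the next plugin or code snippet
    return result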


TaskWeaver architecture

Figure 2 shows the three core components in the TaskWeaver architecture. The Planner serves as the system’s entry point and interacts with the user. Its responsibilities include: (1) planning – breaking down the user’s request into subtasks and managing the execution process with self-reflection; and (2) responding – transforming the execution result into a human-readable response for the user. The Code Interpreter consists of two components: the Code Generator generates code for a given subtask from the Planner, considering existing plugins and domain-specific task examples; the Code Executor is responsible for executing the generated code and maintaining the execution state throughout the entire session.

Figure 2. Overview of TaskWeaver: the Planner, the Code Generator, and the stateful Code Executor communicate with each other to accomplish the user’s request.

Running workflow for motivating example

TaskWeaver has a two-layer planning process for dealing with user requests. In the first layer, the Planner generates a high-level plan outlining the steps required to fulfill the request. In each subsequent round, the Code Generator devises a plan, expressed as chain-of-thought reasoning and generated code, to execute the specified step. Figure 3 presents the internal workflow of TaskWeaver when accomplishing the motivating example mentioned above. Note that the prompts shown in Figure 3 are simplified and do not represent the full instructions.

The initial step involves the Planner taking the user query, Code Interpreter description, and planning examples (if provided) to generate a plan. For the given example, the plan first pulls data from the database and describes the data schema. The Code Generator prompt delineates its profile and competencies, providing definitions of all relevant plugins (e.g., function name, description, arguments and return values.) The output from the Code Generator is a code snippet that executes the sql_pull_data plugin, retrieves the data into a DataFrame, and provides a description of the data schema.

Next, the code generated is sent to the Code Executor for execution, after which the result is sent back to the Planner to determine the next planning step. In the example, the execution result reveals two columns, namely date and value, in the DataFrame. For the next step, the Planner can either confirm with the user if these columns correspond to the two input parameters of the anomaly_detection plugin, or directly proceed to the next step.

Figure 3. Workflow of TaskWeaver: the Planner generates a plan from the Planner description, Code Interpreter description, and plugin definitions; each step is passed to the Code Generator to produce Python code, which the Code Executor runs; after collecting the execution results, the Planner responds to the user.

Key design considerations of TaskWeaver

  • Code-first analytics experience: TaskWeaver converts user requests into Python programs that run on dedicated processes, where the Python program can be plugin invocations, arbitrary code to handle ad-hoc user queries, or both. Unlike other frameworks that rely on text or file-based expressions, TaskWeaver can fully utilize native data structures such as pandas DataFrame and numpy ndarray that exist in the memory. This makes it easy to perform tasks such as pulling data from a database, running machine learning algorithms (e.g., anomaly detection, classification, or clustering), summarizing results, and visualizing analysis outcomes. 
  • Domain adaptation: Incorporating domain-specific knowledge into the model via prompts can help boost LLMs’ performance when the user query is complex. TaskWeaver provides two options to make customizations with the user’s domain knowledge:
    • Customization with plugins: Users can define custom plugins (including Python implementation and schema) to incorporate domain knowledge, such as pulling data from a specific database, and running a dedicated algorithm.
    • Customization with examples: TaskWeaver also provides an easy-to-implement interface (in YAML format) for users to configure examples to teach the LLMs how to respond to certain requests. The examples can be of two categories: one is used for planning and the other is for code generation.
  • Stateful code execution: When users make ad-hoc requests for data analytics, it often involves multiple iterations. As a result, TaskWeaver maintains the state of code execution throughout the entire session. This is like programming in Python using the Jupyter Notebook, where users type code snippets in a sequence of cells and the program’s internal state progresses sequentially. The difference in TaskWeaver is that users use natural language instead of programming language. TaskWeaver converts each user request into one or more code snippets in each round, depending on the specific plan. 
  • Others: TaskWeaver also supports other features such as intelligent plan decomposition and self-reflection to respond to a user’s request in a more reliable and organized manner. Moreover, features like restricted code generation can help limit the capabilities of the generated code to reduce security risks.

Getting started

TaskWeaver is now publicly available on GitHub. You may run the following commands to quickly get started.

git clone https://github.com/microsoft/TaskWeaver.git 
cd TaskWeaver 
pip install -r requirements.txt   # install the requirements 

Once the installation is finished, users can configure key parameters, such as the LLM endpoint, key, and model, and start the TaskWeaver service easily by following the running examples.


AI Takes Center Stage: Survey Reveals Financial Industry’s Top Trends for 2024

The financial services industry is undergoing a significant transformation with the adoption of AI technologies. NVIDIA’s fourth annual State of AI in Financial Services Report provides insights into the current landscape and emerging trends for 2024.

The report reveals that an overwhelming 91% of financial services companies are either assessing AI or already using it in production. These firms are using AI to drive innovation, improve operational efficiency and enhance customer experiences.

Portfolio optimization, fraud detection and risk management remain top AI use cases, while generative AI is quickly gaining popularity with organizations keen to uncover new efficiencies.

Below are the report’s key findings, which show how the financial services industry is evolving as advanced AI becomes more accessible.

Generative AI and Large Language Models Are on the Rise

Reflecting a macro-trend seen across industries, large language models (LLMs) and generative AI have emerged as significant areas of interest for financial services companies. Fifty-five percent of survey respondents reported that they were actively seeking generative AI workflows for their companies.

Organizations are exploring generative AI and LLMs for an array of applications ranging from marketing and sales — ad copy, email copy and content production — to synthetic data generation. Of these use cases, 37% of respondents showed interest in report generation, synthesis and investment research to cut down on repetitive manual work.

Customer experience and engagement was another sought-after use case, cited by 34% of respondents. This suggests that financial services institutions are exploring chatbots, virtual assistants and recommendation systems to enhance the customer experience.

AI Is Having an Impact Across Departments and Disciplines

With 75% of survey respondents considering their organization’s AI capabilities to be industry leading or middle of the pack, financial services organizations are becoming more confident in their ability to build, deploy and extract value from AI implementations.

The most popular uses for AI were in operations, risk and compliance, and marketing. To improve operational efficiency, financial organizations are using AI to automate manual processes, enhance data analysis and inform investment decisions.

To enhance risk and compliance, they’re deploying AI to analyze vast amounts of data to identify suspicious activities and anomalous transaction patterns. They’re also using AI to analyze customer data to predict preferences and deliver personalized marketing campaigns, educational content and targeted promotions.

Companies are already seeing results. Forty-three percent of financial services professionals indicated that AI had improved their operational efficiency, while 42% felt it had helped their business build a competitive advantage.

A Shift in the Headwinds

In previous years, the number one challenge respondents reported was recruiting AI experts and data scientists. This year, 30% more survey participants responded that data-related challenges were the primary concern. This includes data privacy challenges, data sovereignty and data scattered around the globe governed by different oversight regulations.

The growing attention to these issues reflects the advancing power and complexity of AI models, which require huge, diverse datasets to train, as well as increasing regulatory scrutiny and emphasis on responsible AI.

Recruiting and retaining AI experts remains a challenge, as do budget concerns. But more than 60% of respondents are still planning to increase investment in computing infrastructure or optimizing AI workflows, underscoring the importance of these tools in quickly building and deploying trustworthy AI to overcome these barriers.

Paving the Way for Future Investments

By and large, the survey results paint a positive picture of AI bringing greater efficiency to operations, personalization to customer engagements, and precision to investment decisions.

Finance professionals agree. Eighty-six percent of respondents reported a positive impact on revenue, while 82% noted a reduction in costs. Fifty-one percent strongly agreed that AI would be important to their company’s future success, a 76% increase from last year.

With this positive outlook, 97% of companies plan to invest more in AI technologies in the near future. Focus areas for future investments include identifying additional AI use cases, optimizing AI workflows and increasing infrastructure spending.

To build and scale impactful AI across the enterprise, financial services organizations need a comprehensive AI platform that empowers data scientists, quants and developers to seamlessly collaborate while minimizing obstacles. To that end, executives are investing more in AI infrastructure and prioritizing high-yield AI use cases to improve employee productivity while delivering superior customer experiences and investment results.

Download the “State of AI in Financial Services: 2024 Trends” report for in-depth results and insights.

Explore NVIDIA’s AI solutions and enterprise-level AI platforms for delivering smarter, more secure financial services and the AI-powered bank.

Read More

To the Cloud and Beyond: New Activision and Blizzard Games, Day Passes and G-SYNC Technology Coming to GeForce NOW

To the Cloud and Beyond: New Activision and Blizzard Games, Day Passes and G-SYNC Technology Coming to GeForce NOW

GFN Thursday recaps the latest cloud announcements from CES 2024 — Day Pass memberships, Cloud G-SYNC technology, expanded NVIDIA Reflex support and more.

The new year brings new adventures to the cloud for members, including Diablo IV and Overwatch 2 from Blizzard, Exoprimal from Capcom, Honkai: Star Rail from HoYoverse and Pax Dei from Mainframe Industries.

Plus, no GFN Thursday is complete without new games. Get ready for ten new titles joining the cloud this week.

Cloud’s-Eye View of CES

CES 2024 has come to a close, and GeForce NOW members have a lot to look forward to.

Coming in February, day passes for Ultimate and Priority memberships will offer a new way for members to play at up to GeForce RTX 4080 quality for up to 24 hours. The Ultimate Day Pass will be available for $7.99 and the Priority Day Pass for $3.99, giving gamers all the benefits of the respective membership before they decide to commit to the better-value one-month or six-month memberships.

G-SYNC comes to GeForce NOW
Nothing but a G-SYNC, baby.

Cloud G-SYNC support will match the display refresh rate of variable refresh rate monitors and G-SYNC-compatible monitors to the streaming rate. Paired with new 60 and 120 frames per second streaming options for GeForce NOW Reflex mode, this makes cloud gaming experiences nearly indistinguishable from using a local PC.

Ultimate members will be able to turn their phones into portable gaming rigs with support for 1440p resolutions on compatible Android phones, as well as updated keyboard and mouse support connected through a USB hub. Thanks to the cloud, these smartphones are now capable of PC gaming at Ultimate quality.

Worldwide Expansion for GeForce NOW
The cloud’s drifting into Japan.

GeForce NOW will also soon expand to Japan, operating alongside GeForce NOW Alliance partner KDDI. This will enable gamers across the country to play their favorite PC games in the cloud with Ultimate performance. Learn more and sign up for notifications.

Here Comes the Blizzard

That’s not all: GeForce NOW is bringing even more top titles to the cloud from celebrated publishers.

Following the recent release of Call of Duty, the latest games from top developer Blizzard Entertainment are coming soon to GeForce NOW. Members will be able to play the Steam versions of Diablo IV and Overwatch 2 on nearly any device with the power of a GeForce RTX 4080 rig in the cloud, with support for Battle.net coming soon.

Diablo IV on GeForce NOW
Someone check the weather in hell — seems pretty cold.

Join the fight for Sanctuary in Diablo IV. Fight the forces of hell while discovering countless abilities to master, legendary loot to gather and nightmarish dungeons full of evil enemies to vanquish. Explore a shared open world where players can form their own armies to take down World Bosses, or join the fray in player vs. player zones to test skills against others.

Team up and answer the call of heroes in Overwatch 2, a free-to-play shooter featuring 30+ epic heroes, each with game-changing abilities. Lead the charge, ambush enemies or aid allies as one of Overwatch’s distinct heroes. Join the battle across dozens of futuristic maps inspired by real-world locations and master unique game modes in the always-on, ever-evolving live game.

Members can look forward to playing the Steam version of both games from the cloud, with support for the Battle.net launcher coming soon.

Honkai Star Rail coming soon to GeForce NOW
The Astral Express is coming to GeForce NOW.

Expanding the library of hit free-to-play titles for members, Honkai: Star Rail from miHoYo will soon join Genshin Impact in the cloud. The space-fantasy role-playing game is set in a diverse universe filled with wonder, adventure and thrills. Plus, members can experience all the latest updates without worrying about download times.

Mainframe Industries’ Pax Dei is a highly anticipated social sandbox massively multiplayer online game inspired by legends of the medieval era. It’s planned to release on GeForce NOW when it launches for PC.

Exoprimal on GeForce NOW
Dinosaurs? Oh my.

Capcom is working with NVIDIA to bring more of its hit titles to the cloud, including Exoprimal, an online, team-based action game that pits humanity’s cutting-edge exosuit technology against history’s most ferocious beasts: dinosaurs. Look forward to streaming it from the cloud starting Thursday, Jan. 18.

Get ready to play these titles and more at high performance coming soon. Ultimate members will be able to stream at up to 4K resolution and 120 fps with support for NVIDIA DLSS and Reflex technology, and experience the action even on low-powered devices. Keep an eye out on GFN Thursdays for the latest on game release dates in the cloud.

New to Play Today

War Hospital on GeForce NOW
Patch the wounded, mend the broken, survive the storm.

What’s a GFN Thursday without more games? Here’s what’s coming to the GeForce NOW library this week:

  • War Hospital (New release Jan. 11, available on Steam)
  • Assassin’s Creed: Valhalla (Xbox, available for PC Game Pass)
  • Jected – Rivals (Steam)
  • RAILGRADE (Steam)
  • Survivalist: Invisible Strain (Steam)
  • The Talos Principle 2 (Epic Games Store)
  • Turbo Golf Racing (Xbox, available for PC Game Pass)
  • TUNIC (Xbox, available for PC Game Pass)
  • Witch It (Steam)
  • Zombie Army 4: Dead War (Xbox, available for PC Game Pass)

Learn more about activating and playing Ubisoft games from PC Game Pass on GeForce NOW.

What are you playing this weekend? Let us know on X or in the comments below.

Read More

Build an Amazon SageMaker Model Registry approval and promotion workflow with human intervention

Build an Amazon SageMaker Model Registry approval and promotion workflow with human intervention

This post is co-written with Jayadeep Pabbisetty, Sr. Specialist Data Engineering at Merck, and Prabakaran Mathaiyan, Sr. ML Engineer at Tiger Analytics.

The development lifecycle for large machine learning (ML) models requires a scalable model release process similar to that of software development. Model developers often work together in developing ML models and require a robust MLOps platform in which to work. A scalable MLOps platform needs to include a process for handling the workflow of ML model registry, approval, and promotion to the next environment level (development, test, UAT, or production).

A model developer typically starts to work in an individual ML development environment within Amazon SageMaker. When a model is trained and ready to be used, it needs to be approved after being registered in the Amazon SageMaker Model Registry. In this post, we discuss how the AWS AI/ML team collaborated with the Merck Human Health IT MLOps team to build a solution that uses an automated workflow for ML model approval and promotion with human intervention in the middle.

Overview of solution

This post focuses on a workflow solution that the ML model development lifecycle can use between the training pipeline and inferencing pipeline. The solution provides a scalable workflow for MLOps in supporting the ML model approval and promotion process with human intervention. An ML model registered by a data scientist needs an approver to review and approve before it is used for an inference pipeline and in the next environment level (test, UAT, or production). The solution uses AWS Lambda, Amazon API Gateway, Amazon EventBridge, and SageMaker to automate the workflow with human approval intervention in the middle. The following architecture diagram shows the overall system design, the AWS services used, and the workflow for approving and promoting ML models with human intervention from development to production.

Model approver architecture

The workflow includes the following steps:

  1. The training pipeline develops and registers a model in the SageMaker model registry. At this point, the model status is PendingManualApproval.
  2. EventBridge monitors status change events to automatically take action with simple rules (see the rule sketch after this list).
  3. The EventBridge model registration event rule invokes a Lambda function that constructs an email with a link to approve or reject the registered model.
  4. The approver gets an email with the link to review and approve or reject the model.
  5. The approver approves the model by following the link in the email to an API Gateway endpoint.
  6. API Gateway invokes a Lambda function to initiate model updates.
  7. The model registry is updated for the model status (Approved for the dev environment, but PendingManualApproval for test, UAT, and production).
  8. The model details are stored in AWS Parameter Store, a capability of AWS Systems Manager, including the model version, approved target environment, and model package.
  9. The inference pipeline fetches the model approved for the target environment from Parameter Store.
  10. The post-inference notification Lambda function collects batch inference metrics and sends an email to the approver to promote the model to the next environment.
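
As a concrete illustration of step 2, the following is a minimal boto3 sketch of an EventBridge rule that matches SageMaker model package state-change events and routes them to the notification Lambda function. The rule name and Lambda ARN are placeholders, and the event fields should be verified against the events emitted in your account.

```python
import json

import boto3

events = boto3.client("events")

# Match model packages that enter PendingManualApproval in the SageMaker Model Registry.
events.put_rule(
    Name="model-registry-pending-approval",  # illustrative rule name
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {"ModelApprovalStatus": ["PendingManualApproval"]},
    }),
    State="ENABLED",
)

# Send matching events to the Lambda function that emails the approver (placeholder ARN).
# The Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it.
events.put_targets(
    Rule="model-registry-pending-approval",
    Targets=[{
        "Id": "notify-approver",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:notify-approver",
    }],
)
```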

Prerequisites

The workflow in this post assumes the environment for the training pipeline is set up in SageMaker, along with other resources. The input to the training pipeline is the features dataset. The feature generation details are not included in this post, which focuses on the registry, approval, and promotion of ML models after they are trained. The model is registered in the model registry and governed by a monitoring framework in Amazon SageMaker Model Monitor that detects drift and triggers retraining when drift occurs.

Workflow details

The approval workflow starts with a model developed from a training pipeline. When data scientists develop a model, they register it to the SageMaker Model Registry with the model status of PendingManualApproval. EventBridge monitors SageMaker for the model registration event and triggers an event rule that invokes a Lambda function. The Lambda function dynamically constructs an email for an approval of the model with a link to an API Gateway endpoint to another Lambda function. When the approver follows the link to approve the model, API Gateway forwards the approval action to the Lambda function, which updates the SageMaker Model Registry and the model attributes in Parameter Store. The approver must be authenticated and part of the approver group managed by Active Directory. The initial approval marks the model as Approved for dev but PendingManualApproval for test, UAT, and production. The model attributes saved in Parameter Store include the model version, model package, and approved target environment.
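
The post does not include the Lambda code itself, but the following is a minimal sketch of what the approval handler behind the API Gateway link might do, assuming the model package ARN and target environment arrive as query-string parameters. The parameter names and the Parameter Store path are illustrative.

```python
import json

import boto3

sagemaker = boto3.client("sagemaker")
ssm = boto3.client("ssm")


def lambda_handler(event, context):
    # The email link encodes the model package ARN and target environment as
    # query-string parameters (illustrative names).
    params = event.get("queryStringParameters") or {}
    model_package_arn = params["model_package_arn"]
    target_env = params.get("environment", "dev")

    # Mark the model package as Approved in the SageMaker Model Registry.
    sagemaker.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
        ApprovalDescription=f"Approved for {target_env} via email link",
    )

    # Record the approved model details in Parameter Store for the inference pipeline.
    ssm.put_parameter(
        Name=f"/mlops/models/demo-model/{target_env}",  # hypothetical parameter path
        Value=json.dumps({
            "model_package_arn": model_package_arn,
            "approved_environment": target_env,
        }),
        Type="String",
        Overwrite=True,
    )

    return {"statusCode": 200, "body": f"Model approved for {target_env}"}
```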

When an inference pipeline needs to fetch a model, it checks Parameter Store for the latest model version approved for the target environment and gets the inference details. When the inference pipeline is complete, a post-inference notification email is sent to a stakeholder requesting an approval to promote the model to the next environment level. The email has the details about the model and metrics as well as an approval link to an API Gateway endpoint for a Lambda function that updates the model attributes.
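
Continuing the sketch above, an inference pipeline could look up the approved model with a single Parameter Store read; the parameter path and key names are the same illustrative ones used in the approval handler.

```python
import json

import boto3

ssm = boto3.client("ssm")


def get_approved_model(target_env: str = "test") -> dict:
    """Return the details of the model approved for the given environment."""
    response = ssm.get_parameter(Name=f"/mlops/models/demo-model/{target_env}")
    return json.loads(response["Parameter"]["Value"])


model_details = get_approved_model("test")
print(model_details["model_package_arn"])
```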

The following is the sequence of events and implementation steps for the ML model approval/promotion workflow from model creation to production. The model is promoted from development to test, UAT, and production environments with an explicit human approval in each step.

We start with the training pipeline, which is ready for model development. The model version starts as 0 in SageMaker Model Registry.

  1. The SageMaker training pipeline develops and registers a model in SageMaker Model Registry. Model version 1 is registered and starts with PendingManualApproval status. The Model Registry metadata has four custom fields for the environments: dev, test, uat, and prod.
  2. EventBridge monitors the SageMaker Model Registry for the status change to automatically take action with simple rules.
  3. The model registration event rule invokes a Lambda function that constructs an email with the link to approve or reject the registered model (see the email sketch after this list).
  4. The approver gets an email with the link to review and approve (or reject) the model.
  5. The approver approves the model by following the link to the API Gateway endpoint in the email.
  6. API Gateway invokes the Lambda function to initiate model updates.
  7. The SageMaker Model Registry is updated with the model status.
  8. The model detail information is stored in Parameter Store, including the model version, approved target environment, and model package.
  9. The inference pipeline fetches the model approved for the target environment from Parameter Store.
  10. The post-inference notification Lambda function collects batch inference metrics and sends an email to the approver to promote the model to the next environment.
  11. The approver approves the model promotion to the next level by following the link to the API Gateway endpoint, which triggers the Lambda function to update the SageMaker Model Registry and Parameter Store.
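
The following sketch shows one way the notification Lambda function in step 3 could assemble and send the approval email. The post does not name the email service, so Amazon SES is used here, and the sender, recipient, API Gateway URL, and model package ARN are placeholders.

```python
import boto3

ses = boto3.client("ses")

# Hypothetical approval link; in practice the endpoint URL and model package ARN
# would come from Lambda environment variables and the triggering event.
approval_url = (
    "https://example.execute-api.us-east-1.amazonaws.com/prod/approve"
    "?model_package_arn=arn:aws:sagemaker:us-east-1:111122223333:model-package/demo-model/1"
    "&environment=dev"
)

ses.send_email(
    Source="mlops-notifications@example.com",
    Destination={"ToAddresses": ["model-approver@example.com"]},
    Message={
        "Subject": {"Data": "Model version 1 pending approval"},
        "Body": {
            "Html": {
                "Data": (
                    "<p>A new model version is awaiting review.</p>"
                    f'<p><a href="{approval_url}">Approve for dev</a></p>'
                )
            }
        },
    },
)
```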

The complete history of the model versioning and approval is saved for review in Parameter Store.

Conclusion

The large ML model development lifecycle requires a scalable ML model approval process. In this post, we shared an implementation of an ML model registry, approval, and promotion workflow with human intervention using SageMaker Model Registry, EventBridge, API Gateway, and Lambda. If you are considering a scalable ML model development process for your MLOps platform, you can follow the steps in this post to implement a similar workflow.


About the authors

Tom Kim is a Senior Solution Architect at AWS, where he helps his customers achieve their business objectives by developing solutions on AWS. He has extensive experience in enterprise systems architecture and operations across several industries, particularly in healthcare and life sciences. Tom is always learning new technologies that lead to desired business outcomes for customers, such as AI/ML, generative AI, and data analytics. He also enjoys traveling to new places and playing new golf courses whenever he can find time.

Shamika Ariyawansa, serving as a Senior AI/ML Solutions Architect in the Healthcare and Life Sciences division at Amazon Web Services (AWS), specializes in Generative AI, with a focus on Large Language Model (LLM) training, inference optimizations, and MLOps (Machine Learning Operations). He guides customers in embedding advanced Generative AI into their projects, ensuring robust training processes, efficient inference mechanisms, and streamlined MLOps practices for effective and scalable AI solutions. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.

Jayadeep Pabbisetty is a Senior ML/Data Engineer at Merck, where he designs and develops ETL and MLOps solutions to unlock data science and analytics for the business. He is always enthusiastic about learning new technologies, exploring new avenues, and acquiring the skills necessary to evolve with the ever-changing IT industry. In his spare time, he follows his passion for sports and likes to travel and explore new places.

Prabakaran Mathaiyan is a Senior Machine Learning Engineer at Tiger Analytics LLC, where he helps his customers achieve their business objectives by providing solutions for model building, training, validation, monitoring, CI/CD, and improvement of machine learning solutions on AWS. Prabakaran is always learning new technologies that lead to desired business outcomes for customers, such as AI/ML, generative AI, GPT, and LLMs. He also enjoys playing cricket whenever he can find time.

Read More

Advancing transparency: Updates on responsible AI research

Advancing transparency: Updates on responsible AI research

Editor’s note: All papers referenced here represent collaborations throughout Microsoft and across academia and industry that include authors who contribute to Aether, the Microsoft internal advisory body for AI ethics and effects in engineering and research.

A surge of generative AI models in the past year has fueled much discussion about the impact of artificial intelligence on human history. Advances in AI have indeed challenged thinking across industries, from considering how people will function in creative roles to effects in education, medicine, manufacturing, and more. Whether exploring impressive new capabilities of large language models (LLMs) such as GPT-4 or examining the spectrum of machine learning techniques already embedded in our daily lives, researchers agree on the importance of transparency. For society to appropriately benefit from this powerful technology, people must be given the means for understanding model behavior.  

Transparency is a foundational principle of responsible, human-centered AI and is the bedrock of accountability. AI systems have a wide range of stakeholders: AI practitioners need transparency for evaluating data and model architecture so they can identify, measure, and mitigate potential failures; people using AI, expert and novice, must be able to understand the capabilities and limitations of AI systems; people affected by AI-assisted decision-making should have insights for redress when necessary; and indirect stakeholders, such as residents of cities using smart technologies, need clarity about how AI deployment may affect them.

Providing transparency when working with staggeringly complex and often proprietary models must take different forms to meet the needs of people who work with either the model or the user interface. This article profiles a selection of recent efforts for advancing transparency and responsible AI (RAI) by researchers and engineers affiliated with Aether, the Microsoft advisory body for AI ethics and effects in engineering and research. This work includes investigating LLM capabilities and exploring strategies for unlocking specialized-domain competencies of these powerful models while urging transparency approaches for both AI system developers and the people using these systems. Researchers are also working toward improving identification, measurement, and mitigation of AI harms while sharing practical guidance such as for red teaming LLM applications and for privacy-preserving computation. The goal of these efforts is to move from empirical findings to advancing the practice of responsible AI.

Demo video: Toward user-centered algorithmic recourse

In this demo of GAM Coach, an example of an AI transparency approach, an interactive interface lets stakeholders in a loan allocation scenario understand how a model arrived at its prediction and which factors they can change to meet their goals.

Related papers

Identifying harms in LLMs and their applications

The sociotechnical nature of AI is readily apparent as product teams sprint to integrate the power and appeal of LLMs into conversational agents and productivity tools across domains. At the same time, recent accounts, such as a lawyer unwittingly submitting generative AI’s fictitious legal citations in a brief to the court or unsettling demonstrations of deepfakes, reveal the ample opportunity for misunderstanding these models’ capabilities and, worse yet, for deliberately misusing them.

Envisioning what could go wrong with an AI system that has not yet been deployed is the first step toward responsible AI. Addressing this challenge, researchers introduce AHA! (anticipating harms of AI), a human-AI collaboration for systematic impact assessment. This framework enables people to make judgments about the impact of potential deployment on stakeholders. It uses an LLM to generate vignettes, or fictional scenarios, that account for an ethical matrix of problematic AI behaviors or harms. Evaluation of this framework in a variety of decision-making contexts found it surfaced a broader range of potential harmful outcomes than either people or LLMs could singly envision.

AI practitioners can follow this planning guide to help them set up and manage red teaming for large language models (LLMs) and their applications. Based on firsthand experience of testing LLMs to identify potentially harmful outputs and plan for mitigation strategies, this guide provides tips for who should test, what to test, and how to test, plus pointers for recording the red-teaming data.

Responsible AI red teaming, or probing models and their applications to identify undesirable behaviors, is another method of harm identification. Microsoft has shared a practical guide for the RAI red teaming of LLMs and their applications, and automated tools for RAI red teaming are beginning to emerge. Although the vital task of impact assessment and testing for failures can be facilitated by LLMs helping with creative brainstorming, researchers emphasize that for AI to be human centered, such efforts should never be fully automated. To improve human-AI complementarity in red teaming, AdaTest++ builds on an existing tool that uses an LLM to generate test suggestions as it adapts to user feedback. The redesign offers greater human control for testing hypotheses, enabling editing and exploration of counterfactuals, and conducting in-depth testing across a broad diversity of topics.

Researchers invite AI practitioners working toward responsible AI to use and contribute to AdaTest++, which leverages human-AI complementarity to audit LLMs. Augmenting a preexisting tool, AdaTest++ offers prompt templates and introduces greater human control for testing hypotheses and exploring counterfactuals.

In AI privacy, researchers demonstrate how prompt-tuning can be used to infer private information from an email system that uses a language model to provide autocompleted replies. In sharing their red-teaming technique, they encourage privacy-enhancing efforts for applications using language models and take the stance that the transparency of publicly detailing a model’s vulnerabilities is an essential step toward adversarial robustness.

Identifying and exposing security vulnerabilities is a top concern, especially when these can seep into AI-generated code. The integration of LLMs for AI-assisted coding has reduced the entry barrier for novice programmers and increased productivity for veteran coders. But it is important to examine the reliability and safety of AI-assisted coding. Although static analysis tools can detect and remove insecure code suggestions caused by the adversarial manipulation of training data, researchers introduce two novel techniques for poisoning code-suggestion models that bypass static analysis mitigation: Covert inserts malicious code in docstrings and comments, while TrojanPuzzle tricks the transformer-based model into substituting tokens, giving the programmer harmless-looking but insecure code. Exposing these vulnerabilities, researchers call for new methods for training code-suggestion models and for processes to ensure code suggestions are secure before programmers ever see them.

Related papers

Transparency for improving measurement and its validity

We can’t begin to mitigate the possibility of AI failures without first identifying and then measuring the potential harms of a model’s outputs, transparently examining who may or may not benefit, what could go wrong, and to what extent.

A framework for automating the measurement of harms at speed and scale has two LLMs simulate product- or end-user interaction and evaluate outputs for potential harms, using resources created by relevant sociotechnical-domain experts. As researchers stress, the validity and reliability of such evaluation rely strictly on the quality of these resources—the templates and parameters for simulating interactions, the definition of harms, and their annotation guidelines. In other words, sociotechnical-domain expertise is indispensable.

Measurement validity—ensuring metrics align with measurement goals—is central to the practice of responsible AI. Model accuracy in and of itself is not an adequate metric for assessing sociotechnical systems: for example, in the context of productivity applications, capturing what is valuable to an individual using an AI system should also be taken into account. How do we identify metrics appropriate for context-dependent models that are deployed across domains to serve a variety of populations and purposes? Teams need methods to address measurement and mitigation for every deployment scenario.  

Language models illustrate the maxim “context is everything.” When it comes to measuring and mitigating fairness-related harms that are context-dependent in AI-generated text, there’s generally not enough granularity in dataset labeling. Lumping harms under generalized labels like “toxic” or “hate speech” doesn’t capture the detail needed for measuring and mitigating harms specific to various populations. FairPrism is a new dataset for detecting gender- and sexuality-related harms that makes a case for greater granularity in human annotation and transparency in dataset documentation, including identifying groups of people that may be targeted. Researchers situate FairPrism as “a recipe” for creating better-detailed datasets for measuring and mitigating AI harms and demonstrate how the new dataset’s 5,000 examples of English text can probe for fairness-related harms to a specific group.

AI practitioners can request access to the FairPrism dataset for detecting gender- and sexuality-related harms in AI-generated text. FairPrism makes a case for greater granularity in human annotation.

Similarly, researchers deepen the conversation around representational harms in automated image-tagging systems, voicing the need for improved transparency and specificity in taxonomies of harms for more precision in measurement and mitigation. Image tagging is generally intended for human consumption, as in alt text or online image search, differentiating it from object recognition. Image tagging can impute fairness-related harms of reifying social groups as well as stereotyping, demeaning, or erasure. Researchers identify these four specific representational harms and map them to computational measurement approaches in image tagging. They call out the benefits of increased granularity but note there is no silver bullet: efforts to mitigate by adding or removing particular tags to avoid harms may in fact introduce or exacerbate these representational harms.

Related papers

Transparency and UX-based mitigations: What designers need and end users want

Prioritizing what people value and designing for the optimal user experience (UX) is a goal of human-centered, responsible AI. Unfortunately, UX design has often been viewed as a secondary consideration in engineering organizations. But because AI is a sociotechnical discipline, where technical solutions must converge with societal perspectives and social science theory, AI not only brings UX expertise to the foreground but also positions designers as potential innovators, well situated to mitigate some harms and model failures with UX interventions. To realize this, UX designers need transparency—visibility into how models work—so they can form “designerly understanding of AI” to help them ideate effectively. A study of 23 UX designers completing a hands-on design task illustrates their need for better support, including model documentation that’s easier to understand and interactive tools to help them anticipate model failures, envision mitigations, and explore new uses of AI.   

People with varying levels of AI experience or subject-matter expertise are suddenly harnessing commercially available generative AI copilots for productivity gains and decision-making support across domains. But generative AI can make mistakes, and the impact of these failures can differ greatly depending on the use case: for example, poor performance in a creative-writing task has a very different impact than an error in a health care recommendation. As the stakes rise, so does the call for mitigating these failures: people need tools and mechanisms that help them audit AI outputs. UX interventions are well suited for mitigating this type of harm. To begin, researchers propose a taxonomy of needs that co-auditing systems should address when helping people double-check generative AI model responses. Basic considerations should include how easy it is for individuals to detect an error, what their skill level is, and how costly or consequential an error may be in a given scenario. A prototype Excel add-in illustrates these considerations, helping the nonprogrammer inspect the accuracy of LLM-generated code.

There are productivity dividends to paying attention to people’s need and desire for transparency. A central problem people encounter with language models is crafting prompts that lead to useful output. Advancing a solution for this in LLM-based code generation, researchers demonstrate an interface that gives people visibility into how the model maps their natural language query to system action. This transparency approach helps people adapt their mental model of the code generator’s capabilities and modify their queries accordingly. Findings of the user study, which included participants with low expertise in coding, showed this transparency approach promoted user confidence and trust while facilitating explanation and debugging.  Similarly, human-centered efforts such as modeling the timing of when a programmer finds it most valuable to receive a code suggestion emphasize the primacy of end users’ needs when addressing productivity. 

Demo video: “What It Wants Me To Say”

This transparency approach provides nonexpert programmers with an interface that gives visibility into how a language model maps their natural language query to system action, helping them adapt their mental model and modify their prompts.

For experienced coders to be confident and benefit from AI-assisted code completion, they need to be able to easily spot and correct errors and security vulnerabilities. In the first empirical study of the effectiveness of token highlighting for communicating uncertainty of an AI prediction, researchers examine a UX technique that draws programmers’ attention in a way similar to a spell checker. Highlighting tokens that had the highest predicted likelihood of being edited resulted in programmers being able to complete tasks faster with better-targeted edits. Participants also desired more transparency in the form of explanations to help with diagnosing uncertainty and suggested interaction designs that would improve their efficiency and give them control.

Communicating the uncertainty of AI predictions in a way that is meaningful is a design challenge in every deployment context. How to provide transparency via explanations remains a conundrum—studies have shown that the mere presence of explanations can increase overreliance on AI. Designing UX that helps people meet their decision-making goals with confidence requires understanding their perceptions of how a given system works. But little is actually known about the processes decision-makers go through when debating whether to rely on an AI system’s output versus their own intuition. Conducting a think-aloud study for insights into the role of human intuition in reliance on AI predictions in AI-assisted decision making, researchers identified three types of intuition people use in deciding to override the system. While performing brief tasks of income prediction and biography classification with AI support, participants expressed “gut feel” about the decision outcome; how specific data characteristics, or features, may impact explanations; and the limitations of the AI system. Findings suggested what the authors call “intuition-driven pathways” to understanding the effect of different types of explanations on people’s decision to override AI. Results showed that example-based explanations, which were textual narratives, better aligned with people’s intuition and reasoning about a prediction than feature-based explanations, which conveyed information with bar charts and other visual tools. At the same time, participants echoed the familiar desire for help with understanding AI systems’ limitations. Suggestions included interface designs to better support transparency and user understanding—for example, interactive explanations that enable people to change attributes to explore the effect on the model’s prediction.

Accommodating varying levels of user expertise is a growing AI UX design challenge across domains and applications. For example, in business, people with limited knowledge of AI or statistics must increasingly engage AI visual-analytic systems to create reports and inform recommendations. While research seeks to address gaps in knowledge for improving user interaction with AI, some practical and evidence-driven tools are already available. A case study of business experts with varying levels of AI proficiency demonstrates the effectiveness of applying existing guidelines for human-AI interaction for transparency cues. Visual explanations improved participants’ ability to use a visual AI system to make recommendations. At the same time, researchers noted a high level of trust in outputs regardless of participants’ understanding of AI, illustrating the complexity of AI transparency for appropriate trust.

Related papers

Transparency for responsible AI is a sociotechnical endeavor. From the technology perspective, organizations developing LLMs and their applications need to consider how to appropriately characterize and communicate capabilities, limitations, performance, and risks, and at what level transparency approaches should take place. To meet the needs of stakeholders—the wide range of people developing, deploying, using, or being impacted by LLMs—transparency considerations should include stakeholder goals; improving people’s mental models of LLMs and supporting appropriate trust; and gaining insight into how transparency can contribute to better control mechanisms for LLMs. (Adapted from “AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap.”)

Transparency: A means for accountability in a new era of AI

This research compilation highlights that transparency is fundamental to multiple components of responsible AI. It requires, among other things, the understanding and communication of datasets and their composition and of model behaviors, capabilities, and limitations. Transparency also touches every aspect of the responsible AI harm mitigation framework: identify, measure, mitigate. Furthermore, this research establishes a primary role for UX in mitigating harms as AI integrates into the apps people rely on every day in their personal and professional lives.     

As authors of a research roadmap for transparency in the age of LLMs outline, these complex models’ massive datasets, nondeterministic outputs, adaptability, and rapid evolutionary pace present new challenges for deploying AI responsibly. There’s much work to be done to improve transparency for stakeholders of highly context-dependent AI systems—from improving how we publish the goals and results of evaluations when it comes to model reporting to providing appropriate explanations, communicating model uncertainty, and designing UX-based mitigations.  

Prioritizing transparency in the design of our AI systems is to acknowledge the primacy of people, whom the technology is meant to serve. Transparency plays a critical role in respecting human agency and expertise in this new frontier of human-AI collaboration and, ultimately, can hold us accountable for the world we are shaping.

Keynote video: People and Machines: Pathways to Deeper Human-AI Synergy

In his KDD 2023 keynote, Microsoft Chief Scientific Officer Eric Horvitz presents an overview of the power of LLM capabilities and the potential for enriching human-AI complementarity.

Related papers


Read More

Research Focus: Week of January 8, 2024

Research Focus: Week of January 8, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Mixture-of-Linear-Experts for Long-term Time Series Forecasting 

Long-term time series forecasting (LTSF), which aims to predict future values of a series given past values, is an important problem in the machine learning community. It’s useful in areas like weather modeling, traffic flow prediction, and financial forecasting.  

The current state of the art on LTSF is attained in some cases by linear-centric models. However, real-world time series are usually nonstationary. For example, traffic patterns change on different days of the week. The inherent simplicity of linear-centric models makes them unable to capture these patterns. In a recent paper: Mixture-of-Linear-Experts for Long-term Time Series Forecasting, researchers from Microsoft and external colleagues propose Mixture-of-Linear-Experts (MoLE) to address this problem. Instead of training a single model, MoLE trains multiple linear-centric models (i.e., experts) and a router model that weighs and mixes their outputs. While the entire framework is trained end-to-end, each expert learns to specialize in a specific temporal pattern, and the router model learns to compose the experts adaptively. Experiments show that MoLE significantly reduces forecasting error of linear-centric models, and MoLE outperforms state-of-the-art transformer-based approaches in 68% of settings.
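
As a rough illustration of the idea rather than the authors' implementation, the following PyTorch sketch mixes several linear forecasting heads with a softmax router; the layer sizes, window length, and horizon are illustrative.

```python
import torch
import torch.nn as nn


class MoLESketch(nn.Module):
    """Several linear forecasting experts mixed by a learned router (illustrative)."""

    def __init__(self, input_len: int, horizon: int, num_experts: int = 4):
        super().__init__()
        # Each expert is a simple linear map from the past window to the forecast horizon.
        self.experts = nn.ModuleList(
            [nn.Linear(input_len, horizon) for _ in range(num_experts)]
        )
        # The router produces a weight per expert from the same input window.
        self.router = nn.Sequential(nn.Linear(input_len, num_experts), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_len) for a single series or channel
        weights = self.router(x)                                    # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, horizon)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # (batch, horizon)


model = MoLESketch(input_len=336, horizon=96)
forecast = model(torch.randn(8, 336))  # eight input windows of length 336
```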

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

End-to-end (E2E) models are the dominant model structure in automatic speech recognition (ASR) and speech translation (ST). This has led to efforts to develop a unified E2E model for multilingual ASR and multilingual ST tasks. Streaming ASR and ST tasks have extensively utilized neural transducers in the past.  

In a recent paper: A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability, researchers from Microsoft present a streaming multilingual speech model, SM², which employs a single neural transducer model for transcribing or translating multiple languages into target languages. SM² is trained on weakly supervised data created by converting speech recognition transcriptions with a machine translation model. Leveraging 351,000 hours of speech training data from 25 languages, SM² achieves impressive ST performance. Notably, no human-labeled ST data was employed during training; the weakly supervised ST data was generated by converting 351,000 hours of anonymized ASR data from 25 languages using a text-based machine translation service.

The researchers also demonstrate the truly zero-shot capability of SM² when expanding to new target languages, generating high-quality zero-shot ST translations for {source-speech, target-text} pairs that were not seen during training.

KBFormer: A Diffusion Model for Structured Entity Completion

Deep generative models include large language models (LLMs) for text, plus models for other modalities, such as for vision and audio. In a recent paper: KBFormer: A Diffusion Model for Structured Entity Completion, researchers from Microsoft and external colleagues explore generative modeling of structured entities with heterogeneous properties, such as numerical, categorical, string, and composite. This includes entries in rich knowledge bases (KBs), items in product catalogs or scientific catalogs, and ontologies like the periodic table of elements and the various properties of isotopes. 

Their approach handles such heterogeneous data through a mixed continuous-discrete diffusion process over the properties, using a flexible framework that can model entities with arbitrary hierarchical properties. Using this approach, the researchers obtain state-of-the-art performance on a majority of cases across 15 datasets. In addition, experiments with a device KB and a nuclear physics dataset demonstrate the model’s ability to learn representations useful for entity completion in diverse settings. This has many downstream use cases, including modeling numerical properties with high accuracy – critical for science applications, which also benefit from the model’s inherent probabilistic nature.

A Framework for Exploring the Consequences of AI-Mediated Enterprise Knowledge Access and Identifying Risks to Workers

People are increasingly interacting with, and being affected by, the deployment of AI systems in the workplace. This is a pressing matter for system designers, policy-makers, and workers themselves, which researchers from Microsoft address in a recent paper: A Framework for Exploring the Consequences of AI-Mediated Enterprise Knowledge Access and Identifying Risks to Workers.  

Organizations generate huge amounts of information that raise challenges associated with the maintenance, dissemination, and discovery of organizational knowledge. Recent developments in AI, notably large language models (LLMs), present a shift in what is possible in this domain. Recent advances could enable more extensive mining, knowledge synthesis, and natural language interaction in relation to knowledge.  

The researchers propose the Consequence-Mechanism-Risk Framework to identify risks to workers associated with deploying AI-mediated enterprise knowledge access systems. The goal is to support those involved in the design and/or deployment of such systems to identify the risks they introduce, the specific system mechanisms that introduce those risks, and the actionable levers to reduce those risks.

Large Search Model: Redefining Search Stack in the Era of LLMs

Modern search engines are built on a stack of different components, including query understanding, retrieval, multi-stage ranking, and question answering, among others. These components are often optimized and deployed independently. In a recent paper: Large Search Model: Redefining Search Stack in the Era of LLMs, researchers from Microsoft introduce a novel conceptual framework called large search model, which redefines the conventional search stack by unifying search tasks with one large language model (LLM). All tasks are formulated as autoregressive text generation problems, allowing for the customization of tasks through the use of natural language prompts. This proposed framework capitalizes on the strong language understanding and reasoning capabilities of LLMs, offering the potential to enhance search result quality while simplifying the cumbersome search stack. To substantiate the feasibility of this framework, the researchers present a series of proof-of-concept experiments and discuss the potential challenges associated with implementing this approach within real-world search systems.

Read More