Catalog, query, and search audio programs with Amazon Transcribe and Knowledge Bases for Amazon Bedrock

Catalog, query, and search audio programs with Amazon Transcribe and Knowledge Bases for Amazon Bedrock

Information retrieval systems have powered the information age through their ability to crawl and sift through massive amounts of data and quickly return accurate and relevant results. These systems, such as search engines and databases, typically work by indexing on keywords and fields contained in data files.

However, much of our data in the digital age also comes in non-text format, such as audio and video files. Finding relevant content usually requires searching through text-based metadata such as timestamps, which need to be manually added to these files. This can be hard to scale as the volume of unstructured audio and video files continues to grow.

Fortunately, the rise of artificial intelligence (AI) solutions that can transcribe audio and provide semantic search capabilities now offer more efficient solutions for querying content from audio files at scale. Amazon Transcribe is an AWS AI service that makes it straightforward to convert speech to text. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this post, we show how Amazon Transcribe and Amazon Bedrock can streamline the process to catalog, query, and search through audio programs, using an example from the AWS re:Think podcast series.

Solution overview

The following diagram illustrates how you can use AWS services to deploy a solution for cataloging, querying, and searching through content stored in audio files.

Architecture Diagram of Amazon Bedrock and related AWS Services

In this solution, audio files stored in mp3 format are first uploaded to Amazon Simple Storage Service (Amazon S3) storage. Video files (such as mp4) that contain audio in supported languages can also be uploaded to Amazon S3 as part of this solution. Amazon Transcribe will then transcribe these files and store the entire transcript in JSON format as an object in Amazon S3.

To catalog these files, each JSON file in Amazon S3 should be tagged with the corresponding episode title. This allows us to later retrieve the episode title for each query result.

Next, we use Amazon Bedrock to create numerical representations of the content inside each file. These numerical representations are also called embeddings, and they’re stored as vectors inside a vector database that we can later query.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API. Included with Amazon Bedrock is Knowledge Bases for Amazon Bedrock. As a fully managed service, Knowledge Bases for Amazon Bedrock makes it straightforward to set up a Retrieval Augmented Generation (RAG) workflow.

With Knowledge Bases for Amazon Bedrock, we first set up a vector database on AWS. Knowledge Bases for Amazon Bedrock can then automatically split the data files stored in Amazon S3 into chunks and then create embeddings of each chunk using Amazon Titan on Amazon Bedrock. Amazon Titan is a family of high-performing FMs from Amazon. Included with Amazon Titan is Amazon Titan Text Embeddings, which we use to create the numerical representation of the text inside each chunk and store them in a vector database.

When a user queries the contents of the audio files through a generative AI application or AWS Lambda function, it makes an API call to Knowledge Bases for Amazon Bedrock. Knowledge Bases for Amazon Bedrock will then orchestrate a call to the vector database to perform a semantic search, which returns the most relevant results. Next, Knowledge Bases for Amazon Bedrock augments the user’s original query with these results to a prompt, which is sent to the large language model (LLM). The LLM will return results that are more accurate and relevant to the user query.

Let’s walk through an example of how you can catalog, query, and search through a library of audio files using these AWS AI services. For this post, we use episodes of the re:Think podcast series, which has over 20 episodes. Each episode is an audio program recorded in mp3 format. As we continue to add new episodes, we will want to use AI services to make the task of querying and searching for specific content more scalable without the need to manually add metadata for each episode.

Prerequisites

In addition to having access to AWS services through the AWS Management Console, you need a few other resources to deploy this solution.

First, you need a library of audio files to catalog, query, and search. For this post, we use episodes of the AWS re:Think podcast series.

To make API calls to Amazon Bedrock from our generative AI application, we use Python version 3.11.4 and the AWS SDK for Python (Boto3).

Transcribe audio files

The first task is to transcribe each mp3 file using Amazon Transcribe. For instructions on transcribing with the AWS Management Console or AWS CLI, refer to the Amazon Transcribe Developer guide. Amazon Transcribe can create a transcript for each episode and store it as an S3 object in JSON format.

Catalog audio files using tagging

To catalog each episode, we tag the S3 object for each episode with the corresponding episode title. For instructions on tagging objects in S3, refer to the Amazon Simple Storage Service User Guide. For example, for the S3 object AI-Accelerators.json, we tag it with key = “title” and value = “Episode 20: AI Accelerators in the Cloud.”

Edit Tags in S3

The title is the only metadata we need to manually add for each audio file. There is no need to manually add timestamps for each chapter or section in order to later search for specific content.

Set up a vector database using Knowledge Bases for Amazon Bedrock

Next, we set up our fully managed RAG workflow using Knowledge Bases for Amazon Bedrock. For instructions on creating a knowledge base, refer to the Amazon Bedrock User Guide. We begin by specifying a data source. In our case, we choose the S3 bucket location where our transcripts in JSON format are stored.

Configure data source for Knowledge Base

Next, we select an embedding model. The embedding model will convert each chunk of our transcript into embeddings. Embeddings are numbers, and the meaning of each embedding depends on the model. In our example, we select Titan Text Embeddings v2 with a dimension size of 1024.

Select embeddings model and configure vector store for Knowledge Base

The embeddings are stored as vectors in a vector database. You can either specify an existing vector database you have already created or have Knowledge Bases for Amazon Bedrock create one for you. For our example, we have Knowledge Bases for Amazon Bedrock create a vector database using Amazon OpenSearch Serverless.

Create a new vectore store

Before you can query the vector database, you must first sync it with the data source. During each sync operation, Knowledge Bases for Amazon Bedrock will split the data source into chunks and then use the selected embedding model to embed each chunk as a vector. Knowledge Bases for Amazon Bedrock will then store these vectors in the vector database.

The sync operation as well as other Amazon Bedrock operations described so far can be performed either using the console or API calls.

Query the audio files

Now we’re ready to query and search for specific content from our library of podcast episodes. In episode 20, titled “AI Accelerators in the Cloud,” our guest Matthew McClean, a senior manager from AWS’s Annapurna team, shared why AWS decided to buy Annapurna Labs in 2015. For our first query, we ask, “Why did AWS acquire Annapurna Labs?”

We entered this query into Knowledge Bases for Amazon Bedrock using Anthropic Claude and got the following response:

“AWS acquired Annapurna Labs in 2015 because Annapurna was providing AWS with nitro cards that offloaded virtualization, security, networking and storage from EC2 instances to free up CPU resources.”

This is an exact quote from Matthew McClean in the podcast episode. You wouldn’t get this quote if you had entered the same prompt into other publicly available generative AI chatbots because they don’t have the vector database with embeddings of the podcast transcript to provide more relevant context.

Retrieve an episode title

Now let’s suppose that in addition to getting more relevant responses, we also want to retrieve the correct podcast episode title that was relevant to this query from our catalog of podcast episodes.

To retrieve the episode title, we first use the most relevant data chunk from the query. Whenever Knowledge Bases for Amazon Bedrock responds to a query, it also provides one or more chunks of data that it retrieved from the vector database that were most relevant to the query in order of relevance. We can take the first chunk that was returned. These chunks are returned as JSON documents. Nested inside the JSON is the S3 location of the transcript object. In our example, the S3 location is s3://rethinkpodcast/text/transcripts/AI-Accelerators.json.

The first words in the chunk text are: “Yeah, sure. So maybe I can start with the history of Annapurna…”

Because we have already tagged this transcript object in Amazon S3 with the episode title, we can retrieve the title by retrieving the value of the tag where key = “title”. In this case, the title is “Episode 20: AI Accelerators in the Cloud.”

Search the start time

What if we also want to search and find the start time inside the episode where the relevant content begins? We want to do so without having to manually read through the transcript or listen to the episode from the beginning, and without manually adding timestamps for every chapter.

We can find the start time much faster by having our generative AI application make a few more API calls. We start by treating the chunk text as a substring of the entire transcript. We then search for the start time of the first word in the chunk text.

In our example, the first words returned were “Yeah, sure. So maybe I can start with the history of Annapurna…” We now need to search the entire transcript for the start time of the word “Yeah.”

Amazon Transcribe outputs the start time of every word in the transcript. However, any word can appear more than once. The word “Yeah” occurs 28 times in the transcript, and each occurrence has its own start time. So how do we determine the correct start time for “Yeah” in our example?

There are multiple approaches an application developer can use to find the correct start time. For our example, we use the Python string find() method to find the position of the chunk text within the entire transcript.

For the chunk text that begins with “Yeah, sure. So maybe I can start with the history of Annapurna…” the find() method returned the position as 2047. If we treat the transcript as one long text string, the chunk “Yeah, sure. So maybe…” starts at character position 2047.

Finding the start time now becomes a matter of counting the character position of each word in the transcript and using it to look up the correct start time from the transcript file generated by Amazon Transcribe. This may be tedious for a person to do manually, but trivial for a computer.

In our example Python code, we loop through an array that contains the start time for each token while counting the number of the character position that each token starts at. Because we’re looping through the tokens, we can build a new array that stores the start time for each character position.

In this example query, the start time for the word “Yeah” at position 2047 is 160 seconds, or 2 minutes and 40 seconds into the podcast. You can check the recording starting at 2 minutes 40 seconds.

Clean up

This solution incurs charges based on the services you use:

  • Amazon Transcribe operates under a pay-as-you-go pricing model. For more details, see Amazon Transcribe Pricing.
  • Amazon Bedrock uses an on-demand quota, so you only pay for what you use. For more information, refer to Amazon Bedrock pricing.
  • With OpenSearch Serverless, you only pay for the resources consumed by your workload.
  • If you’re using Knowledge Bases for Amazon Bedrock with other vector databases besides OpenSearch Serverless, you may continue to incur charges even when not running any queries. It is recommended you delete your knowledge base and its associated vector store along with audio files stored in Amazon S3 to avoid unnecessary costs when you’re done testing this solution.

Conclusion

Cataloging, querying, and searching through large volumes of audio files can be difficult to scale. In this post, we showed how Amazon Transcribe and Knowledge Bases for Amazon Bedrock can help automate and make the process of retrieving relevant information from audio files more scalable.

You can begin transcribing your own library of audio files with Amazon Transcribe. To learn more on how Knowledge Bases for Amazon Bedrock can then orchestrate a RAG workflow for your transcripts with vector stores, refer to Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock.

With the help of these AI services, we can now expand the frontiers of our knowledge bases.


About the Author

Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor’s degree in Mechanical Engineering from Princeton University.

Read More

GENEVA uses large language models for interactive game narrative design

GENEVA uses large language models for interactive game narrative design

This paper was presented at the IEEE 2024 Conference on Games (opens in new tab) (IEEE CoG 2024), the leading forum on innovation in and through games.

IEEE 2024 Conference on Games recap blog

Mastering the art of storytelling, a highly valued skill across films, novels, games, and more, requires creating rich narratives with compelling plots and characters. In recent years, the rise of AI has prompted inquiries into whether large language models (LLMs) can effectively generate and sustain detailed, coherent storylines that engage audiences. Consequentially, researchers have been actively exploring AI’s potential to support creative processes in video game development, where the growing demands of narrative design often surpass the capabilities of traditional tools. This investigation focuses on AI’s capacity for innovation in storytelling and the necessary human interactions to drive such advances.

In this context, we introduce “GENEVA: GENErating and Visualizing branching narratives using LLMs (opens in new tab),” presented at IEEE CoG 2024. This graph-based narrative generation and visualization tool requires a high-level narrative description and constraints, such as the number of different starts, endings, and storylines, as well as context for grounding the narrative. GENEVA uses the generative capabilities of GPT-4 to create narratives with branching storylines and renders them in a graph format, allowing users to interactively explore different narrative paths through its web interface (opens in new tab).

Visualizing narratives using graphs

The narrative graph itself is a directed acyclic graph (DAG), where each node represents a narrative beat—an event that moves the plot forward—with directed edges (arrows) marking the progression through the story’s events. These beats are the fundamental units of the narrative structure, representing the exchange of action and reaction. A single path from a start node to an end node outlines a unique storyline, and the graph illustrates the various potential storylines based on the same overarching narrative. 

The generation and visualization of these narrative graphs are accomplished using GPT-4 in a two-step process. First, the model generates the branching storylines from the given description and constraints. Second, it produces code to render these narratives in a visually comprehensible graph format.

We detail this methodology in our paper, through a case study where we used GENEVA to construct narrative graphs for four well-known stories—Dracula, Frankenstein, Jack and the Beanstalk, and Little Red Riding Hood. Each was set in one of four distinct worlds: the game of Minecraft, the 21st century, ancient Rome, and the quantum realm. Figure 1 shows a narrative graph of Frankenstein set in the 21st century, and Figure 2 shows the storylines generated for this story.

Figure 1. A picture of a screenshot of the online interface of GENEVA. The screenshot has the title “Visualizing Generated Narratives”. Below the title are four dropdown menus, each for stories, number of starts, number of ends, number of plots and contexts. The values selected for the respective options are Frankenstein story with 1 start, 2 endings, 4 plots and set in the 21st century context. Besides that, there are two buttons, one that says, “show graph” and another that says, “show details”. Below these menu options, is a large graph with nodes and edges. The one orange node on the left is annotated as the start node and the two orange nodes on the right are annotated as the end nodes. The rest of the nodes are blue in color and each of them is annotated with a short phrase of about 3 to 4 words.
Figure 1: A narrative graph for the novel, Frankenstein, grounded in the 21st century. Additional constraints on the graph include one start, two endings, and four storylines.
Figure 2. A picture of a screenshot of the online interface of GENEVA. The screenshot has the title “Visualizing Generated Narratives”. Below the title are four dropdown menus, each for stories, number of starts, number of ends, number of plots and contexts. The values selected for the respective options are Frankenstein story with 1 start, 2 endings, 4 plots and set in the 21st century context. Besides that, there are two buttons, one that says, “show graph” and another that says, “hide details”. Below these menu options is a large text area with three storylines. Each storyline consists of a sequence of beats. Each beat has a unique number and a sentence describing the beat.
Figure 2: A detailed view of the four different storylines in the narrative graph in Figure 1.

microsoft research podcast

What’s Your Story: Weishung Liu

Principal PM Manager Weishung Liu shares how a career delivering products and customer experiences aligns with her love of people and storytelling and how—despite efforts to defy the expectations that come with growing up in Silicon Valley—she landed in tech.


Assessing GENEVA’s narrative adaptations

In our assessment, we found that GENEVA performed better in specific narrative contexts. For example, in Frankenstein’s adaptation to the 21st century, the storylines included themes like creating life from DNA fragments and genetic engineering, maintaining relevance while preserving the original story’s essence. However, upon closer examination, we noted areas for improvement, such as the need for more variety and better grounding of the narrative. Generally, stories that are better known and more thoroughly documented tend to yield richer and more varied adaptations.

Implications and looking forward

GENEVA remains a prototype, serving as a tool for exploring the narrative capabilities of LLMs. As these models evolve, we anticipate corresponding advances in their narrative generation abilities. The ultimate goal in game design is to engage players with compelling interactive experiences. With the skilled input of experienced game designers, tools like GENEVA could increasingly contribute to creating engaging gameplay experiences through iterative refinement of narrative paths.

Our collaboration with Xbox and Inworld AI (opens in new tab) continues to advance the use of AI in game development, incorporating these developments into practical tools for creators. Discover more about this transformative technology by watching this video (opens in new tab).

The post GENEVA uses large language models for interactive game narrative design appeared first on Microsoft Research.

Read More

Players, creators, and AI collaborate to build and expand rich game narratives

Players, creators, and AI collaborate to build and expand rich game narratives

This paper was presented at the IEEE 2024 Conference on Games (opens in new tab) (IEEE CoG 2024), the leading forum on innovation in and through games.

Player-Driven Emergence in LLM-Driven Game Narrative,” presented at IEEE CoG 2024

In the fast-evolving landscape of video game development, crafting dialogues and narratives is a labor-intensive endeavor. Traditionally, creating these elements involved meticulous hand-coding, resulting in static interactions that limit player agency. However, the rise of large language models (LLMs) is introducing possibilities for richer, more dynamic narrative experiences and automating some of the more challenging aspects of game creation. Despite this advance, a key challenge with using LLMs for narrative design in games is that, without human intervention, they tend to repeat patterns.

We address this in our paper, “Player-Driven Emergence in LLM-Driven Game Narrative,” presented at IEEE CoG 2024, where we explore how LLMs can foster unique forms of creativity when players participate in the design process. Rather than replacing designers, LLMs can empower players with considerable freedom in their interactions with nonplayer characters (NPC)—characters not controlled by the players but crucial for gameplay. These interactions provide implicit feedback for designers, offering insights unattainable with traditional dialogue trees—a branching structure of player dialogue choices affecting the narrative.

Creating and designing “Dejaboom!”

To test this hypothesis, we developed a text-adventure game called “Dejaboom!” The game’s premise involves a player waking up at home with déjà vu, recalling an explosion in their village from the day before. The objective is to relive the day and prevent the disaster. Players interact with five NPCs in the village. After a set number of steps, the bomb explodes, causing the player to lose all the items they gathered but retain memories of the NPC interactions. Figure 1 illustrates the game design.

Figure 1 (game design): The figure shows the map of the village where the game takes place. It shows the various locations that the player can explore, including home, park, restaurant, library, blacksmith’s shop, and town hall. It also shows the streets connecting the various locations. In addition to these, there are also two hidden rooms, namely a lab connected to the library and a storage room connected to the blacksmith’s shop. There are several objects placed at various locations that the player can pick up and use. There is a water bucket at home, a redstone torch in the park, shears in the blacksmith’s shop, a journal in the library, a map in the townhall, and a bomb in the storage room. There are five NPCs in the game that the player can interact with. There is Chef Maria in the restaurant, Mrs. Thompson on the residential street, Mad Hatter in the park, Merlin in the lab and Moriarty in the town hall.
Figure. 1. A map of the village shows the locations, objects, and NPCs.

We built the game using TextWorld, an open-source, extensible engine for text adventure games, modifying it to include dialogue with NPCs through OpenAI’s GPT-4 model. TextWorld provided the core game logic, while GPT-4 allowed for dynamic input and output—including both game feedback and NPC responses. Figure 2 illustrates our implementation of the game. In a conventional text game, this setup would allow only a fixed set of player commands and offer a predefined set of game responses. However, the use of GPT-4 allows the game’s input and output to be dynamic.

Figure 2 (game implementation): The figure depicts the implementation of the Dejaboom game. When a player issues a text command, it is first processed by an LLM which classifies it as either an action or words. If it is an action (for example “chase the birds”), then it goes to the fixed game agent which generates a fixed game response (example “this verb is not recognizable”). This response is taken in by another instance of the LLM which generates a more palatable natural language response (example “You tried to chase the birds, but nothing happened”) which is then shown to the player as the game feedback. If the player's text command is classified as words by the LLM classifier (example “can I see your menu”), then it goes to the second instance of the LLM which generates an appropriate NPC response that gets shown to the player (example “Chef Maria: Of course! Our menu today features a delicious selection of Italian-American fusion dishes”).
Figure 2: In our implementation of the game, the user’s commands are classified by GPT-4 as actions or words. Actions are processed by the game agent, while words trigger GPT-4 to generate contextually appropriate NPC responses.

About Microsoft Research

Advancing science and technology to benefit humanity


Narrative analysis and user study

Our goal was to identify narrative paths that players create and how they diverge from the designer’s original narrative. We used GPT-4 to transform player game logs into a narrative graph, where a node represents a player’s strategy at specific points and directed edges (arrows) show game progression. We compared these to a graph of the designer’s intended narrative. We defined emergent nodes as those that appear in the narrative graph of players but are not present in the original narrative graph. 

When we applied this approach to a user study with 28 gamers playing Dejaboom!, we found that players often introduced new strategies and elements, indicating a high level of creative engagement. Those generating the most emergent nodes tended to enjoy games that emphasize discovery, exploration, and experimentation, suggesting that such players are ideally suited for a collaborative approach to game development.

Figure 3 (narrative graph showing emergence): The figure shows a graph with nodes and edges. There are two types of nodes (blue nodes and green nodes). The blue nodes make up the initial narrative graph intended by the game designers whereas the green nodes indicate a few examples of the emergent nodes created by players implicitly through their gameplay. There is also a single start node and a single end node. A single path from the start node to the end node indicates one possible way to stop the explosion.
Figure 3: The single circles indicate the initial narrative graph intended by the designers. The double circles denote the emergent nodes created by players, representing creative new paths.

Implications and looking ahead

Our goal is to build methods that help empower game creators to create novel NPC experiences, design new narratives, and ultimately build entire new worlds through implicit player feedback and progressive application of advanced AI technologies. This work represents a foundational step, marking the start of a new paradigm of game development in which designers, players and generative AI models can collaboratively design and evolve games. Utilizing AI models introduces a new mechanism for capturing implicit player feedback through their emergent behaviors.

The post Players, creators, and AI collaborate to build and expand rich game narratives appeared first on Microsoft Research.

Read More

LLM in a Flash: Efficient Large Language Model Inference with Limited Memory

This paper was accepted at the ACL 2024
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of…Apple Machine Learning Research

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we continue to evaluate the generated preference…Apple Machine Learning Research

BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks

This paper was accepted at IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 2024
Programmers frequently engage with machine learning tutorials in computational notebooks and have been adopting code generation technologies based on large language models (LLMs). However, they encounter difficulties in understanding and working with code produced by LLMs. To mitigate these challenges, we introduce a novel workflow into computational notebooks that augments LLM-based code generation with an additional ephemeral UI step, offering users UI scaffolds as an intermediate stage…Apple Machine Learning Research

Cepsa Química improves the efficiency and accuracy of product stewardship using Amazon Bedrock

Cepsa Química improves the efficiency and accuracy of product stewardship using Amazon Bedrock

This is a guest post co-written with Vicente Cruz Mínguez, Head of Data and Advanced Analytics at Cepsa Química, and Marcos Fernández Díaz, Senior Data Scientist at Keepler.

Generative artificial intelligence (AI) is rapidly emerging as a transformative force, poised to disrupt and reshape businesses of all sizes and across industries. Generative AI empowers organizations to combine their data with the power of machine learning (ML) algorithms to generate human-like content, streamline processes, and unlock innovation. As with all other industries, the energy sector is impacted by the generative AI paradigm shift, unlocking opportunities for innovation and efficiency. One of the areas where generative AI is rapidly showing its value is the streamlining of operational processes, reducing costs, and enhancing overall productivity.

In this post, we explain how Cepsa Química and partner Keepler have implemented a generative AI assistant to increase the efficiency of the product stewardship team when answering compliance queries related to the chemical products they market. To accelerate development, they used Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy and safety.

Cepsa Química, a world leader in the manufacturing of linear alkylbenzene (LAB) and ranking second in the production of phenol, is a company aligned with Cepsa’s Positive Motion strategy for 2030, contributing to the decarbonization and sustainability of its processes through the use of renewable raw materials, development of products with less carbon, and use of waste as raw materials.

At Cepsa’s Digital, IT, Transformation & Operational Excellence (DITEX) department, we work on democratizing the use of AI within our business areas so that it becomes another lever for generating value. Within this context, we identified product stewardship as one of the areas with more potential for value creation through generative AI. We partnered with Keepler, a cloud-centered data services consulting company specialized in the design, construction, deployment, and operation of advanced public cloud analytics custom-made solutions for large organizations, in the creation of the first generative AI solution for one of our corporate teams.

The Safety, Sustainability & Energy Transition team

The Safety, Sustainability & Energy Transition area of Cepsa Química is responsible for all human health, safety, and environmental aspects related to the products manufactured by the company and the associated raw materials, among others. In this field, its areas of action are product safety, regulatory compliance, sustainability, and customer service around safety and compliance.

One of the responsibilities of the Safety, Sustainability & Energy Transition team is product stewardship, which takes care of regulatory compliance of the marketed products. The Product Stewardship department is responsible for managing a large collection of regulatory compliance documents. Their duty involves determining which regulations apply to each specific product in the company’s portfolio, compiling a list of all the applicable regulations for a given product, and supporting other internal teams that might have questions related to these products and regulations. Example questions might be “What are the restrictions for CMR substances?”, “How long do I need to keep the documents related to a toluene sale?”, or “What is the reach characterization ratio and how do I calculate it?” The regulatory content required to answer these questions varies over time, introducing new clauses and repealing others. This work used to consume a significant percentage of the team’s time, so they identified an opportunity to generate value by reducing the search time for regulatory consultations.

The DITEX department engaged with the Safety, Sustainability & Energy Transition team for a preliminary analysis of their pain points and deemed it feasible to use generative AI techniques to speed up the resolution of compliance queries faster. The analysis was conducted for queries based on both unstructured (regulatory documents and product specs sheets) and structured (product catalog) data.

An approach to product stewardship with generative AI

Large language models (LLMs) are trained with vast amounts of information crawled from the internet, capturing considerable knowledge from multiple domains. However, their knowledge is static and tied to the data used during the pre-training phase.

To overcome this limitation and provide dynamism and adaptability to knowledge base changes, we decided to follow a Retrieval Augmented Generation (RAG) approach, in which the LLMs are presented with relevant information extracted from external data sources to provide up-to-date data without the need to retrain the models. This approach is a great fit for a scenario where regulatory information is updated at a fast pace, with frequent derogations, amendments, and new regulations being published.

Additionally, the RAG-based approach enables rapid prototyping of document search use cases, allowing us to craft a solution based on regulatory information about chemical substances in a few weeks.

The solution we built is based on four main functional blocks:

  • Input processing – Input regulatory PDF documents are preprocessed to extract the relevant information. Each document is divided into chunks to ease the indexing and retrieval processes based on semantic meaning.
  • Embeddings generation – An embeddings model is used to encode the semantic information of each chunk into an embeddings vector, which is stored in a vector database, enabling similarity search of user queries.
  • LLM chain service – This service orchestrates the solution by invoking the LLM models with a fitting prompt and creating the response that is returned to the user.
  • User interface – A conversational chatbot enables interaction with users.

We divided the solution into two independent modules: one to batch process input documents and another one to answer user queries by running inference.

Batch ingestion module

The batch ingestion module performs the initial processing of the raw compliance documents and product catalog and generates the embeddings that will be later used to answer user queries. The following diagram illustrates this architecture.

Architecture diagram for the batch ingestion module

The batch ingestion module performs the following tasks:

  1. AWS Glue, a serverless data integration service, is used to run periodical extract, transform, and load (ETL) jobs that read input raw documents and the product catalog from Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance.
  2. The AWS Glue job calls Amazon Textract, an ML service that automatically extracts text, handwriting, layout elements, and data from scanned documents, to process the input PDF documents. After data is extracted, the job performs document chunking, data cleanup, and postprocessing.
  3. The AWS Glue job uses Amazon Bedrock to generate vector embeddings for each document chunk using the Amazon Titan Text Embeddings
  4. Amazon Aurora PostgreSQL-Compatible Edition, a fully managed, PostgreSQL-compatible, and ACID-compliant relational database engine to store the extracted embeddings, is used with the pgvector extension enabled for efficient similarity searches.

Inference module

The inference module transforms user queries into embeddings, retrieves relevant document chunks from the knowledge base using similarity search, and prompts an LLM with the query and retrieved chunks to generate a contextual response. The following diagram illustrates this architecture.

Architecture diagram for the inference module

The inference module implements the following steps:

  1. Users interact through a web portal, which consists of a static website stored in Amazon S3, served through Amazon CloudFront, a content delivery network (CDN), and secured with AWS Cognito, a customer identity and access management platform.
  2. Queries are sent to the backend using a REST API defined in Amazon API Gateway, a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale, and implemented through an API Gateway private integration. The backend is implemented by an LLM chain service running on AWS Fargate, a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. This service orchestrates the interaction with the different LLMs using the LangChain
  3. The LLM chain service invokes Amazon Titan Text Embeddings on Amazon Bedrock to generate the embeddings for the user query.
  4. Based on the query embeddings, the relevant documents are retrieved from the embeddings database using similarity search.
  5. The service composes a prompt that includes the user query and the documents extracted from the knowledge base. The prompt is sent to Anthropic Claude 2.0 on Amazon Bedrock, and the model answer is sent back to the user.

Note on the RAG implementation

The product stewardship chatbot was built before Knowledge Bases for Amazon Bedrock was generally available. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows. Knowledge Bases manages the initial vector store set up, handles the embedding and querying, and provides source attribution and short-term memory needed for production RAG applications.

With Knowledge Bases for Amazon Bedrock, the implementation of steps 3–4 of the Batch Ingestion and Inference modules can be significantly simplified.

Challenges and solutions

In this section, we discuss the challenges we encountered during the development of the system and the decisions we made to overcome those challenges.

Data preprocessing and chunking strategy

We discovered that the input documents contained a variety of structural complexities, which posed a challenge in the processing stage. For instance, some tables contain large amounts of information with minimal context except for the header, which is displayed at the top of the table. This can make it complex to obtain the right answers to user queries, because the retrieval process might lack context.

Additionally, some document annexes are linked to other sections of the document or even other documents, leading to incomplete data retrieval and generation of inaccurate answers.

To address these challenges, we implemented three mitigation strategies:

  • Data chunking – We decided to use larger chunk sizes with significant overlaps to provide maximum context for each chunk during ingestion. However, we set an upper limit to avoid losing the semantic meaning of the chunk.
  • Model selection – We selected a model with a large context window to generate responses that take a larger context into account. Anthropic Claude 2.0 on Amazon Bedrock, with a 100 K context window, provided the most accurate results. (The system was built before Anthropic Claude 2.1 or the Anthropic Claude 3 model family were available on Amazon Bedrock).
  • Query variants – Prior to retrieving documents from the database, multiple variants of the user query are generated using an LLM. Documents for all variants are retrieved and deduplicated before being provided as context for the LLM query.

These three strategies significantly enhanced the retrieval and response accuracy of the RAG system.

Evaluation of results and process refinement

Evaluating the responses from the LLM models is another challenge that is not found in traditional AI use cases. Because of the free text nature of the output, it’s difficult to assess and compare different responses in terms of a metric or KPI, leading to a manual review in most cases. However, a manual process is time-consuming and not scalable.

To minimize the drawbacks, we created a benchmarking dataset with the help of seasoned users, containing the following information:

  • Representative questions that require data combined from different documents
  • Ground truth answers for each question
  • References to the source documents, pages, and line numbers where the right answers are found

Then we implemented an automatic evaluation system with Anthropic Claude 2.0 on Amazon Bedrock, with different prompting strategies to evaluate document retrieval and response formation. This approach allowed for adjustment of different parameters in a fast and automated manner:

  • Preprocessing – Tried different values for chunk size and overlap size
  • Retrieval – Tested several retrieval techniques of incremental complexity
  • Querying – Ran the tests with different LLMs hosted on Amazon Bedrock:
    • Amazon Titan Text Premier
    • Cohere Command v1.4
    • Anthropic Claude Instant
    • Anthropic Claude 2.0

The final solution consists of three chains: one for translating the user query into English, one for generating variations of the input question, and one for composing the final response.

Achieved improvements and next steps

We built a conversational interface for the Safety, Sustainability & Energy Transition team that helps the product stewardship team be more efficient and obtain answers to compliance queries faster. Furthermore, the answers contain references to the input documents used by the LLM to generate the reply, so the team can double-check the response and find additional context if it’s needed. The following screenshot shows an example of the conversational interface.

Example screenshot of a user query and an answer from the chatbot

Some of the qualitative and quantitative improvements identified by the product stewardship team through the use of the solution are:

  • Query times – The following table summarizes the search time saved by query complexity and user seniority (considering all search times have been reduced to less than 1 minute).

 

Complexity

Time saved (minutes)
Junior user Senior user
Low 3.3 2
Medium 9.25 4
High 28 10
  • Answer quality – The implemented system offers additional context and document references that are used by the users to improve the quality of the answer.
  • Operational efficiency – The implemented system has accelerated the regulatory query process, directly enhancing the department operational efficiency.

From the DITEX department, we’re currently working with other business areas at Cepsa Química to identify similar use cases to help create a corporate-wide tool that reuses components from this first initiative and generalizes the use of generative AI across business functions.

Conclusion

In this post, we shared how Cepsa Química and partner Keepler have implemented a generative AI assistant that uses Amazon Bedrock and RAG techniques to process, store, and query the corpus of knowledge related to product stewardship. As a result, users save up to 25 percent of their time when they use the assistant to solve compliance queries.

If you want your business to get started with generative AI, visit Generative AI on AWS and connect with a specialist, or quickly build a generative AI application in PartyRock.


About the authors

Vicente Cruz Mínguez is the Head of Data & Advanced Analytics at Cepsa Química. He has more than 8 years of experience with big data and machine learning projects in financial, retail, energy, and chemical industries. He is currently leading the Data, Advanced Analytics & Cloud Development team in the Digital, IT, Transformation & Operational Excellence department at Cepsa Química, with a focus in feeding the corporate data lake and democratizing data for analysis, machine learning projects, and business analytics. Since 2023, he has also been working on scaling the use of generative AI in all departments.

Marcos Fernández Díaz is a Senior Data Scientist at Keepler, with 10 years of experience developing end-to-end machine learning solutions for different clients and domains, including predictive maintenance, time series forecasting, image classification, object detection, industrial process optimization, and federated machine learning. His main interests include natural language processing and generative AI. Outside of work, he is a travel enthusiast.

Guillermo Menéndez Corral is a Sr. Manager, Solutions Architecture at AWS for Energy and Utilities. He has over 18 years of experience designing and building software products and currently helps AWS customers in the energy industry harness the power of the cloud through innovation and modernization.

Read More

GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs

GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs

GraphStorm is a low-code enterprise graph machine learning (GML) framework to build, train, and deploy graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly take into account the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search/retrieval problems.

Today, we are launching GraphStorm 0.3, adding native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets on different nodes and edges within a single training loop. In addition, GraphStorm 0.3 adds new APIs to customize GraphStorm pipelines: you now only need 12 lines of code to implement a custom node classification training loop. To help you get started with the new API, we have published two Jupyter notebook examples: one for node classification, and one for a link prediction task. We also released a comprehensive study of co-training language models (LM) and graph neural networks (GNN) for large graphs with rich text features using the Microsoft Academic Graph (MAG) dataset from our KDD 2024 paper. The study showcases the performance and scalability of GraphStorm on text rich graphs and the best practices of configuring GML training loops for better performance and efficiency.

Native support for multi-task learning on graphs

Many enterprise applications have graph data associated with multiple tasks on different nodes and edges. For example, retail organizations want to conduct fraud detection on both sellers and buyers. Scientific publishers want to find more related works to cite in their papers and need to select the right subject for their publication to be discoverable. To better model such applications, customers have asked us to support multi-task learning on graphs.

GraphStorm 0.3 supports multi-task learning on graphs with six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You can specify the training targets through a YAML configuration file. For example, a scientific publisher can use the following YAML configuration to simultaneously define a paper subject classification task on paper nodes and a link prediction task on paper-citing-paper edges for the scientific publisher use case:

version: 1.0
    gsf:
        basic: # basic settings of the backbone GNN model
            ...
        ...
        multi_task_learning:
            - node_classification:         # define a node classification task for paper subject prediction.
                target_ntype: "paper"      # the paper nodes are the training targets.
                label_field: "label_class" # the node feature "label_class" contains the training labels.
				mask_fields:
                    - "train_mask_class"   # train mask is named as train_mask_class.
                    - "val_mask_class"     # validation mask is named as val_mask_class.
                    - "test_mask_class"    # test mask is named as test_mask_class.
                num_classes: 10            # There are total 10 different classes (subject) to predict.
                task_weight: 1.0           # The task weight is 1.0.
                
            - link_prediction:                # define a link prediction paper citation recommendation.
                num_negative_edges: 4         # Sample 4 negative edges for each positive edge during training
                num_negative_edges_eval: 100  # Sample 100 negative edges for each positive edge during evaluation
                train_negative_sampler: joint # Share the negative edges between positive edges (to speedup training)
                train_etype:
                    - "paper,citing,paper"    # The target edge type for link prediction training is "paper, citing, paper"
                mask_fields:
                    - "train_mask_lp"         # train mask is named as train_mask_lp.
                    - "val_mask_lp"           # validation mask is named as val_mask_lp.
                    - "test_mask_lp"          # test mask is named as test_mask_lp.
                task_weight: 0.5              # The task weight is 0.5.

For more details about how to run graph multi-task learning with GraphStorm, refer to Multi-task Learning in GraphStorm in our documentation.

New APIs to customize GraphStorm pipelines and components

Since GraphStorm’s release in early 2023, customers have mainly used its command line interface (CLI), which abstracts away the complexity of the graph ML pipeline for you to quickly build, train, and deploy models using common recipes. However, customers are telling us that they want an interface that allows them to customize the training and inference pipeline of GraphStorm to their specific requirements more easily. Based on customer feedback for the experimental APIs we released in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML pipeline APIs. With the new APIs, you only need 12 lines of code to define a custom node classification training pipeline, as illustrated by the following example:

import graphstorm as gs
gs.initialize()

acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

train_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']), fanout=[20, 20], batch_size=64)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)
test_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_test_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)

model = RgcnNCModel(g=acm_data.g, num_hid_layers=2, hid_size=128, num_classes=14)
evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)

trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(evaluator)

trainer.fit(train_dataloader, val_dataloader, test_dataloader, num_epochs=5)

To help you get started with the new APIs, we also have released new Jupyter notebook examples in our Documentation and Tutorials page.

Comprehensive study of LM+GNN for large graphs with rich text features

Many enterprise applications have graphs with text features. In retail search applications, for example, shopping log data provides insights on how text-rich product descriptions, search queries, and customer behavior are related. Foundational large language models (LLMs) alone are not suitable to model such data because the underlying data distributions and relationships don’t correspond to what LLMs learn from their pre-training data corpuses. GML, on the other hand, is great for modeling related data (graphs) but until now, GML practitioners had to manually combine their GML models with LLMs to model text features and get the best performance for their use cases. Especially when the underlying graph dataset was large, this manual work was challenging and time-consuming.

In GraphStorm 0.2, GraphStorm introduced built-in techniques to train language models (LMs) and GNN models together efficiently at scale on massive text-rich graphs. Since then, customers have been asking us for guidance on how GraphStorm’s LM+GNN techniques should be employed to optimize performance. To address this, with GraphStorm 0.3, we released a LM+GNN benchmark using the large graph dataset, Microsoft Academic Graph (MAG), on two standard graph ML tasks: node classification and link prediction. The graph dataset is a heterogeneous graph, contains hundreds of millions of nodes and billions of edges, and the majority of nodes are attributed with rich text features. The detailed statistics of the datasets are shown in the following table.

Dataset Num. of nodes Num. of edges Num. of node/edge types Num. of nodes in NC training set Num. of edges in LP training set Num. of nodes with text-features
MAG 484,511,504 7,520,311,838 4/4 28,679,392 1,313,781,772 240,955,156

We benchmark two main LM-GNN methods in GraphStorm: pre-trained BERT+GNN, a baseline method that is widely adopted, and fine-tuned BERT+GNN, introduced by GraphStorm developers in 2022. With the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features and then train a GNN model for prediction. With the fine-tuned BERT+GNN method, we initially fine-tune the BERT models on the graph data and use the resulting fine-tuned BERT model to compute embeddings that are then used to train a GNN models for prediction. GraphStorm provides different ways to fine-tune the BERT models, depending on the task types. For node classification, we fine-tune the BERT model on the training set with the node classification tasks; for link prediction, we fine-tune the BERT model with the link prediction tasks. In the experiment, we use 8 r5.24xlarge instances for data processing and use 4 g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach has up to 40% better performance (link prediction on MAG) compared to pre-trained BERT+GNN.

The following table shows the model performance of the two methods and the overall computation time of the whole pipeline starting from data processing and graph construction. NC means node classification and LP means link prediction. LM Time Cost means the time spent on computing BERT embeddings and the time spent on fine-tuning the BERT models for pre-trained BERT+GNN and fine-tuned BERT+GNN, respectively.

Dataset Task Data processing time Target Pre-trained BERT + GNN Fine-tuned BERT + GNN
LM Time Cost One epoch time Metric LM Time Cost One epoch time Metric
MAG NC 553 min paper subject 206 min 135 min Acc:0.572 1423 min 137 min Acc:0.633
LP cite 198 min 2195 min Mrr: 0.487 4508 min 2172 min Mrr: 0.684

We also benchmark GraphStorm on large synthetic graphs to showcase its scalability. We generate three synthetic graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time of graph preprocessing, graph partition, and model training. Overall, GraphStorm enables graph construction and model training on 100 billion scale graphs within hours!

Graph Size Data pre-process Graph Partition Model Training
# instances Time # instances Time # instances Time
1B 4 19 min 4 8 min 4 1.5 min
10B 8 31 min 8 41 min 8 8 min
100B 16 61 min 16 416 min 16 50 min

More benchmark details and results are available in our KDD 2024 paper.

Conclusion

GraphStorm 0.3 is published under the Apache-2.0 license to help you tackle your large-scale graph ML challenges, and now offers native support for multi-task learning and new APIs to customize pipelines and other components of GraphStorm. Refer to the GraphStorm GitHub repository and documentation to get started.


About the Author

Xiang Song is a senior applied scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in graph database. He is now leading the development of GraphStorm, an open-source graph machine learning framework for enterprise use cases. He received his Ph.D. in computer systems and architecture at the Fudan University, Shanghai, in 2014.

Jian Zhang is a senior applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian lead technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist – a field in which he holds a phd.

Read More

Few-shot prompt engineering and fine-tuning for LLMs in Amazon Bedrock

Few-shot prompt engineering and fine-tuning for LLMs in Amazon Bedrock

This blog is part of the series, Generative AI and AI/ML in Capital Markets and Financial Services.

Company earnings calls are crucial events that provide transparency into a company’s financial health and prospects. Earnings reports detail a firm’s financials over a specific period, including revenue, net income, earnings per share, balance sheet, and cash flow statement. Earnings calls are live conferences where executives present an overview of results, discuss achievements and challenges, and provide guidance for upcoming periods.

These disclosures are vitally important for capital markets, significantly impacting stock prices. Investors and analysts closely watch key metrics like revenue growth, earnings per share, margins, cash flow, and projections to assess performance against peers and industry trends. The rate of growth and profit margins influence the premium and multiplier that investors are willing to pay for a company’s stock, ultimately affecting stock returns and price movements.

Earnings calls also allow investors to look for new clues about a company’s future. Companies often release information about new products, cutting-edge technology, mergers and acquisitions, and investments in new market themes and trends during these events. Such details can signal potential growth opportunities for investors, analysts, and portfolio managers.

Traditionally, earnings call scripts have followed similar templates, making it a repeatable task to generate them from scratch each time. On the other hand, generative artificial intelligence (AI) models can learn these templates and produce coherent scripts when fed with quarterly financial data. With generative AI, companies can streamline the process of creating first drafts of earnings call scripts for a new quarter using repeatable templates and information about specific performance and business highlights. The initial draft of a large language model (LLM) generated earnings call script can be then refined and customized using feedback from the company’s executives.

Amazon Bedrock offers a straightforward way to build and scale generative AI applications with foundation models (FMs) and LLMs. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Model customization helps you deliver differentiated and personalized user experiences. To customize models for specific tasks, you can privately fine-tune FMs using your own labeled datasets in just a few quick steps.

In this post, we showcase how to generate the first draft of an earnings call script for the new quarter using LLMs. We demonstrate two methods to generate an earnings call script with LLMs: few-shot learning and fine-tuning. We assess the generated earnings call scripts and the applied methods from different dimensions—comprehensiveness, hallucinations, writing style, ease of use, and cost—and present our findings.

Solution overview

We apply two methods to generate the first draft of an earnings call script for the new quarter using LLMs:

  • Prompt engineering with few-shot learning – We use examples of the past earnings scripts with Anthropic Claude 3 Sonnet on Amazon Bedrock to generate an earnings call script for a new quarter.
  • Fine-tuning – We fine-tune Meta Llama 2 70B on Amazon Bedrock using input/output labeled data from the past earnings scripts and use the customized model to generate an earnings call script for a new quarter.

Both methods involve utilizing a consistent dataset of earnings call transcripts across multiple quarters. We use several past years of quarterly earnings calls, with one quarter set aside, which was used as ground truth for testing and comparison.

The process starts by retrieving the earnings call transcripts from the past quarters to the recent quarter. The next step involves selecting multiple scripts from the previous quarters to serve as few-shot learning examples as well as input/output dataset for fine-tuning. The script for the most recent quarter is held out for validation and evaluation of generated scripts. The generated script is evaluated by comparing it with the actual script for the quarter, which was initially kept aside.

The following diagram illustrates the solution architecture and workflow for both methods.

In the following sections, we discuss the workflows of each method in more detail.

Few-shot learning with Anthropic Claude 3 Sonnet on Amazon Bedrock

The prompt engineering for few-shot learning using Anthropic Claude 3 Sonnet is divided into four sections, as shown in the following figure. Three sections have constant instructions to the LLM based on assigning the LLM a role, instructions on style and tone of narrative, and examples for earnings calls from past quarters for few-shot learning. The fourth section has information on financial performance, results, and business highlights for the current quarter for which earnings calls are to be generated by the LLM.

We used Anthropic Claude 3 Sonnet to generate an earnings call for a new quarter using earnings calls from past quarters. The following is an example of our few-shot learning along with prompt instructions:

Section A: Overall prompt instructions (context)
 
You are the CEO and CFO of Any Company preparing to present the quarterly earnings report to investors. Draft a comprehensive earnings call script that covers the key financial metrics, business highlights, and future outlook for the given quarter. Provide details on revenue, operating income, segment performance, and important strategic initiatives or product launches during the quarter.

Section B: Specific guidance for the earnings script (context)
 
The earnings script should be written in a formal, investor-friendly tone suitable for a public earnings call. Use clear and concise language to explain financial performance and business developments. Aim to strike a balance between providing sufficient details and keeping the script reasonably concise. Incorporate specific data points and figures but avoid overwhelming with excessive numerical minutiae. The overall structure should flow logically, covering key topics like revenue, operating income, segment highlights, strategic priorities, and forward-looking guidance. Use the following 5 instructions when generating results for the earnings call script.

1. Provide a clear structure by organizing the content into logical sections, such as financial highlights, segment performance, operational metrics, strategic initiatives, and a forward-looking view. 
2. Include granular details and insights into the factors impacting performance, such as customer behavior trends, supply chain improvements, cost optimization efforts, and any other relevant context etc.
3. Substantiate your commentary with specific data points and percentages to lend credibility to your statements. 4. Offer a comprehensive forward-looking view by discussing capital investments, preparedness for upcoming events or seasons, and the long-term strategic focus or priorities. 
5. Maintain a measured, objective, and analytical tone throughout the content, avoiding overly conversational or casual language.

Section C: Example Scripts from past quarters (for Few Shot/ Chain-of-thought) 

The example scripts from past quarters provide a reference for the structure, tone, and level of detail expected in an earnings call script. Use these examples to understand how to present financial data, highlight key business initiatives, and address investor concerns or questions. However, ensure that the script for current specific Quarter is tailored to the specific financial performance and business events of that quarter.
<example>
Amazon Earnings call transcript for Q1 2021 ...

Amazon Earnings call transcript for Q2 2021 ...
<example>

Section D: Financial data for quarter for which script is required (context)

<financial_data>

Provide the actual financial results for the specific quarter, including:
Total revenue and year-over-year growth rate
Revenue breakdown by key segments (e.g. AWS, Online Stores, etc.)
Operating income (total and by segment if available)
Any key operating metrics (e.g. Prime membership, third-party seller metrics, etc.)
Notes on significant factors impacting results (e.g. foreign exchange, product launches, one-time events)
Forward-looking guidance on revenue, operating income for next quarter
Highlight key business developments, product launches or strategic priorities for the quarter :

<financial_data>

Fine-tune Meta Llama 2 70B on Amazon Bedrock

In this section, we present our approach to improving the quality of generated earnings call scripts by fine-tuning an LLM. We chose to adapt the Meta Llama 2 70B model, which is powerful and known for its strong performance across various natural languages tasks, to the specific domain of earnings call scripts.

The following diagram illustrates the workflow for our fine-tuning method.

To prepare the training data, we collected a comprehensive dataset of real earnings call transcripts from Q1 2021 to Q4 2022 for Amazon.com. This focused dataset allows the model to better learn the company’s domain-specific knowledge and terminology. The time span also makes sure the model can learn from recent trends and patterns in earnings communications.

Amazon Bedrock offers a model customization feature that enables you to directly use your own data to customize a wide variety of models. This feature not only helps improve model performance on specific tasks but also allows the model to better understand company-specific domain knowledge and terms, ultimately creating a better user experience.

To fine-tune a text-to-text model, you need to prepare training and optional validation datasets by creating a JSONL file with multiple JSON lines. Each JSON line is a sample containing both a prompt and completion field. In our use case, the prompt contains the prompt template, which includes key financial data for that quarter, and the completion field contains the actual earnings call transcript for that quarter.

We use the following prompt template:

{"prompt": ”Section A: Overall prompt instructions (context)… Section B: Specific guidance for the earnings script (context)… Section D: Financial data for Q1 2021 for which script is required (context) The financial data for {time_period} is:
<financial_data>{Section D}<financial_data> Please generate the earning report for {time_period} to the investors, based on the information provided above. Don't make up any information. ", "completion": ”Real earning call script for that Q1 2021"}

The training data is prepared in JSONL format, with each line representing an earnings call for a quarter:

{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
{"prompt": "<prompt3>", "completion": "<expected generated text>"}

When the dataset is ready, we upload it to Amazon Simple Storage Service (Amazon S3) and set up a customization job in Amazon Bedrock. The training time varies from minutes to hours, depending on the size of the training data and the selected model. After the training job is complete, you must purchase Provisioned Throughput to use the model and generate future earnings call scripts. You can select the No Commitment option for Provisioned Throughput, which is billed on an hourly basis.

For inference, because some language models require a clear separation between the input prompt and expected output during fine-tuning, we need to add a special delimiting key before providing the input to the model. Specifically, for the Meta Llama 2 70B model, we add the key nn Response:n after the input prompt. This delimiter helps the model distinguish where the prompt ends and the expected response should begin, allowing it to generate more accurate outputs. The prompt would look as follows:

Prompt:
{User_Input_Prompt}

Response:

By providing this formatted prompt during inference, the fine-tuned Meta Llama 2 70B model can better understand the input context and generate a more relevant earnings call script as the response.

For better performance, you can use the same prompt template with the current quarter’s financial data (without the few-shot learning examples), format it with the delimiter, and send it to the customized model to generate the final earnings call script for that quarter.

Evaluation of few-shot prompt engineering and fine-tuning

We evaluated the generated earnings call transcripts from both methods (few-shot prompt engineering and fine-tuning) using two different approaches:

  • Evaluated by a human reviewer
  • Evaluated by comparing three variations using an LLM (Anthropic Claude 3 Sonnet)

Evaluated by human reviewer

The following table summarizes a human reviewer’s evaluation.

It is imperative to note that two factors contributed to the differences: varying approaches (few-shot learning and fine-tuning) and disparate models (Anthropic Claude 3 and Meta Llama 70B). Consequently, the results cannot be interpreted as a mere comparison of models. It is advisable to explore the approaches with your specific use case and data, and subsequently evaluate the outcomes by discussing with subject matter experts from the relevant business department.

Factor Fine-Tuned Model Few-shot Prompt Engineering
Comprehensiveness The script covers most of the key points provided in the prompts, although it ignored a few details. For example, it misses the point that the growth in advertising was primarily driven by using machine learning models to improve relevancy of ads. The script covers key points provided in the prompts.
Hallucination Two instances. (1) “This growth was driven by strong demand for our Prime Day event, which saw record-breaking sales and attracted millions of new Prime members.” (2) “This growth was driven by strong demand in our key markets, including India and Japan. Once. (1) “In North America, revenue grew 11% year-over-year to $87.9 billion, fueled by continued robust demand and greater purchase frequency by Prime Members.
Writing style (1) This script uses mostly objective and precise language, which is consistent with the real earnings call. Still, it has subjective expressions such as “a huge success,” and imprecise expressions such as “double digit growth.” (2) The language offers less variations. For example, it uses the format of “This ___ was driven by ___” 10 times without variations. (3) The model generated some additional sentences. For example, “Now, let’s turn to our forward guidance. At this time, we’re not providing specific revenue or operating income guidance for the fourth quarter. The real earnings call uses precise and objective language, while this script uses more metaphoric expressions such as “laser-focused” and “made further strides,” as well as subjective expressions such as “invest prudently” and “disciplined execution.
Ease of Use (1) Fine-tuning a model in Amazon Bedrock gives the option of following steps on the Amazon Bedrock console or apply coding to interact with LLMs on Amazon Bedrock through the API. (2) The fine-tuning process generally takes longer compared to few-shot prompt engineering based on the same documents. (3) Fine-tuning requires preparing data in input/output format (JSON files) for training the selected model. (4) If a new document is added, the whole fine-tuned model needs to be updated by going through the same fine-tuning process. (1) Amazon Bedrock allows users to give instructions and example data to an LLM as is using both the UI or creating reproducible codes. (2) If a new document is added, the user only needs to add to the prompt an example for few-shot learning or prompt instructions. Overall, few-shot prompt engineering is easier to implement, compared to fine-tuning a model.
Cost Monthly cost incurred for fine-tuning = Fine-tuning training cost for the model (priced by number of tokens for training data) + custom model storage per month + hourly cost (or Provisioned Throughput cost for time commitment) of custom model inference. Priced by number of input (few-shot prompts and examples) and output tokens for the model.

The cost comparison can be further evaluated by the frequency of usage, as shown in the following table.

Method One-Time Cost Recurring Cost Inference Cost
Fine-Tuning Priced by the number of tokens for training data Custom model storage cost per month Custom model inference cost (hourly or Provisioned Throughput commitment)
Few-Shot Prompt Engineering N/A N/A Priced by number of input (prompts and examples) and output tokens

Evaluated by comparing three variations using an LLM

We tested the following variations:

  • Variation A – Earnings call transcript from few-shot learning with Anthropic Claude v3 Sonnet
  • Variation B – Earnings call transcript with fine-tuned Meta Llama 70B
  • Variation C – Actual earnings call transcript for the quarter

The following table summarizes the key similarities and differences between the three variations of the Amazon Q3 2023 earnings call transcript. Variation A and Variation B have two main differences – different approaches (few-shot learning vs fine-tuning) and different models (Anthropic Claude 3 vs Meta Llama 70B).

. Identified Factor Result Summaries
Similarities Financial Metrics All variations report strong financial results, with revenue growth around 11% year-over-year and significant increases in operating income.
Business Highlights They highlight the success of Prime Day as a major driver of sales and Prime member growth. The transcripts mention continued growth in third-party seller services, advertising, and AWS.
Management Focus There is a focus on improving operational efficiency, cost optimization, and supply chain/delivery improvements.
Innovation and Partnerships Generative AI initiatives and partnerships (such as Anthropic, Amazon Bedrock, and Amazon CodeWhisperer) are discussed in relation to AWS.
Dissimilarities Level of Financial Detail Variation A provides more detailed financials (exact revenue, operating income figures) than B and C.
Narrative/ Commentary Style – Variation B has more personal commentary from “Jeff Bezos” and “Brian Olsavsky” compared to A and C’s more generic and impersonal style.
Level of Business Detail – Variation C goes into more specifics on initiatives like regionalization, inventory optimization, and cost reduction efforts. Variation A discusses priorities and forward-looking initiatives in more depth compared to B and C.
Forward Guidance Only Variation C mentions actual forward guidance on capital investments for 2023.

Moreover, we can compare the difference between A vs. C and B vs. C to better compare the generated results to the actual earning scripts.

Identified Factor Difference between A & C Difference between B & C
Financial Details A lacks some of the specific financial details and figures present in the actual script. B is more similar to the actual script in terms of providing segment-wise financial figures and percentages.
Depth of Content A mentions broad themes and priorities, whereas C dives deeper into operational metrics, cost savings initiatives, and strategic updates. C provides additional details on topics like free cash flow, capital investments, and strategic initiatives like generative AI.

Overall, although the core financial highlights are similar, there are nuances in the depth of details provided and the narrative and commentary style across the three variations.

Conclusion

Generating high-quality earnings call script drafts using LLMs is a promising approach that can streamline the process for companies. Both the few-shot prompt engineering and fine-tuning methods demonstrated the ability to produce scripts covering key financial metrics, business updates, and forward-looking guidance. Each method has its own nuances. However, there are trade-offs in terms of comprehensiveness, hallucinations, writing style, ease of implementation, and cost that companies must evaluate based on their specific needs and priorities. As language models continue advancing, further research in customizing and refining these models for the financial services and capital markets domain could unlock even more value for financial communications processes.

This blog presents a framework for two different approaches: few-shot prompt engineering and fine-tuning with Large Language Models (LLMs), followed by an evaluation of the results. The findings should not be interpreted as prescriptive recommendations for favoring one approach over the other, as the choice depends on the specific content and prompts. Additionally, the results should not be construed as a direct comparison of LLMs, as the methodologies employed with each LLM differ, making it an apples-to-oranges comparison. As LLMs continue to advance, we anticipate further improvements in their output quality.

As next steps, you can use Amazon Bedrock to explore your own data and use cases. You can engage in few-shot prompt engineering and fine-tuning methods with different LLMs on Amazon Bedrock, using your specific data securely and privately. Furthermore, you can evaluate the results of these methods by collaborating with subject matter experts or using evaluation frameworks, enabling you to assess the performance and suitability of the methods and LLMs on Amazon Bedrock for your particular use case. You can try out and compare the results, and either use prompt engineering or deploy your own fine-tuned model to generate the earnings calls tied to your company. You can also evaluate both approaches for any related use case.

Refer to Prompt engineering guidelines and Custom models for more information about these two methods. To learn more about applying generative AI for investment research, please refer to AI-powered assistants for investment research with multi-modal data: An application of Agents for Amazon Bedrock.

Refer to this blog to find out more about, empowering analysts to perform financial statement analysis, hypothesis testing, and cause-effect analysis with Amazon Bedrock, Anthropic Claude 3 Sonnet, and prompt engineering


About the Authors

Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers leverage GenAI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a Ph.D. degree in Electrical Engineering. Outside of work, she loves traveling, working out and exploring new things.

Jia (Vivian) Li is a Senior Solutions Architect in AWS, with specialization in AI/ML. She currently supports customers in financial industry. Prior to joining AWS in 2022, she had 7 years of experience supporting enterprise customers use AI/ML in the cloud to drive business results. Vivian has a BS from Peking University and a PhD from University of Southern California. In her spare time, she enjoys all the water activities, and hiking in the beautiful mountains in her home state, Colorado.

Read More