Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain

In today's information age, the vast volumes of data housed in countless documents present both a challenge and an opportunity for businesses. Traditional document processing methods often fall short in efficiency and accuracy, leaving room for innovation, cost savings, and optimization. Document processing has witnessed significant advancements with the advent of Intelligent Document Processing (IDP). With IDP, businesses can transform unstructured data from various document types into structured, actionable insights, dramatically enhancing efficiency and reducing manual effort. However, the potential doesn't end there. By integrating generative artificial intelligence (AI) into the process, we can further enhance IDP capabilities. Generative AI not only enhances document processing capabilities, it also brings dynamic adaptability to changing data patterns. This post takes you through the synergy of IDP and generative AI, unveiling how they represent the next frontier in document processing.

We discuss IDP in detail in our series Intelligent document processing with AWS AI services (Part 1 and Part 2). In this post, we discuss how to extend a new or existing IDP architecture with large language models (LLMs). More specifically, we discuss how we can integrate Amazon Textract with LangChain as a document loader and Amazon Bedrock to extract data from documents and use generative AI capabilities within the various IDP phases.

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) through easy-to-use APIs.

Solution overview

The following diagram is a high-level reference architecture that explains how you can further enhance an IDP workflow with foundation models. You can use LLMs in one or all phases of IDP depending on the use case and desired outcome.

In this architecture, LLMs are used to perform specific tasks within the IDP workflow.

  • Document classification – In addition to using Amazon Comprehend, you can use an LLM to classify documents using few-shot prompting. Few-shot prompting involves giving the language model a few examples of different classes and a list of all possible classes, and then asking the model to classify a given piece of text from a document into one of the classes.
  • Summarization – You can also use LLMs to summarize larger documents to provide precise summaries within the extraction phase of IDP. For example, a financial analysis system may involve analyzing hundreds of pages of earnings documents of a company. You can use a language model to summarize the key aspects of the earnings, enabling analysts to make business decisions.
  • Standardization and in-context Q&A – In addition to extracting exact information out of documents using the Amazon Textract Analyze Document functionality, you can use LLMs to extract information that may otherwise not be explicitly inferred from a document. For example, a patient discharge summary may have the patient’s hospital admit date and discharge date but may not explicitly specify the total number of days the patient was in the hospital. You can use an LLM to deduce the total number of days the patient was admitted in the hospital, given the two dates extracted by Amazon Textract. This value can then be assigned with a well-known alias in a key-value format, also known as a normalized key, which makes consumption and post-processing even more straightforward.
  • Templating and normalizations – An IDP pipeline often generates output that must conform to a specific deterministic schema, so that the output generated by the IDP workflow can be consumed by a downstream system, for example a relational database. Defining a deterministic schema also achieves key normalization, so that we have a known set of keys to process in our postprocessing logic. For example, we may want to define "DOB" as a normalized key for "date of birth," "birth date," "birthday date," "date born," and so on, because documents may come with any variation of these. We use LLMs to perform such templating and normalized key-value extractions on any document.
  • Spellcheck and corrections – Although Amazon Textract can extract the exact values from scanned documents (printed or handwritten), you can use a language model to identify whether misspellings and grammatical errors exist in the data extracted from them. This is important in situations where the data may be extracted from poor-quality or handwritten documents and used for generating marketing materials, flash reports, and so on. In addition to having a human manually review low-score extractions from Amazon Textract, you can use an LLM to augment the review process by providing correction recommendations to the human reviewer, thereby speeding up the review process.

In the following sections, we dive deep into how Amazon Textract is integrated into generative AI workflows using LangChain to process documents for each of these specific tasks. The code blocks provided here have been trimmed down for brevity. Refer to our GitHub repository for detailed Python notebooks and a step-by-step walkthrough.

Amazon Textract LangChain document loader

Text extraction from documents is a crucial aspect when it comes to processing documents with LLMs. You can use Amazon Textract to extract unstructured raw text from documents and preserve the original semi-structured or structured objects like key-value pairs and tables present in the document. Document packages like healthcare and insurance claims or mortgages consist of complex forms that contain a lot of information across structured, semi-structured, and unstructured formats. Document extraction is an important step here because LLMs benefit from this rich content when generating more accurate and relevant responses; losing it could otherwise degrade the quality of the LLMs' output.

LangChain is a powerful open-source framework for integrating with LLMs. LLMs in general are versatile but may struggle with domain-specific tasks where deeper context and nuanced responses are needed. LangChain empowers developers in such scenarios to build agents that can break down complex tasks into smaller sub-tasks. The sub-tasks can then introduce context and memory into LLMs by connecting and chaining LLM prompts.

LangChain offers document loaders that can load and transform data from documents. You can use them to structure documents into preferred formats that can be processed by LLMs. The AmazonTextractPDFLoader is a service loader type of document loader that provides a quick way to automate document processing by using Amazon Textract in combination with LangChain. For more details on AmazonTextractPDFLoader, refer to the LangChain documentation. To use the Amazon Textract document loader, you start by importing it from the LangChain library:

from langchain.document_loaders import AmazonTextractPDFLoader

You can load a document from a HTTPS URL endpoint as well as documents hosted in Amazon Simple Storage Service (Amazon S3) buckets via Amazon S3 object URLs (also called path style access):

https_loader = AmazonTextractPDFLoader("https://sample-website.com/sample-doc.pdf")
https_document = https_loader.load()

s3_loader = AmazonTextractPDFLoader("s3://sample-bucket/sample-doc.pdf")
s3_document = s3_loader.load()

You can also store documents in Amazon S3 and refer to them using the s3:// URL pattern, as explained in Accessing a bucket using S3://, and pass this S3 path to the Amazon Textract PDF loader:

import boto3
textract_client = boto3.client('textract', region_name='us-east-2')

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()

A multi-page document contains multiple pages of text, which can be accessed via the documents object, a list of pages. The following code loops through the pages in the documents object and prints the document text, which is available via the page_content attribute:

print(len(documents))

for document in documents:
    print(document.page_content)

Document classification

Amazon Comprehend and LLMs can be effectively utilized for document classification. Amazon Comprehend is a natural language processing (NLP) service that uses ML to extract insights from text. Amazon Comprehend also supports custom classification model training with layout awareness on documents like PDFs, Word, and image formats. For more information about using the Amazon Comprehend document classifier, refer to Amazon Comprehend document classifier adds layout support for higher accuracy.

When paired with LLMs, document classification becomes a powerful approach for managing large volumes of documents. LLMs are helpful in document classification because they can analyze the text, patterns, and contextual elements in the document using natural language understanding. You can also fine-tune them for specific document classes. When a new document type introduced in the IDP pipeline needs classification, the LLM can process the text and categorize the document given a set of classes. The following is sample code that uses the LangChain document loader powered by Amazon Textract to extract the text from the document and use it for classifying the document. We use the Anthropic Claude v2 model via Amazon Bedrock to perform the classification.

In the following example, we first extract text from a patient discharge report and use an LLM to classify it given a list of three different document types—DISCHARGE_SUMMARY, RECEIPT, and PRESCRIPTION. The following screenshot shows our report.

We use the following code:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/document.png")
document = loader.load()

template = """

Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.

<classes>DISCHARGE_SUMMARY, RECEIPT, PRESCRIPTION</classes>
<document>{doc_text}</document>
<classification>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
class_name = llm_chain.run(document[0].page_content)

print(f"The provided document is = {class_name}")

The code produces the following output:

The provided document is = DISCHARGE_SUMMARY

Document summarization

Summarization involves condensing a given text or document into a shorter version while retaining its key information. This technique is beneficial for efficient information retrieval, which enables users to quickly grasp the key points of a document without reading the entire content. Although Amazon Textract doesn’t directly perform text summarization, it provides the foundational capabilities of extracting the entire text from documents. This extracted text serves as an input to our LLM model for performing text summarization tasks.

Using the same sample discharge report, we use AmazonTextractPDFLoader to extract text from this document. As before, we use the Claude v2 model via Amazon Bedrock and initialize it with a prompt that contains the instructions on what to do with the text (in this case, summarization). Finally, we run the LLM chain by passing in the extracted text from the document loader. This runs an inference action on the LLM with the prompt that consists of the instructions to summarize, and the document’s text marked by Document. See the following code:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """

Given a full document, give me a concise summary. Skip any preamble text and just give the summary.

<document>{doc_text}</document>
<summary>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

num_tokens = bedrock_llm.get_num_tokens(document[0].page_content)
print (f"Our prompt has {num_tokens} tokens nn=========================n")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
summary = llm_chain.run(document[0].page_content)

print(summary.replace("</summary>","").strip())

The code generates the summary of a patient discharge summary report:

Our prompt has 797 tokens 
=========================
35 yo M admitted for epigastric abdominal pain, nausea, fatigue. Found to likely have ulcer. Discharged with activity restrictions, antibiotics, diet changes, and follow up.

The preceding example used a single-page document to perform summarization. However, you will likely deal with documents containing multiple pages that need summarization. A common way to perform summarization on multiple pages is to first generate summaries on smaller chunks of text and then combine the smaller summaries to get a final summary of the document. Note that this method requires multiple calls to the LLM. The logic for this can be crafted easily; however, LangChain provides a built-in summarize chain that can summarize large texts (from multi-page documents). The summarization can happen via either the map_reduce or the stuff chain type, which manage the multiple calls to the LLM. In the following example, we use map_reduce to summarize a multi-page document. The following figure illustrates our workflow.

Let’s first start by extracting the document and see the total token count per page and the total number of pages:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

loader = AmazonTextractPDFLoader(f"s3://{data_bucket}/bedrock-sample/health_plan.pdf")
document = loader.load()
num_docs = len(document)
print (f"There are {num_docs} pages in the document")
for index, doc in enumerate(document):
    num_tokens_first_doc = bedrock_llm.get_num_tokens(doc.page_content)
    print (f"Page {index+1} has approx. {num_tokens_first_doc} tokens")

There are 5 pages in the document
Page 1 has approx. 533 tokens
Page 2 has approx. 1323 tokens
Page 3 has approx. 997 tokens
Page 4 has approx. 1643 tokens
Page 5 has approx. 867 tokens

Next, we use LangChain’s built-in load_summarize_chain to summarize the entire document:

from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=bedrock_llm, chain_type='map_reduce')
output = summary_chain.run(document)
print(output.strip())
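
If you need more control over the intermediate page summaries and the final combined summary, load_summarize_chain also accepts custom map and combine prompts. The following is a minimal sketch; the prompt wording is our own illustration and not taken from the sample notebooks:

from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# Per-page (map) prompt and final (combine) prompt; the wording here is illustrative
map_template = """
Summarize the key points of the following text in a few sentences. Skip any preamble text and just give the summary.

<text>{text}</text>
<summary>"""

combine_template = """
Combine the following page summaries into one concise overall summary. Skip any preamble text and just give the summary.

<summaries>{text}</summaries>
<summary>"""

map_prompt = PromptTemplate(template=map_template, input_variables=["text"])
combine_prompt = PromptTemplate(template=combine_template, input_variables=["text"])

summary_chain = load_summarize_chain(llm=bedrock_llm,
                                     chain_type="map_reduce",
                                     map_prompt=map_prompt,
                                     combine_prompt=combine_prompt)
output = summary_chain.run(document)
print(output.strip())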

Standardization and Q&A

In this section, we discuss standardization and Q&A tasks.

Standardization

Output standardization is a text generation task where LLMs are used to provide consistent formatting of the output text. This task is particularly useful for automating key entity extraction that requires the output to be aligned with desired formats. For example, we can follow prompt engineering best practices to prompt an LLM to format dates into DD/MM/YYYY format, which may be compatible with a database DATE column. The following code block shows an example of how this is done using an LLM and prompt engineering. Not only do we standardize the output format for the date values, we also prompt the model to generate the final output in JSON format so that it is easily consumable in our downstream applications. We use LangChain Expression Language (LCEL) to chain together two actions. The first action prompts the LLM to generate a JSON format output of just the dates from the document. The second action takes the JSON output and standardizes the date format. Note that this two-step action may also be performed in a single step with proper prompt engineering, as we'll see in the templating and normalizations section.

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

template1 = """

Given a full document, answer the question and format the output in the format specified. Skip any preamble text and just generate the JSON.

<format>
{{
  "key_name":"key_value"
}}
</format>
<document>{doc_text}</document>
<question>{question}</question>"""

template2 = """

Given a JSON document, format the dates in the value fields precisely in the provided format. Skip any preamble text and just generate the JSON.

<format>DD/MM/YYYY</format>
<json_document>{json_doc}</json_document>
"""


prompt1 = PromptTemplate(template=template1, input_variables=["doc_text", "question"])
llm_chain = LLMChain(prompt=prompt1, llm=bedrock_llm, verbose=True)

prompt2 = PromptTemplate(template=template2, input_variables=["json_doc"])
llm_chain2 = LLMChain(prompt=prompt2, llm=bedrock_llm, verbose=True)

chain = ( 
    llm_chain 
    | {'json_doc': lambda x: x['text'] }  
    | llm_chain2
)

std_op = chain.invoke({ "doc_text": document[0].page_content, 
                        "question": "Can you give me the patient admitted and discharge dates?"})

print(std_op['text'])

{
  "admit_date":"07/09/2020",
  "discharge_date":"08/09/2020"
}

The output of the preceding code sample is a JSON structure with the dates 07/09/2020 and 08/09/2020, which are in DD/MM/YYYY format and are the patient's admit and discharge dates from the hospital, respectively, according to the discharge summary report.
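
Because the chain returns the JSON as plain text, you can parse it into a Python dictionary before writing it to a downstream system. The following is a minimal example, assuming the model returned well-formed JSON as shown in the preceding output:

import json

# Parse the model's JSON output into a Python dictionary for downstream use
dates = json.loads(std_op['text'])
print(dates["admit_date"])      # 07/09/2020
print(dates["discharge_date"])  # 08/09/2020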

Q&A with Retrieval Augmented Generation

LLMs are known to retain factual information, often referred to as their world knowledge or world view. When fine-tuned, they can produce state-of-the-art results. However, there are constraints to how effectively an LLM can access and manipulate this knowledge. As a result, in tasks that heavily rely on specific knowledge, their performance might not be optimal for certain use cases. For instance, in Q&A scenarios, it’s essential for the model to adhere strictly to the context provided in the document without relying solely on its world knowledge. Deviating from this can lead to misrepresentations, inaccuracies, or even incorrect responses. The most commonly used method to address this problem is known as Retrieval Augmented Generation (RAG). This approach synergizes the strengths of both retrieval models and language models, enhancing the precision and quality of the responses generated.

LLMs also have token limits due to memory constraints and the limitations of the hardware they run on. To handle this problem, techniques like chunking are used to divide large documents into smaller portions that fit within the token limits of LLMs. On the other hand, embeddings are employed in NLP primarily to capture the semantic meaning of words and their relationships with other words in a high-dimensional space. These embeddings transform words into vectors, allowing models to efficiently process and understand textual data. By understanding the semantic nuances between words and phrases, embeddings enable LLMs to generate coherent and contextually relevant outputs. Note the following key terms:

  • Chunking – This process breaks down large amounts of text from documents into smaller, meaningful chunks of text.
  • Embeddings – These are fixed-dimensional vector transformations of each chunk that retain the semantic information from the chunks. These embeddings are subsequently loaded into a vector database.
  • Vector database – This is a database of word embeddings or vectors that represent the context of words. It acts as a knowledge source that aids NLP tasks in document processing pipelines. The benefit of the vector database here is that it allows only the necessary context to be provided to the LLMs during text generation, as we explain in the following section.

RAG uses the power of embeddings to understand and fetch relevant document segments during the retrieval phase. By doing so, RAG can work within the token limitations of LLMs, ensuring the most pertinent information is selected for generation, resulting in more accurate and contextually relevant outputs.

The following diagram illustrates the integration of these techniques to craft the input to LLMs, enhancing their contextual understanding and enabling more relevant in-context responses. One approach involves similarity search, utilizing both a vector database and chunking. The vector database stores embeddings representing semantic information, and chunking divides text into manageable sections. Using this context from similarity search, LLMs can run tasks such as question answering and domain-specific operations like classification and enrichment.

For this post, we use a RAG-based approach to perform in-context Q&A with documents. In the following code sample, we extract text from a document and then split the document into smaller chunks of text. Chunking is required because we may have large multi-page documents and our LLMs may have token limits. These chunks are then loaded into the vector database for performing similarity search in the subsequent steps. In the following example, we use the Amazon Titan Embed Text v1 model, which performs the vector embeddings of the document chunks:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.chains import RetrievalQA
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate

loader = AmazonTextractPDFLoader("amazon_10k.pdf")
document = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
                                               separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
                                               chunk_overlap=0)
texts = text_splitter.split_documents(document)
embeddings = BedrockEmbeddings(client=bedrock,
                               model_id="amazon.titan-embed-text-v1")
db = FAISS.from_documents(documents=texts,
                           embedding=embeddings) 

retriever = db.as_retriever(search_type='mmr', search_kwargs={"k": 3})

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.

<text>{context}</text>
<question>{question}</question>
<answer>"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context","question"])

chain_type_kwargs = { "prompt": qa_prompt, "verbose": False } # change verbose to True if you need to see what's happening

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")
qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm, 
    chain_type="stuff", 
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    verbose=False # change verbose to True if you need to see what's happening
)

question="Who is the administrator for this plan?"
result = qa.run(question)
print(result.strip())

The code creates a relevant context for the LLM using the chunks of text that are returned by the similarity search action from the vector database. For this example, we use an open-source FAISS vector store as a sample vector database to store vector embeddings of each chunk of text. We then define the vector database as a LangChain retriever, which is passed into the RetrievalQA chain. This internally runs a similarity search query on the vector database that returns the top n (where n=3 in our example) chunks of text that are relevant to the question. Finally, the LLM chain is run with the relevant context (a group of relevant chunks of text) and the question for the LLM to answer. For a step-by-step code walkthrough of Q&A with RAG, refer to the Python notebook on GitHub.
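
To inspect the context that the retriever passes to the LLM, you can also query it directly before running the chain. This is a quick sanity check rather than part of the main workflow, and the output depends on your document and embeddings:

relevant_chunks = retriever.get_relevant_documents(question)
print(f"Retrieved {len(relevant_chunks)} chunks")
for i, chunk in enumerate(relevant_chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk.page_content[:200])  # first 200 characters of each retrieved chunk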

As an alternative to FAISS, you can also use Amazon OpenSearch Service vector database capabilities, Amazon Relational Database Service (Amazon RDS) for PostgreSQL with the pgvector extension as vector databases, or open-source Chroma Database.

Q&A with tabular data

Tabular data within documents can be challenging for LLMs to process because of its structural complexity. Amazon Textract complements LLMs here because it can extract tables from documents as a nested structure of elements such as pages, tables, and cells. Performing Q&A with tabular data is a multi-step process, and can be achieved via self-querying. The following is an overview of the steps:

  1. Extract tables from documents using Amazon Textract. With Amazon Textract, the tabular structure (rows, columns, headers) can be extracted from a document.
  2. Store the tabular data in a vector database along with metadata information, such as the header names and the description of each header.
  3. Use an LLM to construct a structured query from the prompt to derive the data from the table.
  4. Use the query to extract the relevant table data from the vector database.

For example, in a bank statement, given the prompt “What are the transactions with more than $1000 in deposits,” the LLM would complete the following steps:

  1. Craft a query, such as "Query: transactions", "filter: greater than (Deposit$)".
  2. Convert the query into a structured query.
  3. Apply the structured query to the vector database where our table data is stored.

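The following is a minimal sketch of the first step only, using the Amazon Textract AnalyzeDocument API through boto3 to extract table cells and reconstruct them as rows. The bucket and document names are illustrative placeholders, and the complete self-querying flow is covered in the notebook referenced below:

import boto3

# Synchronous AnalyzeDocument call with the TABLES feature for a single-page document;
# multi-page PDFs require the asynchronous StartDocumentAnalysis API instead.
# The bucket and document names below are illustrative placeholders.
textract = boto3.client("textract", region_name="us-east-2")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-sample-bucket", "Name": "bank-statement.png"}},
    FeatureTypes=["TABLES"],
)

# Index every block by ID, then walk TABLE -> CELL -> WORD relationships to rebuild rows
blocks = {block["Id"]: block for block in response["Blocks"]}

def cell_text(cell):
    # Concatenate the WORD blocks referenced by a CELL block
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

for block in response["Blocks"]:
    if block["BlockType"] == "TABLE":
        rows = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    cell = blocks[child_id]
                    if cell["BlockType"] == "CELL":
                        rows.setdefault(cell["RowIndex"], {})[cell["ColumnIndex"]] = cell_text(cell)
        # Print each reconstructed row; these rows (plus header metadata) are what get stored in the vector database
        for row_index in sorted(rows):
            print([rows[row_index][col] for col in sorted(rows[row_index])])
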
For a step-by-step sample code walkthrough of Q&A with tabular data, refer to the Python notebook in GitHub.

Templating and normalizations

In this section, we look at how to use prompt engineering techniques and LangChain’s built-in mechanism to generate an output with extractions from a document in a specified schema. We also perform some standardization on the extracted data, using the techniques discussed previously. We start by defining a template for our desired output. This will serve as a schema and encapsulate the details about each entity we want to extract from the document’s text.

output_template= {
    "doctor_name":{ "type": "string", "description": "The doctor or provider's full name" },
    "provider_id":{ "type": "string", "description": "The doctor or provider's ID" },
    "patient_name":{ "type": "string", "description": "The patient's full name" },
    …
}

Note that for each of the entities, we use the description to explain what that entity is in order to help the LLM extract the value from the document's text. In the following sample code, we use this template to craft our prompt for the LLM along with the text extracted from the document using AmazonTextractPDFLoader and subsequently perform inference with the model:

from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

You are a helpful assistant. Please extract the following details from the document and format the output as JSON using the keys. Skip any preamble text and generate the final answer.

<details>
{details}
</details>

<keys>
{keys}
</keys>

<document>
{doc_text}
</document>

<final_answer>"""

details = "n".join([f"{key}: {value['description']}" for key, value in output_template.items()])
keys = "n".join([f"{key}" for key, value in output_template.items()])

prompt = PromptTemplate(template=template, input_variables=["details", "keys", "doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
output = llm_chain.run({"doc_text": full_text, "details": details, "keys": keys})

print(output)

{
  "doctor_name": "Mateo Jackson, Phd",
  "provider_id": "XA/7B/00338763", 
  "patient_name": "John Doe",
  … 
}

As you can see, the {keys} part of the prompt is the keys from our template, and the {details} are the keys along with their description. In this case, we don’t prompt the model explicitly with the format of the output other than specifying in the instruction to generate the output in JSON format. This works for the most part; however, because the output from LLMs is non-deterministic text generation, we want to specify the format explicitly as part of the instruction in the prompt. To solve this, we can use LangChain’s structured output parser module to take advantage of the automated prompt engineering that helps convert our template to a format instruction prompt. We use the template defined earlier to generate the format instruction prompt as follows:

from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

response_schemas = list()

for key, value in output_template.items():
    schema = ResponseSchema(name=key, description=value['description'], type=value['type'])
    response_schemas.append(schema)
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The format_instructions variable now holds the format instruction prompt:

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"doctor_name": string  // The doctor or provider's full name
	"provider_id": string  // The doctor or provider's ID
	"patient_name": string  // The patient's full name
        …
}
```

We then use this variable within our original prompt as an instruction to the LLM so that it extracts and formats the output in the desired schema by making a small modification to our prompt:

template = """

You are a helpful assistant. Please extract the following details from the document and strictly follow the instructions described in the format instructions to format the output. Skip any preamble text and generate the final answer. Do not generate an incomplete answer.

<details>
{details}
</details>

<format_instructions>
{format_instructions}
</format_instructions>

<document>
{doc_text}
</document>

<final_answer>"""

So far, we have only extracted the data out of the document in a desired schema. However, we still need to perform some standardization. For example, we want the patient’s admitted date and discharge date to be extracted in DD/MM/YYYY format. In this case, we augment the description of the key with the formatting instruction:

new_output_template= {   
    … 
    "admitted_date":{ "type": "string",  "description": "Date the patient was admitted to the hospital, this should be formatted in DD/MM/YYYY format." },
    "discharge_date":{ "type": "string",  "description": "Date the patient was discharged from the hospital, this should be formatted in DD/MM/YYYY format." 
    …
}

Refer to the Python notebook in GitHub for a full step-by-step walkthrough and explanation.

Spellchecks and corrections

LLMs have demonstrated remarkable abilities in understanding and generating human-like text. One of the lesser-discussed but immensely useful applications of LLMs is their potential in grammatical checks and sentence correction in documents. Unlike traditional grammar checkers that rely on a set of predefined rules, LLMs use patterns that they have identified from vast amounts of text data to determine what constitutes correct or fluent language. This means they can detect nuances, context, and subtleties that rule-based systems might miss.

Imagine the text extracted from a patient discharge summary that reads "Patient Jon Doe, who was admittd with sever pnemonia, has shown significant improvemnt and can be safely discharged. Followups are scheduled for nex week." A traditional spellchecker might recognize "admittd," "pnemonia," "improvemnt," and "nex" as errors, but without the surrounding context it could make further mistakes or offer only generic suggestions. An LLM, equipped with its extensive training, might suggest: "Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-ups are scheduled for next week."

The following is a poorly handwritten sample document with the same text as explained previously.

We extract the document with an Amazon Textract document loader and then instruct the LLM, via prompt engineering, to rectify the extracted text and correct any spelling or grammatical mistakes:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/hand_written_note.pdf")
document = loader.load()

template = """

Given a detailed 'Document', perform spelling and grammatical corrections. Ensure the output is coherent, polished, and free from errors. Skip any preamble text and give the answer.

<document>{doc_text}</document>
<answer>
"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")
llm_chain = LLMChain(prompt=prompt, llm=llm)

try:
    txt = document[0].page_content
    std_op = llm_chain.run({"doc_text": txt})
    
    print("Extracted text")
    print("==============")
    print(txt)

    print("nCorrected text")
    print("==============")
    print(std_op.strip())
    print("n")
except Exception as e:
    print(str(e))

The output of the preceding code shows the original text extracted by the document loader followed by the corrected text generated by the LLM:

Extracted text
==============
Patient John Doe, who was ad mitta with sever pnequonia, has shown Signif i art improumet & can be safely discharged. Follow w/s are scheduled for nen week.

Corrected text
==============
Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-up appointments are scheduled for next week.

Keep in mind that as powerful as LLMs are, it’s essential to view their suggestions as just that—suggestions. Although they capture the intricacies of language impressively well, they aren’t infallible. Some suggestions might change the intended meaning or tone of the original text. Therefore, it’s crucial for human reviewers to use LLM-generated corrections as a guide, not an absolute. The collaboration of human intuition with LLM capabilities promises a future where our written communication is not just error-free, but also richer and more nuanced.

Conclusion

Generative AI is changing how you can process documents with IDP to derive insights. In the post Enhancing AWS intelligent document processing with generative AI, we discussed the various stages of the pipeline and how AWS customer Ricoh is enhancing their IDP pipeline with LLMs. In this post, we discussed various mechanisms of augmenting the IDP workflow with LLMs via Amazon Bedrock, Amazon Textract, and the popular LangChain framework. You can get started with the new Amazon Textract document loader with LangChain today using the sample notebooks available in our GitHub repository. For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.


About the Authors

Sonali Sahu is leading intelligent document processing with the AI/ML services team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Read More

T-Mobile US, Inc. uses artificial intelligence through Amazon Transcribe and Amazon Translate to deliver voicemail in the language of their customers’ choice

This post is co-authored by Dhurjati Brahma, Senior Systems Architect at T-Mobile US, Inc and Jim Chao, Principal Engineer/Architect at T-Mobile US, Inc and Nicholas Zellerhoff Associate Systems Architect at T-Mobile US, Inc.

T-Mobile US, Inc. provides a Voicemail to Text service to its customers, which allows customers to quickly read through their voicemails and respond to and manage messages in any order without having to dial into their voicemailbox. This service is delivered by the T-Mobile Voicemail system and uses Amazon Transcribe to convert voicemail messages to text. In 2023, T-Mobile launched the Voicemail to Text Translate feature. Powered by Amazon Translate, this feature lets customers request voicemail transcriptions in their language of choice from the native Visual Voicemail application available on major Android manufacturers’ devices, beginning with flagship devices and all future devices available from major device partners.

Native Visual Voicemail dialer featuring the Voicemail to Text Translate feature (voicemail settings, voicemail translation options, and visual voicemail screens)

The history

Two years ago, in 2021, T-Mobile engineering teams partnered with AWS to launch a new AI-powered feature called Voicemail to Text with automatic language detection, and to improve quality and performance for customers. Voicemail to Text provided customers with the additional benefit of receiving voicemail transcriptions at no extra cost. Voicemail to Text is available in 39 different spoken languages and dialects and uses auto-language detection provided by Amazon Transcribe. These languages include English and Spanish, as well as many European, Middle Eastern, Asian, and African languages. The full list of supported languages can be found in the Amazon Transcribe documentation. Since its introduction, usage of the Voicemail to Text service has grown, and it transcribes 126 million voicemail messages per month as of July 2023. T-Mobile has partnered with AWS to analyze the key application metrics of this service, such as language distribution, the number of messages per language, the daily total and unique active customers, and so on. This data helped in scaling the service and making key business decisions to improve the Voicemail to Text customer experience.

The challenge

Upon analysis of the weekly language distribution of voicemail transcriptions, T-Mobile observed that approximately 10 percent of voicemail transcriptions were received in a language other than US English, with Spanish being the predominant language.

Weekly language distribution for all languages

Weekly language distribution excluding US English and US Spanish

In addition, market research based on U.S. Census Bureau data showed that approximately 22 percent of the US population spoke a language other than English. This showed a need for a Voicemail to Text translation feature that could bridge this language gap and drove T-Mobile and AWS teams to brainstorm a solution. The idea was to give T-Mobile’s Voicemail to Text customers an option to receive voicemail transcriptions in the language of their choice delivered through SMS, email, or through the Visual Voicemail (VVM) application.

The solution

T-Mobile decided to use Amazon Translate, a neural machine translation (MT) service, to augment their existing Voicemail to Text service with real-time text translation between supported languages, because Amazon Translate provides the high-quality translations the industry requires. T-Mobile already had their voicemail system connected to AWS through a private AWS Direct Connect link and was using the Amazon Transcribe API to get transcriptions. Following the same design pattern, T-Mobile added an integration with the Amazon Translate API to translate voicemail transcripts from the source language detected by Amazon Transcribe to the customers' preferred language.
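
The following is a minimal sketch of that call pattern using boto3. The transcript text, language codes, and Region are placeholders for illustration, not T-Mobile's production code:

import boto3

# Placeholders for illustration: transcript text, language codes, and Region
translate = boto3.client("translate", region_name="us-west-2")

transcript_text = "Hola, te llamo para confirmar nuestra cita de mañana."
source_language = "es"   # language detected by Amazon Transcribe
target_language = "en"   # the customer's preferred language

response = translate.translate_text(
    Text=transcript_text,
    SourceLanguageCode=source_language,
    TargetLanguageCode=target_language,
)
print(response["TranslatedText"])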

Here is a high-level architecture diagram that illustrates the T-Mobile Voicemail to Text Translate solution.

Scope of solution

From a customer perspective, to enable the Visual Voicemail Translate feature, a customer needs a Voicemail to Text feature service operator code (SOC) enabled on their mobile plan and should have one of the supported major Android manufacturer devices with the Translate feature API enabled. The customer can then visit the Visual Voicemail settings page to select a language from a list of 75 different languages and dialects supported by Amazon Translate. This will enable customers to receive voicemail transcription in the supported language of their choice.

The results

With Amazon Translate, T-Mobile was able to deliver a delightful new customer experience that accommodates the language preference of its customers and makes voicemail more accessible to people who speak various languages. This new capability helps to break language barriers by making it easier for speakers of different languages to communicate.

Conclusion

By using Amazon Transcribe and Amazon Translate language AI services, T-Mobile was able to enhance its voicemail service by delivering message transcriptions in a language that customers can understand. By choosing to use AWS managed AI services, T-Mobile was able to expedite the delivery of this new customer experience and avoid operational burdens of maintaining additional servers and software in their data centers. With Amazon Transcribe and Amazon Translate, Voicemail to Text and Voicemail to Text Translate services are delivered with low latency and high accuracy.

For more information, check out Getting started with Amazon Translate and Getting started with Amazon Transcribe to explore how you can use these services with your applications. Follow the Artificial Intelligence category on AWS Machine Learning Blog to stay up to date with new capabilities and use cases for various AWS AI services.


About the Authors

Dhurjati Brahma is a Senior Systems Architect at T-Mobile US, Inc. He has more than 15 years of experience in designing, building, and managing robust and scalable virtualized messaging solutions within T-Mobile's network. He is passionate about collaborating with various cross-functional teams and vendors to securely integrate T-Mobile's messaging systems with the public cloud to launch exciting new products and services for T-Mobile's customers. He holds a Master's degree in Electrical Engineering from the University of Alabama at Birmingham. Outside of work, he enjoys going on hikes, listening to classical music, practicing meditation, and spending time with his family and friends.

Jim Chao is a Principal Engineer/Architect for Core Messaging Service Design at T-Mobile US, Inc. He has more than two decades of experience in the design and architecture of mobile messaging service systems and platforms. Lately, he has been dedicating his time to the next generation of messaging services using machine learning and generative AI. He holds a Master's degree in Computer Information Systems. Outside of work, he spends time with his family, engages in religious study and practice, and travels to religious sites above 5,000 meters in the mountains of Tibet.

Nicholas Zellerhoff is an Associate Systems Architect for T-Mobile as part of the Service Technology Development team functioning as lead development engineer for Native Visual Voicemail services. When not in office, Nick enjoys everything outdoors, from backyard BBQs with friends to backcountry hiking in the North Cascades.

Alex Bulatkin is a solutions architect at AWS. He enjoys helping communication service providers build innovative solutions in AWS that are redefining the telecom industry. He is passionate about working with customers on bringing the power of AWS AI services into their applications. Alex is based in the Denver metropolitan area and likes to hike, ski, and snowboard.

Prabhakaran Balasubramaniam is a Principal Customer Solutions Manager at AWS. He loves to help telecom customers leverage new technologies to solve their problems. Prabhakaran is based in the Dallas-Fort Worth area and likes sports.

Read More

On Razer’s Edge: VFX Star Surfaced Studio Creates Stunning Sci-Fi World This Week ‘In The NVIDIA Studio’

Visual effects artist Surfaced Studio returns to In the NVIDIA Studio to share his real-world VFX project, created on a brand new Razer Blade 16 Mercury Edition laptop powered by GeForce RTX 4080 graphics.

Surfaced Studio creates photorealistic, digitally generated imagery that seamlessly integrates visual effects into short films, television and console gaming.

He found inspiration for a recent sci-fi project by experimenting with 3D transitions: using a laptop screen as a gateway between worlds, like the portals from Dr. Strange or the transitions from The Matrix.

Break the Rules and Become a Hero

Surfaced Studio aimed to create an immersive experience with his latest project.

“I wanted to get my audience to feel surprised getting ‘sucked into’ the 3D world,” he explained.

Surfaced Studio began with a simple script, alongside sketches of brainstormed ideas and played out shots. “This usually helps me think through how I’d pull each effect off and whether they’re actually possible,” he said.

From there, he shot video and imported the footage into Adobe Premiere Pro for a rough test edit. Then, Surfaced Studio selected the most suitable clips for use.

He cleaned up the footage in Adobe After Effects, stabilizing shots with the Warp Stabilizer tool and removing distracting background elements with the Mocha Pro tool. Both effects were accelerated by his GeForce RTX 4080 Laptop GPU.

After, he created a high-contrast version of the shot for 3D motion tracking in Blender.

3D motion tracking in Blender.

Motion tracking is used to apply tracking data to 3D objects. “This was pretty tricky, as it’s a 16-second gimbal shot with fast moving sections and a decent camera blur,” said Surfaced Studio. “It took me a good few days to get a decent track and fix issues with manual keyframes and ‘patches’ between different sections.”

A gimbal shot uses sensors and motors to stabilize and support the camera.

Surfaced Studio exported footage of the animated camera into a 3D FBX file to use in Unreal Engine and set it up in the Cyberpunk High City pack, which contains a modular constructor for creating highly detailed sci-fi city streets, alleys and blocks.

“I’m not much of a 3D artist so using [the Cyberpunk High City pack] was the best option to complete the project on this side of the century,” the artist said. He then made modifications to the cityscape, reducing flickering lights and adding buildings, custom fog and Razer and NVIDIA Studio banners. He even added a billboard with an ad encouraging kindness to cats. “It’s so off to the side of most shots I doubt anyone actually noticed,” noted a satisfied Surfaced Studio.

A PSA from Surfaced Studio: be nice to cats.

Learning 3D effects can seem overwhelming due to the vast knowledge needed across multiple apps and distinct workflows. But Surfaced Studio stresses the importance of first understanding workflow hierarchies, and how one feeds into another, as an approachable entry point to choosing a specialty suited to a creator's unique passion and natural talent.

Surfaced Studio was able to seamlessly run his scene in Unreal Engine at full 4K resolution, with all textures and materials loading at maximum graphical fidelity, thanks to the GeForce RTX 4080 Laptop GPU in his Razer Blade 16. The graphics card also supports NVIDIA DLSS, which increases viewport interactivity by using AI to upscale frames rendered at lower resolution while retaining high-fidelity detail.

Moving virtual objects in Unreal Engine.

Surfaced Studio then took the FBX file with the exported camera tracking data into Unreal Engine, matching his "3D camera" with the real-world one used to film the laptop. "This was the crucial step in creating the 'look-through' effect I wanted," he said.

Once satisfied with the look, Surfaced Studio exported all sequences from Unreal Engine as multilayer EXR files — including a Z-depth pass, a grayscale value range to create a depth-of-field effect — to separate visual elements from the 3D footage.

Composite work in Adobe After Effects.

Surfaced Studio went back to After Effects for the final composites. He added distortion effects and some glow for the transition from the physical screen to the 3D world.

Cleaning up screen tracking in Adobe After Effects.

Then, Surfaced Studio again used the Z-depth pass to extract the 3D cars and overlay them onto the real footage.

Composite work in Adobe After Effects.

He exported the final project into Premiere Pro and added sound effects, music and a few color correction edits.

Final edits in Adobe Premiere Pro.

With the GeForce RTX 4080 dual encoders, Surfaced Studio nearly halved Adobe Premiere Pro video decoding and encoding export times. Surfaced Studio has been using NVIDIA GPUs for over a decade, citing their widespread integration with commonly used tools.

“NVIDIA has simply done a better job than its competitors to reach out to and integrate with other companies that create creative apps,” said Surfaced Studio. “CUDA and RTX are widespread technologies that you find in most popular creative apps to accelerate workflows.”

When he’s not working on VFX projects, Surfaced Studio also uses his laptop to game. The Razer Blade 16 has the first dual-mode mini-LED display with two native resolutions: UHD+ at 120Hz — suited for VFX workflows — and FHD at 240Hz — ideal for gamers (or creators who like gaming).

Powerful, elegant, beautiful: the Razer Blade 16 Mercury Edition.

For a limited time, gamers and creators can get the critically acclaimed game Alan Wake 2 with the purchase of the Razer Blade 16 powered by GeForce RTX 40 Series graphics cards.

Surfaced Studio’s VFX tutorials are available on YouTube, where he covers filmmaking, VFX and 3D techniques using Adobe After Effects, Blender, Photoshop, Premiere Pro and other apps.

VFX artist Surfaced Studio.

Join the #SeasonalArtChallenge

Don’t forget to join the #SeasonalArtChallenge by submitting spooky Halloween-inspired art in October and harvest- and fall-themed pieces in November.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

From text to dream job: Building an NLP-based job recommender at Talent.com with Amazon SageMaker

This post is co-authored by Anatoly Khomenko, Machine Learning Engineer, and Abdenour Bezzouh, Chief Technology Officer at Talent.com.

Founded in 2011, Talent.com is one of the world’s largest sources of employment. The company combines paid job listings from their clients with public job listings into a single searchable platform. With over 30 million jobs listed in more than 75 countries, Talent.com serves jobs across many languages, industries, and distribution channels. The result is a platform that matches millions of job seekers with available jobs.

Talent.com's mission is to centralize all jobs available on the web to help job seekers find their best match while providing them with the best search experience. Its focus is on relevancy, because the order of the recommended jobs is vitally important to show the jobs most pertinent to users' interests. The performance of Talent.com's matching algorithm is paramount to the success of the business and a key contributor to their users' experience. It's challenging to predict which jobs are pertinent to a job seeker based on the limited amount of information provided, usually just a few keywords and a location.

Given this mission, Talent.com and AWS joined forces to create a job recommendation engine using state-of-the-art natural language processing (NLP) and deep learning model training techniques with Amazon SageMaker to provide an unrivaled experience for job seekers. This post shows our joint approach to designing a job recommendation system, including feature engineering, deep learning model architecture design, hyperparameter optimization, and model evaluation that ensures the reliability and effectiveness of our solution for both job seekers and employers. The system is developed by a team of dedicated applied machine learning (ML) scientists, ML engineers, and subject matter experts in collaboration between AWS and Talent.com.

The recommendation system has driven an 8.6% increase in clickthrough rate (CTR) in online A/B testing against a previous XGBoost-based solution, helping connect millions of Talent.com’s users to better jobs.

Overview of solution

An overview of the system is illustrated in the following figure. The system takes a user’s search query as input and outputs a ranked list of jobs in order of pertinence. Job pertinence is measured by the click probability (the probability of a job seeker clicking on a job for more information).

The system includes four main components:

  • Model architecture – The core of this job recommendation engine is a deep learning-based Triple Tower Pointwise model, which includes a query encoder that encodes user search queries, a document encoder that encodes the job descriptions, and an interaction encoder that processes the past user-job interaction features. The outputs of the three towers are concatenated and passed through a classification head to predict the job’s click probabilities. By training this model on search queries, job specifics, and historical user interaction data from Talent.com, this system provides personalized and highly relevant job recommendations to job seekers.
  • Feature engineering – We perform two sets of feature engineering to extract valuable information from input data and feed it into the corresponding towers in the model. The two sets are standard feature engineering and fine-tuned Sentence-BERT (SBERT) embeddings. We use the standard engineered features as input into the interaction encoder and feed the SBERT derived embedding into the query encoder and document encoder.
  • Model optimization and tuning – We utilize advanced training methodologies to train, test, and deploy the system with SageMaker. This includes SageMaker Distributed Data Parallel (DDP) training, SageMaker Automatic Model Tuning (AMT), learning rate scheduling, and early stopping to improve model performance and training speed. Using the DDP training framework helped speed up our model training by approximately eight times.
  • Model evaluation – We conduct both offline and online evaluation. We evaluate the model performance with Area Under the Curve (AUC) and Mean Average Precision at K (mAP@K) in offline evaluation. During online A/B testing, we evaluate the CTR improvements.

In the following sections, we present the details of these four components.

Deep learning model architecture design

We design a Triple Tower Deep Pointwise (TTDP) model using a triple-tower deep learning architecture and the pointwise pair modeling approach. The triple-tower architecture provides three parallel deep neural networks, with each tower processing a set of features independently. This design pattern allows the model to learn distinct representations from different sources of information. After the representations from all three towers are obtained, they are concatenated and passed through a classification head to make the final prediction (0–1) on the click probability (a pointwise modeling setup).

The three towers are named based on the information they process: the query encoder processes the user search query, the document encoder processes the candidate job’s textual content, including the job title and company name, and the interaction encoder uses relevant features extracted from past user interactions and history (discussed more in the next section).

Each of these towers plays a crucial role in learning how to recommend jobs:

  • Query encoder – The query encoder takes in the SBERT embeddings derived from the user’s job search query. We enhance the embeddings through an SBERT model we fine-tuned. This encoder processes and understands the user’s job search intent, including details and nuances captured by our domain-specific embeddings.
  • Document encoder – The document encoder processes the information of each job listing. Specifically, it takes the SBERT embeddings of the concatenated text from the job title and company. The intuition is that users will be more interested in candidate jobs that are more relevant to the search query. By mapping the jobs and the search queries to the same vector space (defined by SBERT), the model can learn to predict the probability of the potential jobs a job seeker will click.
  • Interaction encoder – The interaction encoder deals with the user’s past interactions with job listings. The features are produced via a standard feature engineering step, which includes calculating popularity metrics for job roles and companies, establishing context similarity scores, and extracting interaction parameters from previous user engagements. It also processes the named entities identified in the job title and search queries with a pre-trained named entity recognition (NER) model.

Each tower generates an independent output in parallel, all of which are then concatenated together. This combined feature vector is then passed to predict the click probability of a job listing for a user query. The triple-tower architecture provides flexibility in capturing complex relationships between different inputs or features, allowing the model to take advantage of the strengths of each tower while learning more expressive representations for the given task.

Candidate jobs’ predicted click probabilities are ranked from high to low, generating personalized job recommendations. Through this process, we ensure that each piece of information—whether it’s the user’s search intent, job listing details, or past interactions—is fully captured by a specific tower dedicated to it. The complex relationships between them are also captured through the combination of the tower outputs.
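To make this concrete, the following is a minimal PyTorch sketch of a triple-tower pointwise model. The layer sizes, embedding dimensions, and module names are illustrative assumptions, not the production configuration.

import torch
import torch.nn as nn

# Assumed input sizes: SBERT embeddings for query/document, engineered interaction features
QUERY_DIM, DOC_DIM, INTERACTION_DIM, HIDDEN = 768, 768, 64, 256

class Tower(nn.Module):
    """One parallel encoder that learns an independent representation of its input."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ReLU(),
                                 nn.Linear(HIDDEN, out_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class TripleTowerPointwise(nn.Module):
    def __init__(self):
        super().__init__()
        self.query_encoder = Tower(QUERY_DIM, 128)               # SBERT query embeddings
        self.document_encoder = Tower(DOC_DIM, 128)              # SBERT job title + company embeddings
        self.interaction_encoder = Tower(INTERACTION_DIM, 128)   # engineered interaction features
        self.classification_head = nn.Sequential(
            nn.Linear(128 * 3, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))

    def forward(self, query_emb, doc_emb, interaction_feats):
        # Concatenate the three tower outputs and predict the click probability (pointwise setup)
        z = torch.cat([self.query_encoder(query_emb),
                       self.document_encoder(doc_emb),
                       self.interaction_encoder(interaction_feats)], dim=-1)
        return torch.sigmoid(self.classification_head(z))

model = TripleTowerPointwise()
click_prob = model(torch.randn(4, QUERY_DIM), torch.randn(4, DOC_DIM), torch.randn(4, INTERACTION_DIM))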

Feature engineering

We perform two sets of feature engineering processes to extract valuable information from the raw data and feed it into the corresponding towers in the model: standard feature engineering and fine-tuned SBERT embeddings.

Standard feature engineering

Our data preparation process begins with standard feature engineering. Overall, we define four types of features:

  • Popularity – We calculate popularity scores at the individual job level, occupation level, and company level. This provides a metric of how attractive a particular job or company might be.
  • Textual similarity – To understand the contextual relationship between different textual elements, we compute similarity scores, including string similarity between the search query and the job title. This helps us gauge the relevance of a job opening to a job seeker’s search or application history.
  • Interaction – In addition, we extract interaction features from past user engagements with job listings. A prime example of this is the embedding similarity between past clicked job titles and candidate job titles. This measure helps us understand the similarity between previous jobs a user has shown interest in vs. upcoming job opportunities. This enhances the precision of our job recommendation engine.
  • Profile – Lastly, we extract user-defined job interest information from the user profile and compare it with new job candidates. This helps us understand if a job candidate matches a user’s interest.

A crucial step in our data preparation is the application of a pre-trained NER model. By implementing an NER model, we can identify and label named entities within job titles and search queries. Consequently, this allows us to compute similarity scores between these identified entities, providing a more focused and context-aware measure of relatedness. This methodology reduces the noise in our data and gives us a more nuanced, context-sensitive method of comparing jobs.
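As an illustration of this step, the following sketch extracts entities with a general-purpose spaCy model and compares them with SBERT cosine similarity. The model names and example texts are assumptions for illustration only; the production system uses a pre-trained NER model tailored to job data.

import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")             # assumed general-purpose NER model (requires download)
sbert = SentenceTransformer("all-mpnet-base-v2")

def entity_similarity(search_query: str, job_title: str) -> float:
    """Cosine similarity between the named entities found in a search query and a job title."""
    query_ents = [ent.text for ent in nlp(search_query).ents] or [search_query]
    title_ents = [ent.text for ent in nlp(job_title).ents] or [job_title]
    q_emb = sbert.encode(" ".join(query_ents))
    t_emb = sbert.encode(" ".join(title_ents))
    return float(util.cos_sim(q_emb, t_emb))

print(entity_similarity("senior data scientist at a bank", "Senior Data Scientist - Retail Banking"))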

Fine-tuned SBERT embeddings

To enhance the relevance and accuracy of our job recommendation system, we use SBERT, a powerful transformer-based model known for its proficiency in capturing semantic meaning and context from text. However, generic SBERT embeddings, although effective, may not fully capture the unique nuances and terminology inherent in a specific domain such as ours, which centers on employment and job searches. To overcome this, we fine-tune the SBERT embeddings using our domain-specific data. This fine-tuning process optimizes the model to better understand and process industry-specific language, jargon, and context, making the embeddings more reflective of our domain. As a result, the refined embeddings offer improved performance in capturing both semantic and contextual information within our sphere, leading to more accurate and meaningful job recommendations for our users.

The following figure illustrates the SBERT fine-tuning step.

We fine-tune SBERT embeddings using TripletLoss with a cosine distance metric that learns text embeddings where anchor and positive texts have a higher cosine similarity than anchor and negative texts. We use users’ search queries as anchor texts. We combine job titles and employer names as inputs to the positive and negative texts. The positive texts are sampled from job postings that the corresponding user clicked on, whereas the negative texts are sampled from job postings that the user did not click on. The following is a sample implementation of the fine-tuning procedure:

import math
from datetime import datetime

from torch.utils.data import DataLoader
from sentence_transformers import (SentenceTransformer, SentencesDataset,
                                   LoggingHandler, losses)
from sentence_transformers.readers import InputExample

model_name = 'all-mpnet-base-v2'
train_batch_size = 16
num_epochs = 1
model_save_path = (f'output/{model_name}_'+
                   datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))

### load pre-trained SBERT model
model = SentenceTransformer(model_name, device="cuda")

### construct the training dataset of triplet texts,
### stored in three lists (anchors, positives, negatives)
train_examples = []
for anchor, positive, negative in zip(anchors, positives, negatives):
    train_examples.append(InputExample(texts=(anchor, positive, negative)))

train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True,
                              batch_size=train_batch_size)

### use TripletLoss with cosine distance metric and margin=0.5
distance_metric=losses.TripletDistanceMetric.COSINE
train_loss = losses.TripletLoss(model=model,
                                distance_metric=distance_metric,
                                triplet_margin=0.5)

### 10% of train data for warm-up
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)

# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
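After fine-tuning, the saved model can embed search queries and job texts (title plus employer name) into the same vector space. The following is a minimal usage sketch that continues from the snippet above; the query and job texts are illustrative.

from sentence_transformers import util

### load the fine-tuned model saved by model.fit() above
tuned_model = SentenceTransformer(model_save_path)

query_emb = tuned_model.encode("remote python developer")
job_emb = tuned_model.encode("Senior Python Developer | Acme Corp")   # job title + employer name (illustrative)

### higher cosine similarity indicates a job more relevant to the search query
print(util.cos_sim(query_emb, job_emb))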

Model training with SageMaker Distributed Data Parallel

We use SageMaker Distributed Data Parallel (SMDDP), a feature of the SageMaker ML platform that is built on top of PyTorch DDP. It provides an optimized environment for running PyTorch DDP training jobs on the SageMaker platform and is designed to significantly speed up deep learning model training. It accomplishes this by splitting a large dataset into smaller chunks and distributing them across multiple GPUs. The model is replicated on every GPU. Each GPU processes its assigned data independently, and the results are collated and synchronized across all GPUs. DDP takes care of gradient communication to keep model replicas synchronized and overlaps gradient communication with gradient computation to speed up training.

SMDDP utilizes an optimized AllReduce algorithm to minimize communication between GPUs, reducing synchronization time and improving overall training speed. The algorithm adapts to different network conditions, making it highly efficient for both on-premises and cloud-based environments. In the SMDDP architecture (as shown in the following figure), distributed training is also scaled using a cluster of many nodes. This means not just multiple GPUs in a computing instance, but many instances with multiple GPUs, which further speeds up training.

For more information about this architecture, refer to Introduction to SageMaker’s Distributed Data Parallel Library.

With SMDDP, we have been able to substantially reduce the training time for our TTDP model, making it eight times faster. Faster training times mean we can iterate and improve our models more quickly, leading to better job recommendations for our users in a shorter amount of time. This efficiency gain is instrumental in maintaining the competitiveness of our job recommendation engine in a fast-evolving job market.

You can adapt your training script to use SMDDP with only three lines of code, as shown in the following code block. Using PyTorch as an example, the only thing you need to do is import the SMDDP library’s PyTorch client (smdistributed.dataparallel.torch.torch_smddp). The client registers smddp as a backend for PyTorch.

### importing the SMDDP client registers "smddp" as a PyTorch process group backend
import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist

### initialize the process group with the smddp backend (in place of nccl)
dist.init_process_group(backend='smddp')

After you have a working PyTorch script that is adapted to use the distributed data parallel library, you can launch a distributed training job using the SageMaker Python SDK.
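For example, the adapted script can be launched with the SageMaker Python SDK’s PyTorch estimator by enabling the data parallel distribution option. The entry point, role, instance type, framework versions, and S3 path below are placeholders, not the exact settings we used.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_ttdp.py",          # placeholder training script adapted for SMDDP
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",   # placeholder execution role
    instance_count=2,                     # multiple instances, each with multiple GPUs
    instance_type="ml.p4d.24xlarge",      # a multi-GPU instance type supported by SMDDP
    framework_version="1.12",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"train": "s3://my-bucket/ttdp-train/"})   # placeholder S3 training channel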

Evaluating model performance

When evaluating the performance of a recommendation system, it’s crucial to choose metrics that align closely with business goals and provide a clear understanding of the model’s effectiveness. In our case, we use the AUC to evaluate our TTDP model’s job click prediction performance and the mAP@K to assess the quality of the final ranked jobs list.

The AUC refers to the area under the receiver operating characteristic (ROC) curve. It represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. It ranges from 0 to 1, where 1 indicates an ideal classifier and 0.5 represents a random guess. mAP@K is a metric commonly used to assess the quality of information retrieval systems, such as our job recommendation engine. It measures the average precision of retrieving the top K relevant items for a given query or user. It ranges from 0 to 1, with 1 indicating optimal ranking and 0 indicating the lowest possible precision at the given K value. We evaluate the AUC, mAP@1, and mAP@3. Collectively, these metrics allow us to gauge the model’s ability to distinguish between positive and negative classes (AUC) and its success at ranking the most relevant items at the top (mAP@K).
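For reference, the following sketch shows one way these offline metrics can be computed, using scikit-learn for AUC and a simple implementation of mAP@K. The labels and scores are illustrative, not our evaluation data.

import numpy as np
from sklearn.metrics import roc_auc_score

def average_precision_at_k(ranked_relevance, k):
    """AP@K for one query; ranked_relevance is a 0/1 list ordered by descending predicted score."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / max(min(sum(ranked_relevance), k), 1)

def mean_ap_at_k(relevance_per_query, k):
    return float(np.mean([average_precision_at_k(r, k) for r in relevance_per_query]))

# Illustrative click labels and predicted click probabilities
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.65]
print("AUC:", roc_auc_score(y_true, y_score))
print("mAP@3:", mean_ap_at_k([[1, 1, 0], [0, 1, 1]], k=3))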

Based on our offline evaluation, the TTDP model outperformed the baseline model—the existing XGBoost-based production model—by 16.65% for AUC, 20% for mAP@1, and 11.82% for mAP@3.

Furthermore, we designed an online A/B test to evaluate the proposed system and ran the test on a percentage of the US email population for 6 weeks. In total, approximately 22 million emails were sent using the jobs recommended by the new system. The resulting uplift in clicks compared to the previous production model was 8.6%. Talent.com is gradually increasing the percentage to roll out the new system to its complete population and channels.

Conclusion

Creating a job recommendation system is a complex endeavor. Each job seeker has unique needs, preferences, and professional experiences that can’t be inferred from a short search query. In this post, Talent.com collaborated with AWS to develop an end-to-end deep learning-based job recommender solution that ranks lists of jobs to recommend to users. The Talent.com team truly enjoyed collaborating with the AWS team throughout the process of solving this problem. This marks a significant milestone in Talent.com’s transformative journey, as the team takes advantage of the power of deep learning to empower its business.

In this project, we fine-tuned SBERT to generate text embeddings. At the time of writing, AWS introduced Amazon Titan Embeddings as part of the foundation models (FMs) offered through Amazon Bedrock, a fully managed service providing a selection of high-performing FMs from leading AI companies. We encourage readers to explore the machine learning techniques presented in this post and to use the capabilities provided by AWS, such as SMDDP, along with the Amazon Bedrock FMs, to build their own search functionalities.

About the authors

Yi Xiang is an Applied Scientist II at the Amazon Machine Learning Solutions Lab, where she helps AWS customers across different industries accelerate their AI and cloud adoption.

Tong Wang is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Dmitriy Bespalov is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Anatoly Khomenko is a Senior Machine Learning Engineer at Talent.com with a passion for applying natural language processing to match good people to good jobs.

Abdenour Bezzouh is an executive with more than 25 years of experience building and delivering technology solutions that scale to millions of customers. Abdenour held the position of Chief Technology Officer (CTO) at Talent.com when the AWS team designed and executed this particular solution for Talent.com.

Dale Jacques is a Senior AI Strategist within the Generative AI Innovation Center where he helps AWS customers translate business problems into AI solutions.

Yanjun Qi is a Senior Applied Science Manager at the Amazon Machine Learning Solutions Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.


Street View to the Rescue: Deep Learning Paves the Way to Safer Buildings


Images such as those in Google Street View are taking on a new purpose in the hands of University of Florida Assistant Professor of Artificial Intelligence Chaofeng Wang.

He’s using them, along with deep learning, in a research project to automate the evaluation of urban buildings. The project aims to help governments mitigate natural disaster damage by providing the information needed for decision-makers to bolster building structures or perform post-disaster recovery.

After a natural disaster such as an earthquake, local governments send teams to check and evaluate building conditions. Done manually, this can take months to cover a city’s full building stock.

Wang’s project uses AI to accelerate the evaluation process — cutting the time needed to a few hours. The AI model is trained using images sourced from Google Street View and local governments to assign scores to buildings based on Federal Emergency Management Agency (FEMA) P-154 standards, which provide assessment guidelines based on factors like wall material, structure type, building age and more. Wang also collaborated with the World Bank Global Program for Resilient Housing to collect images and perform annotations, which were used to improve the model.

The collected images are placed in a data repository. The AI model reads the repository and performs inference on the images — a process accelerated by NVIDIA DGX A100 systems.

“Without NVIDIA GPUs, we wouldn’t have been able to do this,” Wang said. “They significantly accelerate the process, ensuring timely results.”

Wang used the DGX A100 nodes in the University of Florida’s supercomputer, HiPerGator. HiPerGator is one of the world’s fastest AI supercomputers in academia, delivering 700 petaflops of AI performance, and was built with the support of NVIDIA founder and UF alumnus Chris Malachowsky and hardware, software, training and services from NVIDIA.

The AI model’s output is compiled into a database that feeds into a web portal, which shows information — including the safety assessment score, building type and even roof or wall material — in a map-based format.

Wang’s work was funded by the NVIDIA Applied Research Accelerator Program, which supports research projects that have the potential to make a real-world impact through the deployment of NVIDIA-accelerated applications adopted by commercial and government organizations.

A Helping Eye

Wang says that the portal can serve different needs depending on the use case. To prepare for a natural disaster, a government can use predictions solely from street view images.

“Those are static images — one example is Google Street View images, which get updated every several years,” he said. “But that’s good enough for collecting information and getting a general understanding about certain statistics.”

But for rural areas or developing regions, where such images aren’t available or aren’t frequently updated, governments can collect the images themselves. With NVIDIA GPUs powering inference, building assessments can be delivered quickly, accelerating analyses.

Wang also suggests that with enough refinement, his research could also create ripples for the urban planning and insurance industries.

The project is currently being tested by a few local governments in Mexico and is garnering interest in some African, Asian and South American countries. At its current state, it can achieve over 85% accuracy in its assessment scores, per FEMA P-154 standards.

Survey of the Land

One challenge Wang cites is the variation in urban landscapes in different countries. Different regions have their own cultural and architectural styles. Not trained on a large or diverse enough pool of images, the AI model could be thrown off by factors like paint color when performing wall material analysis. Another challenge is urban density variation.

“It is a very general limitation of current AI technology,” Wang said. “In order to be useful, it requires enough training data to represent the distribution of the real world, so we’re putting efforts into the data collection process to solve the generalization issue.”

To overcome this challenge, Wang aims to train and test the model for more cities. So far, he’s tested about eight cities in different countries.

“We need to generate more detailed and high-quality annotations to train the model with,” he said. “That is the way we can improve the model in the future so that it can be used more widely.”

Wang’s goal is to get the project to a point where it can be deployed as a service for more general industry use.

“We are creating application programming interfaces that can estimate and analyze buildings and households to allow seamless integration with other products,” he said. “We are also building a user-friendly application that all government agencies and organizations can use.”


Abstracts: October 23, 2023


Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Andy Gordon, a Partner Research Manager, and Carina Negreanu, a Senior Researcher, both at Microsoft Research, join host Dr. Gretchen Huizinga to discuss “Co-audit: Tools to help humans double-check AI-generated content.” This paper brings together current understanding of generative AI performance to explore the need and context for tools to help people using the technology find and fix mistakes in AI output.

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. Today, I’m talking to Dr. Andy Gordon, a Partner Research Manager, and Dr. Carina Negreanu, a Senior Researcher, both at Microsoft Research. Doctors Gordon and Negreanu are co-editors of a paper called “Co-audit: Tools to help humans double-check AI-generated content,” and you can read a preprint of this paper now on arXiv. Andy Gordon, Carina Negreanu, thanks for joining us on Abstracts!


ANDY GORDON: Great to be here.

CARINA NEGREANU: Likewise.

HUIZINGA: Let’s start with you, Andy. In a few sentences, describe the issue or problem your paper addresses and why people should care about it.

GORDON: Well, generative AI is amazing. Things like Bing Chat or ChatGPT, all these things powered by large language models. Totally amazing. But it’s really important for everyone to remember that these AIs can make mistakes. For example, you ask when your favorite actor got married, and the model says the year but gets it wrong. Or you ask for some Python code, and it works on positive numbers, but occasionally you give it negative numbers and it goes wrong. Another example, you get a summary of some text. It’s great but unfortunately misses one of the important points. Or thinking about images, you ask for a portrait of a character from the AI and there’s some glitch, and it produces a hand with six fingers. So as users, we need to get into the habit of carefully checking AI outputs for mistakes. And we refer to that as “audit” in a sense of a systematic review. Coming to the paper, it’s about what we call co-audit. And that’s our term for any tool support that helps the human audit the AI output. And some examples of co-audit are tools that can help check for hallucinations, like when the actor’s date of birth is wrong, or to check Python code to find some errors or show how a summary has been constructed to help people find errors.

HUIZINGA: Carina, let’s talk to you. What related research does this paper build on, and how does your work add to it?

NEGREANU: So there was no direct work on the co-audit brand before us. We’re just introducing it. But there has been a lot of research that either motivates the need for co-audit or provides relevant framing for it or even like early examples of what we start thinking of co-audit. So as you’re probably aware, there has been a really great effort in the last years to assess the quality of generations by large language models across a multitude, really, of tasks. And currently we use this body of work as motivation for our research. It basically shows there really is a need for this kind of work. And we hope that in the future, we can also use it to benchmark co-audit tools that we are going to produce in our wider community. But the idea of dealing with errors has been a key part of research on human-AI interaction for ages. And there have been some really cool guidelines that came out recently, especially from Amershi in 2019, on human-AI interactions that are concerned with this part of the world. And more recently, Glassman had a really cool paper about conversational frameworks for human-AI and communication and basically links these concepts to psychology. And in our work, as you can read in our paper, we are trying to basically frame co-audit within her framework, and we find that it’s a natural fit. But before we started defining formally co-audit and building this paper, our group has built co-audit tools in the co-generation space. One such tool is GAM, which is grounded abstraction matching, where we basically help users learn how to effectively communicate with large language models so that they both understand what the large language model understands they’re asking and also get good feedback back. We also built ColDeco, which is a spreadsheet tool for inspecting and verifying calculated columns without the user requiring to view the underlying code produced by the large language models. But really, any tool that focuses on debugging or basically getting information back from human-generated content is useful here. So even tools that are like early debugging tools like FxD are very important here as we learn how people use these kinds of tools and we try to basically apply the same concepts in the context of LLM-generated content. So basically, we are building on top of work that helps understand the needs and challenges that end-user programmers have when working in this space and trying to extrapolate them to co-auditing tools for LLM-generated content.

HUIZINGA: Well, Andy, how would you describe the research approach you used or your methodology for this paper, and how did it come about?

GORDON: Great question, Gretchen, and it was actually quite an unusual methodology for us. So as Carina says, we’ve been looking at co-audit in a very specific setting of spreadsheet computations, and we began to realize that co-audit was really important for any kind of AI-generated output, and we started to see other people doing research that was doing the same sort of thing we were doing but in different settings. So, for example, there was a paper, they were generating bits of Python and they were deliberately showing multiple pieces of code after they’d been generated to kind of nudge the human user to make a decision about which one was better. I mean that’s, it’s really important to get people to think about the outputs, and this was a nice trick. So we thought, look, this is actually quite an important problem, and MSR (Microsoft Research) should step up and sort of gather people. So we organized a workshop inside Microsoft in the spring and got folks together to share their perspectives on co-audit. And then since then, we’ve reflected on those discussions and tried to kind of pull them together in a more coherent sense than the sort of whiteboards and sticky notes that we produced back then. And so that’s produced this paper. I think one of the key things that we learned in that process that we hadn’t been thinking about before was that co-audit really complements prompt engineering. So you hear a lot about prompt engineering, and it’s the first part of what we call the prompt-response-audit loop. And this is related to what Carina was saying about Elena Glassman’s work about AI-human interaction. So the first step is you formulate a prompt. For example, you ask for Python code. That’s the first step. The second step is we wait for the response from the AI. And then the third step is that we need to inspect the response—that’s the audit part—decide if it meets our needs or if there is a mistake, and if that’s the case, we need to repeat again. So that’s this loop, the prompt-response-audit loop. And prompt engineering, they’re the tools and techniques that you use in that first step to create the prompt. So, for example, some tools will automatically include a data context in a prompt if you’re trying to create some Python to apply to a table in a spreadsheet or, or something like that. And then duly, co-audit, those are the tools and techniques we have to help the human audit the response in the third step of this loop. And that’s like these tools I’ve been mentioning that show maybe two or three candidates of code that’s to be used.

HUIZINGA: Carina, let’s move over to what kinds of things you came away with. Your takeaways or your findings from this workshop. Talk about that and how you chose to articulate them in the paper.

NEGREANU: So as part of our research, we found that basically one co-audit tool does not fit all needs, which in a way was great because we have a bigger field to explore, but in other ways a bit daunting, as it means you have to think of many things. And one thing that really came to light was that even though we can’t, you know, build something that fits everything, we can build a set of principles that we think are important. So really, we wrote our paper around those 10 principles that we have identified from the workshop and then are trying to promote them as things people should think about when they start going on the journey of building co-auditing tools. So one of the examples is that we really think that we should think about grounding outputs, so, for example, by citing reliable sources similar to what Bing Chat does today. We think that’s a really valuable, important principle that people should follow, and they should think about what that means in the concept of their co-auditing tool. In the case of Bing, it’s quite simple, as it’s like factual references, but maybe if it becomes referencing code, that becomes more tricky but still super interesting going forward. We also propose that co-auditing tools should have the capability to prioritize the user’s attention to the most likely errors, as we need to be mindful of the user’s cognitive efforts and have a positive cost benefit. Basically, if we overflood the users with different errors and flags, it might be too problematic, and the adoption might be quite difficult going forward. And finally, this is something that really comes to core to our research area in spreadsheets. It’s about thinking beyond text. So we know visuals are so important in how we explain things, in how we teach in schools, how we teach universities. So how do we include them in the co-auditing process going forward? I think that’s going to be a really interesting challenge, and we hope we’re going to see some interesting work in that space.

HUIZINGA: Yeah. Well, principles are one thing, Andy, but how does this paper contribute to real-world impact? We talked about that a bit at the beginning. Who benefits most from this tool?

GORDON: That is a great question, Gretchen, and actually that was a question that we talked about at the workshop. We think that some application areas are going to benefit more than others. So co-audit really matters when correctness really matters and when mistakes are bad consequences, so in terms of application area, that’s areas like maybe finance or technology development or medicine. But you asked particularly about who, and we think some people will benefit more from co-audit than others. And we found this really striking example, I guess it’s an anecdotal example that someone was posting on social media. A professor was teaching a class using generative AI tools for the first time to generate code, and he found some evidence that people who have low self-confidence with computers can be intimidated by generative AI. So he would find that some of the class were really confident users and they would ask it, you know, generate some Python to do such and such, and it would come back with code with, you know, a bunch of mistakes in it. And the confident users were happy just to swat that away; they were even quite a little arrogant about it, like this is a stupid computer, they were saying. But, Gretchen, he found that a lot of his students who were less confident with computers were quite intimidated by this because it was very confidently just saying, oh look, all this code is going to work. And they kind of got a bit stuck, and some of them were scrolling around through this code, trying to understand how it worked, when in fact it was just really broken. So he thought this was pretty bad that these able students who were just less confident were being intimidated and were making less good use of the, the generative AI. Now that is an example that’s an anecdote from social media from a reputable professor, but we looked into it and there’s peer-reviewed studies that show similar effect in the literature. So I’d say we need co-audit tools that will encourage these less confident users to question when the AI is mistaken rather than getting stuck, and I think otherwise they’re not going to see the benefits of the generative AI.

HUIZINGA: Well, Carina, sometimes I like to boil things down to a nugget or a beautiful takeaway. So if there’s one thing you want our listeners to take away from this work, this paper, what would it be?

NEGREANU: I think that what this study has taught us is that really we need significantly more research. So basically, a good co-auditing experience can really be the element that makes it or breaks it in how we incorporate LLMs safely into our day-to-day lives. But to make this happen, we need people from the field working towards the same goal. It’s really an interdisciplinary work, and I don’t think we can do it by isolating into groups as we’re currently researching now. So I would urge our listeners to think about how they could contribute in this space and reach out with feedback and questions to us. We are more than open to collaboration. Really, we are just starting this journey, and we’d love to see this area to become a research priority going forward in 2024.

HUIZINGA: Well, Andy, as an opportunity to give some specificity to Carina’s call for help, what potential pitfalls have you already identified that represent ongoing research challenges in this field? And what’s next on yours—and potentially others’—research agenda in this field?

GORDON: Well, one point, and I think Carina made this, that co-audit techniques will themselves never be perfect. I mean, we’re saying that language models are never going to be perfect. Mistakes will come through. But the co-audit techniques themselves won’t be perfect either. So sometimes a user who is using the tools will still miss some mistakes. So, for example, you know, at the workshop, we thought about security questions and co-audit tools themselves. And we were thinking, for instance, about maybe deliberate attacks on a generative AI. There’s various techniques that people are talking about at the moment where you might sort of poison the inputs that generative AI models pick up on. And in principle, co-audit tools could help users realize that there are deliberate mistakes that have been engineered by the attacker. So that’s good. But on the other hand, you know, security always becomes an arms race. And so once, you know, if we did have a good tool that could detect those kinds of mistakes, the attackers then will start to engineer around the co-audit tools, trying to make them less effective. So that will be an ongoing problem, I think. And on the other hand, you know, we’ll find that if co-audit tools are giving too many warnings, users will start to ignore them, and there’ll be a sort of under-reliance on co-audit tools. And of course, if we give too few, users will miss the mistakes. So an interesting balance needs to be struck. And also, we don’t expect there’s going to be one overarching co-audit experience, but we think there’ll be many different realizations. And so, as Carina says, we hope that common lessons can be learned, and that’s why we want to keep documenting this space in general and building a research community. So I echo what Carina was saying. If you’re listening and you think that what you’re working on is co-audit, do reach out.

HUIZINGA: Well, Andy Gordon, Carina Negreanu, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper and this research, you can find a link at aka.ms/abstracts, or you can read the preprint on arXiv. See you next time on Abstracts!



Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker


Customers of every size and industry are innovating on AWS by infusing machine learning (ML) into their products and services. Recent developments in generative AI models have further accelerated the need for ML adoption across industries. However, implementing security, data privacy, and governance controls remains a key challenge for customers when implementing ML workloads at scale. Addressing those challenges builds the framework and foundations for mitigating risk and the responsible use of ML-driven products. Although generative AI may need additional controls in place, such as removing toxicity and preventing jailbreaking and hallucinations, it shares the same foundational components for security and governance as traditional ML.

We hear from customers that they require specialized knowledge and investment of up to 12 months for building out their customized Amazon SageMaker ML platform implementation to ensure scalable, reliable, secure, and governed ML environments for their lines of business (LOBs) or ML teams. If you lack a framework for governing the ML lifecycle at scale, you may run into challenges such as team-level resource isolation, scaling experimentation resources, operationalizing ML workflows, scaling model governance, and managing security and compliance of ML workloads.

Governing the ML lifecycle at scale is a framework to help you build an ML platform with embedded security and governance controls based on industry best practices and enterprise standards. This framework addresses challenges by providing prescriptive guidance through a modular approach that extends an AWS Control Tower multi-account AWS environment and the approach discussed in the post Setting up secure, well-governed machine learning environments on AWS.

It provides prescriptive guidance for the following ML platform functions:

  • Multi-account, security, and networking foundations – This function uses AWS Control Tower and well-architected principles for setting up and operating the multi-account environment and the security and networking services.
  • Data and governance foundations – This function uses a data mesh architecture for setting up and operating the data lake, central feature store, and data governance foundations to enable fine-grained data access.
  • ML platform shared and governance services – This function enables setting up and operating common services such as CI/CD, AWS Service Catalog for provisioning environments, and a central model registry for model promotion and lineage.
  • ML team environments – This function enables setting up and operating environments for ML teams to develop, test, and deploy their use cases with security and governance controls embedded.
  • ML platform observability – This function helps with troubleshooting and identifying the root cause for problems in ML models through centralization of logs and providing tools for log analysis visualization. It also provides guidance for generating cost and usage reports for ML use cases.

Although this framework can provide benefits to all customers, it’s most beneficial for large, mature, regulated, or global enterprise customers that want to scale their ML strategies in a controlled, compliant, and coordinated way across the organization. It helps enable ML adoption while mitigating risks. This framework is useful for the following customers:

  • Large enterprise customers that have many LOBs or departments interested in using ML. This framework allows different teams to build and deploy ML models independently while providing central governance.
  • Enterprise customers with a moderate to high maturity in ML. They have already deployed some initial ML models and are looking to scale their ML efforts. This framework can help accelerate ML adoption across the organization. These companies also recognize the need for governance to manage things like access control, data usage, model performance, and unfair bias.
  • Companies in regulated industries such as financial services, healthcare, chemistry, and the private sector. These companies need strong governance and auditability for any ML models used in their business processes. Adopting this framework can help facilitate compliance while still allowing for local model development.
  • Global organizations that need to balance centralized and local control. This framework’s federated approach allows the central platform engineering team to set some high-level policies and standards, but also gives LOB teams flexibility to adapt based on local needs.

In the first part of this series, we walk through the reference architecture for setting up the ML platform. In a later post, we will provide prescriptive guidance for how to implement the various modules in the reference architecture in your organization.

The capabilities of the ML platform are grouped into four categories, as shown in the following figure. These capabilities form the foundation of the reference architecture discussed later in this post:

  • Build ML foundations
  • Scale ML operations
  • Observable ML
  • Secure ML

Solution overview

The framework for governing the ML lifecycle at scale enables organizations to embed security and governance controls throughout the ML lifecycle, which in turn helps organizations reduce risk and accelerate infusing ML into their products and services. The framework helps optimize the setup and governance of secure, scalable, and reliable ML environments that can scale to support an increasing number of models and projects. The framework enables the following features:

  • Account and infrastructure provisioning with organization policy compliant infrastructure resources
  • Self-service deployment of data science environments and end-to-end ML operations (MLOps) templates for ML use cases
  • LOB-level or team-level isolation of resources for security and privacy compliance
  • Governed access to production-grade data for experimentation and production-ready workflows
  • Management and governance for code repositories, code pipelines, deployed models, and data features
  • A model registry and feature store (local and central components) for improving governance
  • Security and governance controls for the end-to-end model development and deployment process

In this section, we provide an overview of prescriptive guidance to help you build this ML platform on AWS with embedded security and governance controls.

The functional architecture associated with the ML platform is shown in the following diagram. The architecture maps the different capabilities of the ML platform to AWS accounts.

The functional architecture with different capabilities is implemented using a number of AWS services, including AWS Organizations, SageMaker, AWS DevOps services, and a data lake. The reference architecture for the ML platform with various AWS services is shown in the following diagram.

This framework considers multiple personas and services to govern the ML lifecycle at scale. We recommend the following steps to organize your teams and services:

  1. Using AWS Control Tower and automation tooling, your cloud administrator sets up the multi-account foundations such as Organizations and AWS IAM Identity Center (successor to AWS Single Sign-On) and security and governance services such as AWS Key Management Service (AWS KMS) and Service Catalog. In addition, the administrator sets up a variety of organization units (OUs) and initial accounts to support your ML and analytics workflows.
  2. Data lake administrators set up your data lake and data catalog, and set up the central feature store working with the ML platform admin.
  3. The ML platform admin provisions ML shared services such as AWS CodeCommit, AWS CodePipeline, Amazon Elastic Container Registry (Amazon ECR), a central model registry, SageMaker Model Cards, SageMaker Model Dashboard, and Service Catalog products for ML teams.
  4. The ML team lead federates via IAM Identity Center, uses Service Catalog products, and provisions resources in the ML team’s development environment.
  5. Data scientists from ML teams across different business units federate into their team’s development environment to build the model pipeline.
  6. Data scientists search and pull features from the central feature store catalog, build models through experiments, and select the best model for promotion.
  7. Data scientists create and share new features into the central feature store catalog for reuse.
  8. An ML engineer deploys the model pipeline into the ML team test environment using a shared services CI/CD process.
  9. After stakeholder validation, the ML model is deployed to the team’s production environment.
  10. Security and governance controls are embedded into every layer of this architecture using services such as AWS Security Hub, Amazon GuardDuty, Amazon Macie, and more.
  11. Security controls are centrally managed from the security tooling account using Security Hub.
  12. ML platform governance capabilities such as SageMaker Model Cards and SageMaker Model Dashboard are centrally managed from the governance services account.
  13. Amazon CloudWatch and AWS CloudTrail logs from each member account are made accessible centrally from an observability account using AWS native services.

Next, we dive deep into the modules of the reference architecture for this framework.

Reference architecture modules

The reference architecture comprises eight modules, each designed to solve a specific set of problems. Collectively, these modules address governance across various dimensions, such as infrastructure, data, model, and cost. Each module offers a distinct set of functions and interoperates with other modules to provide an integrated end-to-end ML platform with embedded security and governance controls. In this section, we present a short summary of each module’s capabilities.

Multi-account foundations

This module helps cloud administrators build an AWS Control Tower landing zone as a foundational framework. This includes building a multi-account structure, authentication and authorization via IAM Identity Center, a network hub-and-spoke design, centralized logging services, and new AWS member accounts with standardized security and governance baselines.

In addition, this module gives best practice guidance on OU and account structures that are appropriate for supporting your ML and analytics workflows. Cloud administrators will understand the purpose of the required accounts and OUs, how to deploy them, and key security and compliance services they should use to centrally govern their ML and analytics workloads.

A framework for vending new accounts is also covered, which uses automation for baselining new accounts when they are provisioned. By having an automated account provisioning process set up, cloud administrators can provide ML and analytics teams the accounts they need to perform their work more quickly, without sacrificing a strong foundation for governance.

Data lake foundations

This module helps data lake admins set up a data lake to ingest data, curate datasets, and use the AWS Lake Formation governance model for managing fine-grained data access across accounts and users using a centralized data catalog, data access policies, and tag-based access controls. You can start small with one account for your data platform foundations for a proof of concept or a few small workloads. For medium-to-large-scale production workload implementation, we recommend adopting a multi-account strategy. In such a setting, LOBs can assume the role of data producers and data consumers using different AWS accounts, and the data lake governance is operated from a central shared AWS account. The data producer collects, processes, and stores data from their data domain, in addition to monitoring and ensuring the quality of their data assets. Data consumers consume the data from the data producer after the centralized catalog shares it using Lake Formation. The centralized catalog stores and manages the shared data catalog for the data producer accounts.
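As a minimal sketch of this cross-account sharing, the central data governance account could grant a consumer account SELECT access to a curated table through the Lake Formation API. The account IDs, database name, and table name below are placeholders.

import boto3

lakeformation = boto3.client("lakeformation")   # assumed to run in the central data governance account

# Grant a data consumer account SELECT access to a curated table (all identifiers are placeholders)
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "444455556666"},   # consumer AWS account ID
    Resource={
        "Table": {
            "CatalogId": "111122223333",          # central catalog (data governance) account ID
            "DatabaseName": "curated_sales_db",
            "Name": "orders",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)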

ML platform services

This module helps the ML platform engineering team set up shared services that are used by the data science teams on their team accounts. The services include a Service Catalog portfolio with products for SageMaker domain deployment, SageMaker domain user profile deployment, and data science model templates for model building and deployment. This module provides functionality for a centralized model registry, model cards, a model dashboard, and the CI/CD pipelines used to orchestrate and automate model development and deployment workflows.

In addition, this module details how to implement the controls and governance required to enable persona-based self-service capabilities, allowing data science teams to independently deploy their required cloud infrastructure and ML templates.

ML use case development

This module helps LOBs and data scientists access their team’s SageMaker domain in a development environment and instantiate a model building template to develop their models. In this module, data scientists work on a dev account instance of the template to interact with the data available on the centralized data lake, reuse and share features from a central feature store, create and run ML experiments, build and test their ML workflows, and register their models to a dev account model registry in their development environments.

Capabilities such as experiment tracking, model explainability reports, data and model bias monitoring, and model registry are also implemented in the templates, allowing for rapid adaptation of the solutions to the data scientists’ developed models.

ML operations

This module helps LOBs and ML engineers work on their dev instances of the model deployment template. After the candidate model is registered and approved, they set up CI/CD pipelines and run ML workflows in the team’s test environment, which registers the model into the central model registry running in a platform shared services account. When a model is approved in the central model registry, this triggers a CI/CD pipeline to deploy the model into the team’s production environment.
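A minimal sketch of the approval step, assuming the central model registry is a SageMaker model package group: updating a model package’s approval status, which a rule (for example, an Amazon EventBridge rule on the model package state change) can then pick up to start the deployment pipeline. The model package ARN is a placeholder.

import boto3

sm = boto3.client("sagemaker")   # assumed to run with access to the central model registry

# Approve a registered model version; the ARN is a placeholder
sm.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:111122223333:model-package/team-models/3",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Validated by stakeholders in the test environment",
)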

Centralized feature store

After the first models are deployed to production and multiple use cases start to share features created from the same data, a feature store becomes essential to ensure collaboration across use cases and reduce duplicate work. This module helps the ML platform engineering team set up a centralized feature store to provide storage and governance for ML features created by the ML use cases, enabling feature reuse across projects.
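The following is a minimal sketch of registering and populating a feature group with Amazon SageMaker Feature Store so other teams can discover and reuse the features. The feature names, schema, S3 location, and IAM role are placeholders.

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Example features shared for reuse across use cases (schema is illustrative)
features_df = pd.DataFrame({
    "record_id": ["r-001", "r-002"],
    "popularity_score": [0.82, 0.35],
    "event_time": [1700000000.0, 1700000000.0],
})
features_df["record_id"] = features_df["record_id"].astype("string")

feature_group = FeatureGroup(name="shared-popularity-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=features_df)    # infer the schema from the DataFrame
feature_group.create(
    s3_uri="s3://my-feature-store-bucket/offline",                # placeholder offline store location
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::111122223333:role/FeatureStoreRole",   # placeholder role
    enable_online_store=True,
)
feature_group.ingest(data_frame=features_df, max_workers=1, wait=True)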

Logging and observability

This module helps LOBs and ML practitioners gain visibility into the state of ML workloads across ML environments through centralization of log activity such as CloudTrail, CloudWatch, VPC flow logs, and ML workload logs. Teams can filter, query, and visualize logs for analysis, which can help enhance security posture as well.

Cost and reporting

This module helps various stakeholders (cloud admin, platform admin, cloud business office) to generate reports and dashboards to break down costs at ML user, ML team, and ML product levels, and track usage such as number of users, instance types, and endpoints.
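For example, assuming workloads are tagged with a team-level cost allocation tag (the tag key, dates, and service filter below are placeholders), costs could be broken down with the AWS Cost Explorer API:

import boto3

ce = boto3.client("ce")   # AWS Cost Explorer

# Break down monthly SageMaker costs by an assumed "ml-team" cost allocation tag
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-09-01", "End": "2023-10-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "ml-team"}],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon SageMaker"]}},
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])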

Customers have asked us to provide guidance on how many accounts to create and how to structure those accounts. In the next section, we provide guidance on that account structure as reference that you can modify to suit your needs according to your enterprise governance requirements.

Reference account structure

In this section, we discuss our recommendation for organizing your account structure. We share a baseline reference account structure; however, we recommend ML and data admins work closely with their cloud admin to customize this account structure based on their organization controls.

We recommend organizing accounts by OU for security, infrastructure, workloads, and deployments. Furthermore, within each OU, organize accounts into non-production and production OUs, because the accounts and workloads deployed under them have different controls. Next, we briefly discuss those OUs.

Security OU

The accounts in this OU are managed by the organization’s cloud admin or security team for monitoring, identifying, protecting, detecting, and responding to security events.

Infrastructure OU

The accounts in this OU are managed by the organization’s cloud admin or network team for managing enterprise-level infrastructure shared resources and networks.

We recommend having the following accounts under the infrastructure OU:

  • Network – Set up a centralized networking infrastructure such as AWS Transit Gateway
  • Shared services – Set up centralized AD services and VPC endpoints

Workloads OU

The accounts in this OU are managed by the organization’s platform team admins. If you need different controls implemented for each platform team, you can nest other levels of OU for that purpose, such as an ML workloads OU, data workloads OU, and so on.

We recommend the following accounts under the workloads OU:

  • Team-level ML dev, test, and prod accounts – Set this up based on your workload isolation requirements
  • Data lake accounts – Partition accounts by your data domain
  • Central data governance account – Centralize your data access policies
  • Central feature store account – Centralize features for sharing across teams

Deployments OU

The accounts in this OU are managed by the organization’s platform team admins for deploying workloads and observability.

We recommend the following accounts under the deployments OU because the ML platform team can set up different sets of controls at this OU level to manage and govern deployments:

  • ML shared services accounts for test and prod – Hosts platform shared services CI/CD and model registry
  • ML observability accounts for test and prod – Hosts CloudWatch logs, CloudTrail logs, and other logs as needed

Next, we briefly discuss organization controls that need to be considered for embedding into member accounts for monitoring the infrastructure resources.

AWS environment controls

A control is a high-level rule that provides ongoing governance for your overall AWS environment. It’s expressed in plain language. In this framework, we use AWS Control Tower to implement the following controls that help you govern your resources and monitor compliance across groups of AWS accounts:

  • Preventive controls – A preventive control ensures that your accounts maintain compliance because it disallows actions that lead to policy violations, and is implemented using a service control policy (SCP). For example, you can set a preventive control that ensures that CloudTrail is not deleted or stopped in AWS accounts or Regions (see the sketch after this list).
  • Detective controls – A detective control detects noncompliance of resources within your accounts, such as policy violations, provides alerts through the dashboard, and is implemented using AWS Config rules. For example, you can create a detective control that detects whether public read access is enabled for the Amazon Simple Storage Service (Amazon S3) buckets in the log archive shared account.
  • Proactive controls – A proactive control scans your resources before they are provisioned to make sure that they are compliant with that control, and is implemented using AWS CloudFormation hooks. Resources that aren’t compliant will not be provisioned. For example, you can set a proactive control that checks that direct internet access is not allowed for a SageMaker notebook instance.
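To illustrate the preventive control mentioned in the first item, the following sketch creates and attaches an SCP that denies stopping or deleting CloudTrail. The policy name and target OU ID are placeholders, and AWS Control Tower offers this control as a managed guardrail, so creating it manually is shown only for illustration.

import json
import boto3

org = boto3.client("organizations")

# Deny actions that would stop or delete CloudTrail in member accounts
scp_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
        "Resource": "*",
    }],
}

policy = org.create_policy(
    Content=json.dumps(scp_document),
    Description="Prevent CloudTrail from being stopped or deleted",
    Name="deny-cloudtrail-changes",            # placeholder policy name
    Type="SERVICE_CONTROL_POLICY",
)

# Attach the SCP to an OU (placeholder OU ID)
org.attach_policy(PolicyId=policy["Policy"]["PolicySummary"]["Id"], TargetId="ou-abcd-11112222")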

Interactions between ML platform services, ML use cases, and ML operations

Different personas, such as the head of data science (lead data scientist), data scientist, and ML engineer, operate modules 2–6 as shown in the following diagram for different stages of ML platform services, ML use case development, and ML operations along with data lake foundations and the central feature store.

The following list summarizes the ops flow activities and setup flow steps for the different personas. Once a persona initiates an ML activity as part of the ops flow, the services run as described in the corresponding setup flow steps.

  • Lead Data Scientist or ML Team Lead
      • Ops flow activity 1 – Uses Service Catalog in the ML platform services account and deploys the ML infrastructure, SageMaker projects, and the SageMaker model registry
          • Setup flow step 1-A – Sets up the dev, test, and prod environments for LOBs and sets up SageMaker Studio in the ML platform services account
          • Setup flow step 1-B – Sets up SageMaker Studio with the required configuration
  • Data Scientist
      • Ops flow activity 2 – Conducts and tracks ML experiments in SageMaker notebooks
          • Setup flow step 2-A – Uses data from Lake Formation and saves features in the central feature store
      • Ops flow activity 3 – Automates successful ML experiments with SageMaker projects and pipelines
          • Setup flow step 3-A – Initiates SageMaker pipelines (preprocess, train, evaluate) in the dev account and initiates the build CI/CD process with CodePipeline in the dev account
          • Setup flow step 3-B – After the SageMaker pipelines run, saves the model in the local (dev) model registry
  • Lead Data Scientist or ML Team Lead
      • Ops flow activity 4 – Approves the model in the local (dev) model registry
          • Setup flow step 4-A – Writes the model metadata and model package from the local (dev) model registry to the central model registry
      • Ops flow activity 5 – Approves the model in the central model registry (see the sketch after this list)
          • Setup flow step 5-A – Initiates the deployment CI/CD process to create SageMaker endpoints in the test environment
          • Setup flow step 5-B – Writes the model information and metadata to the ML governance module (model card, model dashboard) in the ML platform services account from the local (dev) account
  • ML Engineer
      • Ops flow activity 6 – Tests and monitors the SageMaker endpoint in the test environment after CI/CD
      • Ops flow activity 7 – Approves deployment for SageMaker endpoints in the prod environment
          • Setup flow step 7-A – Initiates the deployment CI/CD process to create SageMaker endpoints in the prod environment
      • Ops flow activity 8 – Tests and monitors the SageMaker endpoint in the prod environment after CI/CD
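
As an illustration of the approval steps (activities 4 and 5), the following minimal sketch shows how a model package could be approved in a SageMaker model registry with boto3; flipping the approval status is typically what triggers the downstream deployment CI/CD process. The model package ARN and description are hypothetical placeholders.

    import boto3

    sm = boto3.client("sagemaker")

    # Hypothetical ARN of the model package version awaiting approval.
    model_package_arn = "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-model/3"

    # An EventBridge rule or CodePipeline source stage can listen for this status
    # change to kick off the deployment CI/CD process.
    sm.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
        ApprovalDescription="Approved by the ML team lead after dev validation",
    )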

Personas and interactions with different modules of the ML platform

Each module is tailored to the needs of particular job roles or personas within specific divisions. The divisions that use a module most often are granted primary access, and secondary access is granted to other divisions that need the module only occasionally.

We discuss the following teams:

  • Central cloud engineering – This team operates at the enterprise cloud level across all workloads to set up common cloud infrastructure services, such as enterprise-level networking, identity, permissions, and account management
  • Data platform engineering – This team manages enterprise data lakes, data collection, data curation, and data governance
  • ML platform engineering – This team operates at the ML platform level across LOBs to provide shared ML infrastructure services such as ML infrastructure provisioning, experiment tracking, model governance, deployment, and observability

The following list details which divisions have primary and secondary access for each module, along with the module’s target personas and the typical number of accounts.

  • Module 1: Multi-account foundations – Primary access: central cloud engineering; secondary access: individual LOBs; target personas: cloud admin, cloud engineers; number of accounts: few
  • Module 2: Data lake foundations – Primary access: central cloud or data platform engineering; secondary access: individual LOBs; target personas: data lake admin, data engineers; number of accounts: multiple
  • Module 3: ML platform services – Primary access: central cloud or ML platform engineering; secondary access: individual LOBs; target personas: ML platform admin, ML team lead, ML engineers, ML governance lead; number of accounts: one
  • Module 4: ML use case development – Primary access: individual LOBs; secondary access: central cloud or ML platform engineering; target personas: data scientists, data engineers, ML team lead, ML engineers; number of accounts: multiple
  • Module 5: ML operations – Primary access: central cloud or ML engineering; secondary access: individual LOBs; target personas: ML engineers, ML team leads, data scientists; number of accounts: multiple
  • Module 6: Centralized feature store – Primary access: central cloud or data engineering; secondary access: individual LOBs; target personas: data engineers, data scientists; number of accounts: one
  • Module 7: Logging and observability – Primary access: central cloud engineering; secondary access: individual LOBs; target personas: cloud admin, IT auditors; number of accounts: one
  • Module 8: Cost and reporting – Primary access: individual LOBs; secondary access: central platform engineering; target personas: LOB executives, ML managers; number of accounts: one

Conclusion

In this post, we introduced a framework for governing the ML lifecycle at scale that helps you implement well-architected ML workloads that embed security and governance controls. We discussed how this framework takes a holistic approach to building an ML platform, considering data governance, model governance, and enterprise-level controls. We encourage you to experiment with the framework and concepts introduced in this post and share your feedback.


About the authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks his three-year-old sheep-a-doodle!

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He holds master’s degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Maira Ladeira Tanke is a Senior Data Specialist at AWS. As a technical lead, she helps customers accelerate their achievement of business value through emerging technology and innovative solutions. Maira has been with AWS since January 2020. Prior to that, she worked as a data scientist in multiple industries focusing on achieving business value from data. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Ryan Lempka is a Senior Solutions Architect at Amazon Web Services, where he helps his customers work backwards from business objectives to develop solutions on AWS. He has deep experience in business strategy, IT systems management, and data science. Ryan is dedicated to being a lifelong learner, and enjoys challenging himself every day to learn something new.

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers with data platform transformations across industry verticals. His core areas of expertise include technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing tabla.


How Meesho built a generalized feed ranker using Amazon SageMaker inference

This is a guest post co-written by Rama Badrinath, Divay Jindal and Utkarsh Agrawal at Meesho.


Meesho is India’s fastest growing ecommerce company with a mission to democratize internet commerce for everyone and make it accessible to the next billion users of India. Meesho was founded in 2015 and today focuses on buyers and sellers across India. The Meesho marketplace provides micro, small, and medium businesses and individual entrepreneurs access to millions of customers, a selection from over 30 categories and more than 900 sub-categories, pan-India logistics, payment services, and customer support capabilities to efficiently run their businesses on the Meesho ecosystem.

As an ecommerce platform, Meesho aims to improve the user experience by offering personalized and relevant product recommendations. We wanted to create a generalized feed ranker that considers individual preferences and historical behavior to effectively display products in each user’s feed. Through this, we wanted to boost user engagement, conversion rates, and overall business growth by tailoring the shopping experience to each customer’s unique requirements and providing the best value for their money.

We used AWS machine learning (ML) services like Amazon SageMaker to develop a powerful generalized feed ranker (GFR). In this post, we discuss the key components of the GFR and how this ML-driven solution streamlined the ML lifecycle, ensuring efficient infrastructure management, scalability, and reliability within the ecosystem.

Solution overview

To personalize users’ feeds, we analyzed extensive historical data, extracting insights into features that include browsing patterns and interests. These valuable features are used to construct ranking models. The GFR personalizes each user’s feed in real time, considering various factors like geography, prior shopping pattern, acquisition channels, and more. Several interaction-based features are also used to capture the affinity of the user towards an item, item category, or item properties like price, rating, or discount.

Several user-agnostic features and item-level scores are used as well, including an item popularity score and an item propensity-to-buy score. All these features are fed as input to the learning to rank (LTR) model, which predicts the probability of click (PCTR) and the probability of purchase (PCVR).
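
As a rough illustration of this setup (not Meesho’s actual implementation), the following sketch trains two gradient-boosted models for PCTR and PCVR on a feature matrix and blends their scores to rank candidate items. The feature matrix, labels, and blending weights are illustrative placeholders.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    # Illustrative feature matrix: one row per (user, candidate item) pair with
    # user affinity, item popularity, price, discount, and similar features.
    X_train = np.random.rand(10_000, 12)          # placeholder for real features
    y_click = np.random.randint(0, 2, 10_000)     # 1 if the item was clicked
    y_purchase = np.random.randint(0, 2, 10_000)  # 1 if the item was purchased

    pctr_model = GradientBoostingClassifier().fit(X_train, y_click)
    pcvr_model = GradientBoostingClassifier().fit(X_train, y_purchase)

    def rank_candidates(features: np.ndarray, w_click: float = 0.3, w_purchase: float = 0.7):
        """Score each candidate item and return indices sorted best-first.
        The blending weights are illustrative, not production values."""
        pctr = pctr_model.predict_proba(features)[:, 1]
        pcvr = pcvr_model.predict_proba(features)[:, 1]
        score = w_click * pctr + w_purchase * pcvr
        return np.argsort(-score)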

For diverse and relevant recommendations, the GFR sources candidate products from multiple channels, including exploit (known user preferences), explore (novel and potentially interesting products), popularity (trending items), and recent (latest additions).

The following diagram illustrates the GFR architecture.

The architecture can be divided into two different components: model training and model deployment. In the following sections, we discuss each component and the AWS services used in more detail.

Model training

Meesho used Amazon EMR with Apache Spark to process hundreds of millions of data points, depending on the model’s complexity. One of the major challenges was to run distributed training at scale. We used Dask—a distributed data science computing framework that natively integrates with Python libraries—on Amazon EMR to scale out the training jobs across the cluster. The distributed training of the model helped cut down training time from days to hours and allowed us to schedule Spark jobs efficiently and cost-effectively. We used an offline feature store to maintain a historical record of all feature values that will be used for model training. Model artifacts from training are stored in Amazon Simple Storage Service (Amazon S3), providing convenient access and version management.
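
The following minimal sketch shows one way distributed training with Dask could look. It assumes a Dask scheduler is already running on the EMR cluster; the scheduler address, S3 paths, column names, and hyperparameters are placeholders rather than Meesho’s actual configuration.

    import dask.dataframe as dd
    import xgboost as xgb
    from dask.distributed import Client

    # Connect to a Dask scheduler running on the EMR cluster (address is a placeholder).
    client = Client("tcp://scheduler.internal:8786")

    # Load training features exported from the offline feature store (placeholder path).
    df = dd.read_parquet("s3://example-bucket/offline-feature-store/training/")
    X = df.drop(columns=["clicked"])   # "clicked" label column is an assumption
    y = df["clicked"]

    # Distribute the data across workers and train a gradient-boosted model.
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    result = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "eval_metric": "aucpr"},  # illustrative params
        dtrain,
        num_boost_round=200,
    )
    booster = result["booster"]  # save to S3 alongside other model artifacts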

We used a time sampling strategy to create training, validation, and test datasets for model training. We kept track of various metrics to evaluate the performance of the model—the most important ones being area under the ROC curve and area under the precision recall curve. We also tracked calibration of the model to prevent overconfidence and underconfidence issues while predicting the probability scores.
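
As a simple illustration of these checks (assuming held-out labels and predicted probabilities are available as arrays), the following sketch computes the two area-under-curve metrics and a calibration curve with scikit-learn; the arrays here are synthetic placeholders.

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import average_precision_score, roc_auc_score

    # Placeholder arrays: true labels and predicted probabilities on the test split.
    y_true = np.random.randint(0, 2, 5_000)
    y_prob = np.random.rand(5_000)

    roc_auc = roc_auc_score(y_true, y_prob)            # area under the ROC curve
    pr_auc = average_precision_score(y_true, y_prob)   # area under the precision-recall curve

    # Calibration: compare predicted probabilities against observed frequencies.
    # Large gaps indicate over- or under-confident probability scores.
    observed, predicted = calibration_curve(y_true, y_prob, n_bins=10)
    max_gap = np.max(np.abs(observed - predicted))

    print(f"ROC AUC={roc_auc:.3f}, PR AUC={pr_auc:.3f}, worst calibration gap={max_gap:.3f}")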

Model deployment

Meesho used SageMaker inference endpoints with auto scaling enabled for deploying the trained model. SageMaker offered ease of deployment with support for various ML frameworks, allowing models to be served with low latency. Although AWS offers standard inference images suitable for most use cases, we built a custom inference image that caters specifically to our needs and pushed it to Amazon Elastic Container Registry (Amazon ECR).
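
A minimal sketch of this kind of deployment with the SageMaker Python SDK and Application Auto Scaling is shown below. The ECR image URI, model artifact path, IAM role, endpoint name, and scaling limits are illustrative placeholders, not Meesho’s actual configuration.

    import boto3
    from sagemaker.model import Model

    # Custom inference image in ECR and trained model artifacts in S3 (placeholders).
    model = Model(
        image_uri="111122223333.dkr.ecr.ap-south-1.amazonaws.com/gfr-inference:latest",
        model_data="s3://example-bucket/models/gfr/model.tar.gz",
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    )
    predictor = model.deploy(
        initial_instance_count=2,
        instance_type="ml.c5.xlarge",
        endpoint_name="gfr-endpoint",   # illustrative endpoint name
    )

    # Enable auto scaling on the endpoint's production variant.
    autoscaling = boto3.client("application-autoscaling")
    resource_id = "endpoint/gfr-endpoint/variant/AllTraffic"
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=2,
        MaxCapacity=10,
    )
    autoscaling.put_scaling_policy(
        PolicyName="gfr-invocations-scaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 1000.0,  # invocations per instance, illustrative value
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )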

We built an in-house A/B testing platform that facilitated live monitoring of A/B metrics, enabling us to make data-driven decisions promptly. We also used the A/B testing feature of SageMaker to deploy multiple production variants on an endpoint. Through A/B experiments, we observed an approximate 3.5% enhancement in the platform’s conversion rate and an increase in app open frequency of the users, highlighting the effectiveness of this approach.
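
The SageMaker production variants feature can be used roughly as follows; this is a sketch with hypothetical model names and a 90/10 traffic split, not Meesho’s actual experiment setup.

    import boto3

    sm = boto3.client("sagemaker")

    # Two already-created SageMaker models (hypothetical names) served behind one endpoint.
    sm.create_endpoint_config(
        EndpointConfigName="gfr-ab-test-config",
        ProductionVariants=[
            {
                "VariantName": "control",
                "ModelName": "gfr-model-v1",
                "InstanceType": "ml.c5.xlarge",
                "InitialInstanceCount": 2,
                "InitialVariantWeight": 0.9,   # 90% of traffic
            },
            {
                "VariantName": "challenger",
                "ModelName": "gfr-model-v2",
                "InstanceType": "ml.c5.xlarge",
                "InitialInstanceCount": 2,
                "InitialVariantWeight": 0.1,   # 10% of traffic
            },
        ],
    )
    sm.update_endpoint(EndpointName="gfr-endpoint", EndpointConfigName="gfr-ab-test-config")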

We kept track of various drifts such as feature drift and prior drift multiple times a day after model deployment to prevent the model performance from deteriorating.
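
Feature drift can be quantified in several ways; one common choice (shown here as an illustrative sketch, not necessarily the exact statistic Meesho uses) is the population stability index (PSI) between a feature’s training distribution and its recent serving distribution.

    import numpy as np

    def population_stability_index(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
        """PSI between a baseline (training) sample and a current (serving) sample of one
        feature. Values above roughly 0.2 are often treated as significant drift
        (the threshold is a convention, not a rule)."""
        edges = np.histogram_bin_edges(baseline, bins=bins)
        # Clip both samples into the baseline range so outliers land in the outer bins.
        base_counts = np.histogram(np.clip(baseline, edges[0], edges[-1]), bins=edges)[0]
        curr_counts = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0]

        base_frac = np.clip(base_counts / len(baseline), 1e-6, None)  # avoid log(0)
        curr_frac = np.clip(curr_counts / len(current), 1e-6, None)
        return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

    # Synthetic data for illustration: compare training data with slightly shifted live traffic.
    train_sample = np.random.normal(0.0, 1.0, 50_000)
    live_sample = np.random.normal(0.2, 1.0, 5_000)
    print(population_stability_index(train_sample, live_sample))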

We used AWS Lambda to set up various automations and triggers that are required during model retraining, endpoint updates, and monitoring processes.

The recommendation workflow after model deployment works as follows (as noted in the solution architecture diagram):

  1. The input requests with user context and interaction features are received at the application layer from Meesho’s mobile and web app.
  2. The application layer fetches additional features like historical data of the user from the online feature store and appends these to the input requests.
  3. The appended features are sent to the real-time endpoints to generate recommendations (see the sketch after these steps).
  4. The model predictions are sent back to the application layer.
  5. The application layer uses these predictions to personalize the user feeds on the mobile or web application.
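
Steps 3 and 4 boil down to a low-latency endpoint invocation from the application layer. A minimal sketch with boto3 is shown below; the endpoint name and payload format are illustrative assumptions.

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Hypothetical payload: user context plus features fetched from the online feature store.
    payload = {
        "user_id": "u-12345",
        "candidate_items": ["i-1", "i-2", "i-3"],
        "features": {"region": "KA", "price_affinity": 0.42, "item_popularity": [0.7, 0.1, 0.9]},
    }

    response = runtime.invoke_endpoint(
        EndpointName="gfr-endpoint",           # illustrative endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    scores = json.loads(response["Body"].read())  # model predictions returned to the app layer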

Conclusion

Meesho successfully implemented a generalized feed ranker using SageMaker, which resulted in highly personalized product recommendations for each customer based on their preferences and historical behavior. This approach significantly improved user engagement and led to higher conversion rates, contributing to the company’s overall business growth. By using AWS services, we reduced our ML lifecycle runtime significantly, from months to just weeks, leading to increased efficiency and productivity for our team.

With this advanced feed ranker, Meesho continues to deliver tailored shopping experiences, adding more value to its customers and fulfilling its mission to democratize ecommerce for everyone.

The team is grateful for the continuous support and guidance from Ravindra Yadav, Director of Data Science at Meesho, and Debdoot Mukherjee, Head of AI at Meesho, who played a key role in enabling this success.

To learn more about SageMaker, refer to the Amazon SageMaker Developer Guide.


About the Authors

Utkarsh Agrawal is currently working as a Senior Data Scientist at Meesho. He previously worked with Fractal Analytics and Trell on various domains, including recommender systems, time series, NLP, and more. He holds a master’s degree in Mathematics and Computing from Indian Institute of Technology Kharagpur (IIT), India.

Rama Badrinath is currently working as a Principal Data Scientist at Meesho. He previously worked with Microsoft and ShareChat on various domains, including recommender systems, image AI, NLP, and more. He holds a master’s degree in Machine Learning from Indian Institute of Science (IISc), India. He has also published papers in renowned conferences such as KDD and ECIR.

Divay Jindal is currently working as a Lead Data Scientist at Meesho. He previously worked with Bookmyshow on various domains, including recommender systems and dynamic pricing.

Venugopal Pai is a Solutions Architect at AWS. He lives in Bengaluru, India, and helps digital-native customers scale and optimize their applications on AWS.


Answering billions of reporting queries each day with low latency

Google Ads infrastructure runs on an internal data warehouse called Napa. Billions of reporting queries, which power critical dashboards used by advertising clients to measure campaign performance, run on tables stored in Napa. These tables contain records of ads performance that are keyed using particular customers and the campaign identifiers with which they are associated. Keys are tokens that are used both to associate an ads record with a particular client and campaign (e.g., customer_id, campaign_id) and for efficient retrieval. A record contains dozens of keys, so clients use reporting queries to specify keys needed to filter the data to understand ads performance (e.g., by region, device and metrics such as clicks, etc.). What makes this problem challenging is that the data is skewed since queries require varying levels of effort to be answered and have stringent latency expectations. Specifically, some queries require the use of millions of records while others are answered with just a few.

To this end, in “Progressive Partitioning for Parallelized Query Execution in Napa”, presented at VLDB 2023, we describe how the Napa data warehouse determines the amount of machine resources needed to answer reporting queries while meeting strict latency targets. We introduce a new progressive query partitioning algorithm that can parallelize query execution in the presence of complex data skews to perform consistently well in a matter of a few milliseconds. Finally, we demonstrate how Napa allows Google Ads infrastructure to serve billions of queries every day.

Query processing challenges

When a client inputs a reporting query, the main challenge is to determine how to parallelize the query effectively. Napa’s parallelization technique breaks up the query into even sections that are equally distributed across available machines, which then process these in parallel to significantly reduce query latency. This is done by estimating the number of records associated with a specified key, and assigning more or less equal amounts of work to machines. However, this estimation is not perfect since reviewing all records would require the same effort as answering the query. A machine that processes significantly more than others would result in run-time skews and poor performance. Each machine also needs to have sufficient work since needless parallelism leads to underutilized infrastructure. Finally, parallelization has to be a per query decision that must be executed near-perfectly billions of times, or the query may miss the stringent latency requirements.

The reporting query example below extracts the records denoted by keys (i.e., customer_id and campaign_id) and then computes an aggregate (i.e., SUM(cost)) from an advertiser table. In this example the number of records is too large to process on a single machine, so Napa needs to use a subsequent key (e.g., adgroup_id) to further break up the collection of records so that equal distribution of work is achieved. It is important to note that at petabyte scale, the size of the data statistics needed for parallelization may be several terabytes. This means that the problem is not just about collecting enormous amounts of metadata, but also how it is managed.

    SELECT customer_id, campaign_id, SUM(cost)
    FROM advertiser_table
    WHERE customer_id IN (1, 7, ..., x)
      AND campaign_id IN (10, 20, ..., y)
    GROUP BY customer_id, campaign_id;


This example query extracts records for the specified keys and computes SUM(cost) from the advertiser table. The query effort is determined by the keys included in the query. Keys belonging to clients with larger campaigns may touch millions of records, since the data volume directly correlates with the size of the ads campaign. This disparity in the number of records matching a key reflects the skewness in the data, which makes query processing a challenging problem.

An effective solution minimizes the amount of metadata needed, focuses effort primarily on the skewed part of the key space to partition data efficiently, and works well within the allotted time. For example, if the query latency is a few hundred milliseconds, partitioning should take no longer than tens of milliseconds. Finally, a parallelization process should determine when it’s reached the best possible partitioning that considers query latency expectations. To this end, we have developed a progressive partitioning algorithm that we describe later in this article.

Managing the data deluge

Tables in Napa are constantly updated, so we use log-structured merge forests (LSM tree) to organize the deluge of table updates. LSM is a forest of sorted data that is temporally organized with a B-tree index to support efficient key lookup queries. B-trees store summary information of the sub-trees in a hierarchical manner. Each B-tree node records the number of entries present in each subtree, which aids in the parallelization of queries. LSM allows us to decouple the process of updating the tables from the mechanics of query serving in the sense that live queries go against a different version of the data, which is atomically updated once the next batch of ingest (called delta) has been fully prepared for querying.

The partitioning problem

The data partitioning problem in our context is that we have a massively large table that is represented as an LSM tree. In the figure below, Delta 1 and 2 each have their own B-tree, and together represent 70 records. Napa breaks the records into two pieces, and assigns each piece to a different machine. The problem becomes a partitioning problem of a forest of trees and requires a tree-traversal algorithm that can quickly split the trees into two equal parts.

To avoid visiting all the nodes of the tree, we introduce the concept of “good enough” partitioning. As we begin cutting and partitioning the tree into two parts, we maintain an estimate of how bad our current answer would be if we terminated the partitioning process at that instant. This estimate is the yardstick of how close we are to the answer and is represented below by a total error margin of 40 (at this point of execution, the two pieces are expected to be between 15 and 35 records in size, so the uncertainty adds up to 40). Each subsequent traversal step reduces the error estimate, and if the two pieces are approximately equal, the partitioning process stops. This process continues until the desired error margin is reached, at which time we are guaranteed that the two pieces are more or less equal.

Progressive partitioning algorithm

Progressive partitioning encapsulates the notion of “good enough” in that it makes a series of moves to reduce the error estimate. The input is a set of B-trees and the goal is to cut the trees into pieces of more or less equal size. The algorithm traverses one of the trees (“drill down” in the figure), which results in a reduction of the error estimate. The algorithm is guided by statistics that are stored with each node of the tree so that it makes an informed set of moves at each step. The challenge is to decide how to direct effort in the best possible way so that the error bound shrinks in the fewest possible steps. Progressive partitioning is well suited to our use case because the longer the algorithm runs, the more equal the pieces become. It also means that if the algorithm is stopped at any point, one still gets a good partitioning, where the quality corresponds to the time spent.
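
To make the idea concrete, here is a heavily simplified sketch of “good enough” partitioning: a single B-tree rather than an LSM forest, assuming each node’s count equals the sum of its children’s counts. It illustrates the shrinking error bound, not the algorithm from the paper.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        count: int                                             # records in this subtree
        children: List["Node"] = field(default_factory=list)   # empty for leaf nodes

    def progressive_split(root: Node, max_error: int) -> int:
        """Estimate a left-piece size that splits the tree into two roughly equal
        halves, refining until the uncertainty is <= max_error or a leaf is reached."""
        target = root.count // 2          # ideal size of the left piece
        left = 0                          # records definitely assigned to the left piece
        frontier: Optional[Node] = root   # the one subtree whose placement is undecided

        # The undecided subtree is the only source of uncertainty: the true left size
        # lies in [left, left + frontier.count]. Each drill-down shrinks that interval.
        while frontier is not None and frontier.count > max_error and frontier.children:
            next_frontier = None
            for child in frontier.children:        # children are in key order
                if left + child.count <= target:
                    left += child.count            # whole child falls left of the cut
                else:
                    next_frontier = child          # the cut lands inside this child
                    break
            frontier = next_frontier               # drill down (or stop if fully resolved)
        return left

    # Toy example tree with 70 records; two drill-downs find an even 35/35 split.
    tree = Node(70, [Node(30, [Node(12), Node(18)]),
                     Node(25, [Node(5), Node(20)]),
                     Node(15)])
    print(progressive_split(tree, max_error=5))    # -> 35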

Prior work in this space uses a sampled table to drive the partitioning process, while the Napa approach uses a B-tree. As mentioned earlier, even just a sample from a petabyte table can be massive. A tree-based partitioning method can achieve partitioning much more efficiently than a sample-based approach, which does not use a tree organization of the sampled records. We compare progressive partitioning with an alternative approach, where sampling of the table at various resolutions (e.g., 1 record sample every 250 MB and so on) aids the partitioning of the query. Experimental results show the relative speedup from progressive partitioning for queries requiring varying numbers of machines. These results demonstrate that progressive partitioning is much faster than existing approaches and the speedup increases as the size of the query increases.

Conclusion

Napa’s progressive partitioning algorithm efficiently optimizes database queries, enabling Google Ads to serve client reporting queries billions of times each day. We note that tree traversal is a common technique that students in introductory computer science courses use, yet it also serves a critical use-case at Google. We hope that this article will inspire our readers, as it demonstrates how simple techniques and carefully designed data structures can be remarkably potent if used well. Check out the paper and a recent talk describing Napa to learn more.

Acknowledgements

This blog post describes a collaborative effort between Junichi Tatemura, Tao Zou, Jagan Sankaranarayanan, Yanlai Huang, Jim Chen, Yupu Zhang, Kevin Lai, Hao Zhang, Gokul Nath Babu Manoharan, Goetz Graefe, Divyakant Agrawal, Brad Adelberg, Shilpa Kolhar and Indrajit Roy.
