Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain

In today’s information age, the vast volumes of data housed in countless documents present both a challenge and an opportunity for businesses. Traditional document processing methods often fall short in efficiency and accuracy, leaving room for innovation, cost-efficiency, and optimizations. Document processing has witnessed significant advancements with the advent of Intelligent Document Processing (IDP). With IDP, businesses can transform unstructured data from various document types into structured, actionable insights, dramatically enhancing efficiency and reducing manual efforts. However, the potential doesn’t end there. By integrating generative artificial intelligence (AI) into the process, we can further enhance IDP capabilities. Generative AI not only enhances document processing capabilities, it also brings dynamic adaptability to changing data patterns. This post takes you through the synergy of IDP and generative AI, unveiling how they represent the next frontier in document processing.

We discuss IDP in detail in our series Intelligent document processing with AWS AI services (Part 1 and Part 2). In this post, we discuss how to extend a new or existing IDP architecture with large language models (LLMs). More specifically, we discuss how we can integrate Amazon Textract with LangChain as a document loader and Amazon Bedrock to extract data from documents and use generative AI capabilities within the various IDP phases.

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) through easy-to-use APIs.

Solution overview

The following diagram is a high-level reference architecture that explains how you can further enhance an IDP workflow with foundation models. You can use LLMs in one or all phases of IDP depending on the use case and desired outcome.

In this architecture, LLMs are used to perform specific tasks within the IDP workflow.

  • Document classification – In addition to using Amazon Comprehend, you can use an LLM to classify documents using few-shot prompting. Few-shot prompting involves prompting the language model with a few examples of different classes and a list of all possible classes, and then asking the model to classify a given piece of text from a document using one of the classes.
  • Summarization – You can also use LLMs to summarize larger documents to provide precise summaries within the extraction phase of IDP. For example, a financial analysis system may involve analyzing hundreds of pages of earnings documents of a company. You can use a language model to summarize the key aspects of the earnings, enabling analysts to make business decisions.
  • Standardization and in-context Q&A – In addition to extracting exact information out of documents using the Amazon Textract Analyze Document functionality, you can use LLMs to extract information that isn’t explicitly stated in a document but can be inferred from it. For example, a patient discharge summary may have the patient’s hospital admit date and discharge date but may not explicitly specify the total number of days the patient was in the hospital. You can use an LLM to deduce the total number of days the patient was admitted in the hospital, given the two dates extracted by Amazon Textract. This value can then be assigned with a well-known alias in a key-value format, also known as a normalized key, which makes consumption and post-processing even more straightforward.
  • Templating and normalizations – An IDP pipeline often generates output that must conform to a specific deterministic schema so that the output generated using the IDP workflow can be consumed by a downstream system, for example a relational database. The benefit of defining a deterministic schema is also to achieve key normalization so that we have a known set of keys to process in our postprocessing logic. For example, we may want to define “DOB” as a normalized key for “date of birth,” “birth date,” “birthday date,” “date born,” and so on, because documents may come with any variation of these. We use LLMs to perform such templating and normalized key-value extraction on any document.
  • Spellcheck and corrections – Although Amazon Textract can extract the exact values from scanned documents (printed or handwritten), you can use a language model to identify whether misspellings and grammatical errors exist in the data extracted from the document. This is important in situations where the data may be extracted from poor quality or handwritten documents and used for generating marketing materials, flash reports, and so on. In addition to having a human manually review low-score extractions from Amazon Textract, you can use an LLM to augment the review process by providing correction recommendations to the human reviewer, thereby speeding up the review process.

In the following sections, we dive deep into how Amazon Textract is integrated into generative AI workflows using LangChain to process documents for each of these specific tasks. The code blocks provided here have been trimmed down for brevity. Refer to our GitHub repository for detailed Python notebooks and a step-by-step walkthrough.

Amazon Textract LangChain document loader

Text extraction from documents is a crucial aspect when it comes to processing documents with LLMs. You can use Amazon Textract to extract unstructured raw text from documents and preserve the original semi-structured or structured objects like key-value pairs and tables present in the document. Document packages like healthcare and insurance claims or mortgages consist of complex forms that contain a lot of information across structured, semi-structured, and unstructured formats. Document extraction is an important step here because LLMs benefit from the rich content to generate more accurate and relevant responses, which otherwise could impact the quality of the LLMs’ output.

LangChain is a powerful open-source framework for integrating with LLMs. LLMs in general are versatile but may struggle with domain-specific tasks where deeper context and nuanced responses are needed. LangChain empowers developers in such scenarios to build agents that can break down complex tasks into smaller sub-tasks. The sub-tasks can then introduce context and memory into LLMs by connecting and chaining LLM prompts.

LangChain offers document loaders that can load and transform data from documents. You can use them to structure documents into preferred formats that can be processed by LLMs. The AmazonTextractPDFLoader is a service loader type of document loader that provides a quick way to automate document processing by using Amazon Textract in combination with LangChain. For more details on AmazonTextractPDFLoader, refer to the LangChain documentation. To use the Amazon Textract document loader, you start by importing it from the LangChain library:

from langchain.document_loaders import AmazonTextractPDFLoader

You can load a document from an HTTPS URL endpoint as well as documents hosted in Amazon Simple Storage Service (Amazon S3) buckets via Amazon S3 object URLs (also called path style access):

https_loader = AmazonTextractPDFLoader("https://sample-website.com/sample-doc.pdf")
https_document = https_loader.load()

s3_loader = AmazonTextractPDFLoader("s3://sample-bucket/sample-doc.pdf")
s3_document = s3_loader.load()

You can also store documents in Amazon S3 and refer to them using the s3:// URL pattern, as explained in Accessing a bucket using S3://, and pass this S3 path to the Amazon Textract PDF loader:

import boto3
textract_client = boto3.client('textract', region_name='us-east-2')

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()

A multi-page document is loaded as a list of pages in the documents object. The following code loops through the pages and prints each page’s text, which is available via the page_content attribute:

print(len(documents))

for document in documents:
    print(document.page_content)

Document classification

Amazon Comprehend and LLMs can be effectively utilized for document classification. Amazon Comprehend is a natural language processing (NLP) service that uses ML to extract insights from text. Amazon Comprehend also supports custom classification model training with layout awareness on documents like PDFs, Word, and image formats. For more information about using the Amazon Comprehend document classifier, refer to Amazon Comprehend document classifier adds layout support for higher accuracy.

When paired with LLMs, document classification becomes a powerful approach for managing large volumes of documents. LLMs are helpful in document classification because they can analyze the text, patterns, and contextual elements in the document using natural language understanding. You can also fine-tune them for specific document classes. When a new document type that needs classification is introduced into the IDP pipeline, the LLM can process its text and categorize the document given a set of classes. The following is sample code that uses the LangChain document loader powered by Amazon Textract to extract the text from a document and use it to classify the document. We use the Anthropic Claude v2 model via Amazon Bedrock to perform the classification.

In the following example, we first extract text from a patient discharge report and use an LLM to classify it given a list of three different document types—DISCHARGE_SUMMARY, RECEIPT, and PRESCRIPTION. The following screenshot shows our report.

We use the following code:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/document.png")
document = loader.load()

template = """

Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.

<classes>DISCHARGE_SUMMARY, RECEIPT, PRESCRIPTION</classes>
<document>{doc_text}</document>
<classification>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
class_name = llm_chain.run(document[0].page_content)

print(f"The provided document is = {class_name}")

The code produces the following output:

The provided document is = DISCHARGE_SUMMARY

Document summarization

Summarization involves condensing a given text or document into a shorter version while retaining its key information. This technique is beneficial for efficient information retrieval, which enables users to quickly grasp the key points of a document without reading the entire content. Although Amazon Textract doesn’t directly perform text summarization, it provides the foundational capabilities of extracting the entire text from documents. This extracted text serves as an input to our LLM model for performing text summarization tasks.

Using the same sample discharge report, we use AmazonTextractPDFLoader to extract text from this document. As before, we use the Claude v2 model via Amazon Bedrock and initialize it with a prompt that contains the instructions on what to do with the text (in this case, summarization). Finally, we run the LLM chain by passing in the extracted text from the document loader. This runs an inference action on the LLM with the prompt that consists of the instructions to summarize, and the document’s text marked by Document. See the following code:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """

Given a full document, give me a concise summary. Skip any preamble text and just give the summary.

<document>{doc_text}</document>
<summary>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

num_tokens = bedrock_llm.get_num_tokens(document[0].page_content)
print (f"Our prompt has {num_tokens} tokens nn=========================n")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
summary = llm_chain.run(document[0].page_content)

print(summary.replace("</summary>","").strip())

The code generates the summary of a patient discharge summary report:

Our prompt has 797 tokens 
=========================
35 yo M admitted for epigastric abdominal pain, nausea, fatigue. Found to likely have ulcer. Discharged with activity restrictions, antibiotics, diet changes, and follow up.

The preceding example used a single-page document to perform summarization. However, you will likely deal with documents containing multiple pages that need summarization. A common way to perform summarization on multiple pages is to first generate summaries on smaller chunks of text and then combine the smaller summaries to get a final summary of the document. Note that this method requires multiple calls to the LLM. The logic for this can be crafted easily; however, LangChain provides a built-in summarize chain that can summarize large texts (from multi-page documents). The summarization can happen via either the map_reduce or the stuff chain type, which manage the multiple calls to the LLM in different ways. In the following example, we use map_reduce to summarize a multi-page document. The following figure illustrates our workflow.

Let’s first start by extracting the document and see the total token count per page and the total number of pages:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

loader = AmazonTextractPDFLoader(f"s3://{data_bucket}/bedrock-sample/health_plan.pdf")
document = loader.load()
num_docs = len(document)
print (f"There are {num_docs} pages in the document")
for index, doc in enumerate(document):
    num_tokens_first_doc = bedrock_llm.get_num_tokens(doc.page_content)
    print (f"Page {index+1} has approx. {num_tokens_first_doc} tokens")

There are 5 pages in the document
Page 1 has approx. 533 tokens
Page 2 has approx. 1323 tokens
Page 3 has approx. 997 tokens
Page 4 has approx. 1643 tokens
Page 5 has approx. 867 tokens

Next, we use LangChain’s built-in load_summarize_chain to summarize the entire document:

from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=bedrock_llm,
                                     chain_type='map_reduce')
output = summary_chain.run(document)
print(output.strip())

Standardization and Q&A

In this section, we discuss standardization and Q&A tasks.

Standardization

Output standardization is a text generation task where LLMs are used to provide consistent formatting of the output text. This task is particularly useful for automation of key entity extraction that requires the output to be aligned with desired formats. For example, we can follow prompt engineering best practices to instruct an LLM to format dates into a specific format, such as MM/DD/YYYY, which may be compatible with a database DATE column. The following code block shows an example of how this is done using an LLM and prompt engineering. Not only do we standardize the output format for the date values, we also prompt the model to generate the final output in a JSON format so that it is easily consumable in our downstream applications. We use LangChain Expression Language (LCEL) to chain together two actions. The first action prompts the LLM to generate a JSON format output of just the dates from the document. The second action takes the JSON output and standardizes the date format. Note that this two-step action may also be performed in a single step with proper prompt engineering, as we’ll see in normalization and templating.

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

template1 = """

Given a full document, answer the question and format the output in the format specified. Skip any preamble text and just generate the JSON.

<format>
{{
  "key_name":"key_value"
}}
</format>
<document>{doc_text}</document>
<question>{question}</question>"""

template2 = """

Given a JSON document, format the dates in the value fields precisely in the provided format. Skip any preamble text and just generate the JSON.

<format>DD/MM/YYYY</format>
<json_document>{json_doc}</json_document>
"""


prompt1 = PromptTemplate(template=template1, input_variables=["doc_text", "question"])
llm_chain = LLMChain(prompt=prompt1, llm=bedrock_llm, verbose=True)

prompt2 = PromptTemplate(template=template2, input_variables=["json_doc"])
llm_chain2 = LLMChain(prompt=prompt2, llm=bedrock_llm, verbose=True)

chain = ( 
    llm_chain 
    | {'json_doc': lambda x: x['text'] }  
    | llm_chain2
)

std_op = chain.invoke({ "doc_text": document[0].page_content, 
                        "question": "Can you give me the patient admitted and discharge dates?"})

print(std_op['text'])

{
  "admit_date":"07/09/2020",
  "discharge_date":"08/09/2020"
}

The output of the preceding code sample is a JSON structure with dates 07/09/2020 and 08/09/2020, which are in the format DD/MM/YYYY and are the patient’s admit and discharge date from the hospital, respectively, according to the discharge summary report.

Q&A with Retrieval Augmented Generation

LLMs are known to retain factual information, often referred to as their world knowledge or world view. When fine-tuned, they can produce state-of-the-art results. However, there are constraints to how effectively an LLM can access and manipulate this knowledge. As a result, in tasks that heavily rely on specific knowledge, their performance might not be optimal for certain use cases. For instance, in Q&A scenarios, it’s essential for the model to adhere strictly to the context provided in the document without relying solely on its world knowledge. Deviating from this can lead to misrepresentations, inaccuracies, or even incorrect responses. The most commonly used method to address this problem is known as Retrieval Augmented Generation (RAG). This approach synergizes the strengths of both retrieval models and language models, enhancing the precision and quality of the responses generated.

LLMs can also impose token limitations due to their memory constraints and the limitations of the hardware they run on. To handle this problem, techniques like chunking are used to divide large documents into smaller portions that fit within the token limits of LLMs. On the other hand, embeddings are employed in NLP primarily to capture the semantic meaning of words and their relationships with other words in a high-dimensional space. These embeddings transform words into vectors, allowing models to efficiently process and understand textual data. By understanding the semantic nuances between words and phrases, embeddings enable LLMs to generate coherent and contextually relevant outputs. Note the following key terms:

  • Chunking – This process breaks down large amounts of text from documents into smaller, meaningful chunks of text.
  • Embeddings – These are fixed-dimensional vector transformations of each chunk that retain the semantic information from the chunks. These embeddings are subsequently loaded into a vector database.
  • Vector database – This is a database of word embeddings or vectors that represent the context of words. It acts as a knowledge source that aids NLP tasks in document processing pipelines. The benefit of the vector database here is that it allows only the necessary context to be provided to the LLMs during text generation, as we explain in the following section.

RAG uses the power of embeddings to understand and fetch relevant document segments during the retrieval phase. By doing so, RAG can work within the token limitations of LLMs, ensuring the most pertinent information is selected for generation, resulting in more accurate and contextually relevant outputs.

The following diagram illustrates the integration of these techniques to craft the input to LLMs, enhancing their contextual understanding and enabling more relevant in-context responses. One approach involves similarity search, utilizing both a vector database and chunking. The vector database stores embeddings representing semantic information, and chunking divides text into manageable sections. Using this context from similarity search, LLMs can run tasks such as question answering and domain-specific operations like classification and enrichment.

For this post, we use a RAG-based approach to perform in-context Q&A with documents. In the following code sample, we extract text from a document and then split the document into smaller chunks of text. Chunking is required because we may have large multi-page documents and our LLMs may have token limits. These chunks are then loaded into the vector database for performing similarity search in the subsequent steps. In the following example, we use the Amazon Titan Embed Text v1 model, which performs the vector embeddings of the document chunks:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.chains import RetrievalQA
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate

loader = AmazonTextractPDFLoader("amazon_10k.pdf")
document = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
                                               separators=["nn", "n", ".", "!", "?", ",", " ", ""],
                                               chunk_overlap=0)
texts = text_splitter.split_documents(document)
embeddings = BedrockEmbeddings(client=bedrock,
                               model_id="amazon.titan-embed-text-v1")
db = FAISS.from_documents(documents=texts,
                           embedding=embeddings) 

retriever = db.as_retriever(search_type='mmr', search_kwargs={"k": 3})

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.

<text>{context}</text>
<question>{question}</question>
<answer>"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context","question"])

chain_type_kwargs = { "prompt": qa_prompt, "verbose": False } # change verbose to True if you need to see what's happening

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")
qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm, 
    chain_type="stuff", 
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    verbose=False # change verbose to True if you need to see what's happening
)

question="Who is the administrator for this plan?"
result = qa.run(question)
print(result.strip())

The code creates a relevant context for the LLM using the chunks of text that are returned by the similarity search action from the vector database. For this example, we use an open-source FAISS vector store as a sample vector database to store vector embeddings of each chunk of text. We then define the vector database as a LangChain retriever, which is passed into the RetrievalQA chain. This internally runs a similarity search query on the vector database that returns the top n (where n=3 in our example) chunks of text that are relevant to the question. Finally, the LLM chain is run with the relevant context (a group of relevant chunks of text) and the question for the LLM to answer. For a step-by-step code walkthrough of Q&A with RAG, refer to the Python notebook on GitHub.

As an alternative to FAISS, you can also use Amazon OpenSearch Service vector database capabilities, Amazon Relational Database Service (Amazon RDS) for PostgreSQL with the pgvector extension, or the open-source Chroma database as your vector store.
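
For example, the following is a minimal sketch (not part of the original walkthrough) that swaps the FAISS store in the preceding example for Chroma. It assumes the chromadb package is installed and reuses the texts and embeddings variables defined earlier:

from langchain.vectorstores import Chroma

# Build the vector store from the same document chunks and Bedrock embeddings
chroma_db = Chroma.from_documents(documents=texts, embedding=embeddings)

# Expose it as a LangChain retriever, just like the FAISS example
chroma_retriever = chroma_db.as_retriever(search_kwargs={"k": 3})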

Q&A with tabular data

Tabular data within documents can be challenging for LLMs to process because of its structural complexity. Amazon Textract complements LLMs here because it can extract tables from documents in a nested format of elements such as pages, tables, and cells. Performing Q&A with tabular data is a multi-step process, and can be achieved via self-querying. The following is an overview of the steps:

  1.  Extract tables from documents using Amazon Textract. With Amazon Textract, the tabular structure (rows, columns, headers) can be extracted from a document.
  2.  Store the tabular data into a vector database along with metadata information, such as the header names and the description of each header.
  3. Use the prompt to construct a structured query, using an LLM, to derive the data from the table.
  4. Use the query to extract the relevant table data from the vector database.

For example, in a bank statement, given the prompt “What are the transactions with more than $1000 in deposits,” the LLM would complete the following steps:

  1. Craft a query, such as “Query: transactions”, “filter: greater than (Deposit$)”.
  2. Convert the query into a structured query.
  3. Apply the structured query to the vector database where our table data is stored.

For a step-by-step sample code walkthrough of Q&A with tabular data, refer to the Python notebook on GitHub.
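
To illustrate the first step, the following is a minimal sketch (with placeholder bucket and file names) that calls the Amazon Textract AnalyzeDocument API via Boto3 to extract the tabular structure from a bank statement; the resulting TABLE and CELL blocks can then be post-processed and loaded into the vector database:

import boto3

textract = boto3.client("textract", region_name="us-east-2")

# Analyze a single-page bank statement stored in Amazon S3 and request table extraction
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "sample-bucket", "Name": "bank-statement.png"}},
    FeatureTypes=["TABLES"],
)

# TABLE and CELL blocks carry the tabular structure (rows, columns, headers)
tables = [block for block in response["Blocks"] if block["BlockType"] == "TABLE"]
cells = [block for block in response["Blocks"] if block["BlockType"] == "CELL"]
print(f"Found {len(tables)} table(s) containing {len(cells)} cells")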

Templating and normalizations

In this section, we look at how to use prompt engineering techniques and LangChain’s built-in mechanism to generate an output with extractions from a document in a specified schema. We also perform some standardization on the extracted data, using the techniques discussed previously. We start by defining a template for our desired output. This will serve as a schema and encapsulate the details about each entity we want to extract from the document’s text.

output_template= {
    "doctor_name":{ "type": "string", "description": "The doctor or provider's full name" },
    "provider_id":{ "type": "string", "description": "The doctor or provider's ID" },
    "patient_name":{ "type": "string", "description": "The patient's full name" },
    …
}

Note that for each of the entities, we use the description to explain what that entity is, to help the LLM extract the value from the document’s text. In the following sample code, we use this template to craft our prompt for the LLM along with the text extracted from the document using AmazonTextractPDFLoader and subsequently perform inference with the model:

from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

You are a helpful assistant. Please extract the following details from the document and format the output as JSON using the keys. Skip any preamble text and generate the final answer.

<details>
{details}
</details>

<keys>
{keys}
</keys>

<document>
{doc_text}
</document>

<final_answer>"""

details = "n".join([f"{key}: {value['description']}" for key, value in output_template.items()])
keys = "n".join([f"{key}" for key, value in output_template.items()])

prompt = PromptTemplate(template=template, input_variables=["details", "keys", "doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
output = llm_chain.run({"doc_text": full_text, "details": details, "keys": keys})

print(output)

{
  "doctor_name": "Mateo Jackson, Phd",
  "provider_id": "XA/7B/00338763", 
  "patient_name": "John Doe",
  … 
}

As you can see, the {keys} part of the prompt is the keys from our template, and the {details} are the keys along with their description. In this case, we don’t prompt the model explicitly with the format of the output other than specifying in the instruction to generate the output in JSON format. This works for the most part; however, because the output from LLMs is non-deterministic text generation, we want to specify the format explicitly as part of the instruction in the prompt. To solve this, we can use LangChain’s structured output parser module to take advantage of the automated prompt engineering that helps convert our template to a format instruction prompt. We use the template defined earlier to generate the format instruction prompt as follows:

from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

response_schemas = list()

for key, value in output_template.items():
    schema = ResponseSchema(name=key, description=value['description'], type=value['type'])
    response_schemas.append(schema)
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions= output_parser.get_format_instructions()
print(format_instructions)

The format_instructions variable now holds the format instruction prompt:

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"doctor_name": string  // The doctor or provider's full name
	"provider_id": string  // The doctor or provider's ID
	"patient_name": string  // The patient's full name
        …
}
```

We then use this variable within our original prompt as an instruction to the LLM so that it extracts and formats the output in the desired schema by making a small modification to our prompt:

template = """

You are a helpful assistant. Please extract the following details from the document and strictly follow the instructions described in the format instructions to format the output. Skip any preamble text and generate the final answer. Do not generate an incomplete answer.

<details>
{details}
</details>

<format_instructions>
{format_instructions}
</format_instructions>

<document>
{doc_text}
</document>

<final_answer>"""
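
The following is a minimal sketch of how this modified template can be wired up; it reuses the details, format_instructions, full_text, bedrock_llm, and output_parser variables from the earlier snippets and parses the model output back into a Python dictionary:

prompt = PromptTemplate(template=template,
                        input_variables=["details", "format_instructions", "doc_text"])
llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)

output = llm_chain.run({"doc_text": full_text,
                        "details": details,
                        "format_instructions": format_instructions})

# Parse the markdown-fenced JSON generated by the LLM into a Python dictionary
parsed_output = output_parser.parse(output)
print(parsed_output)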

So far, we have only extracted the data out of the document in a desired schema. However, we still need to perform some standardization. For example, we want the patient’s admitted date and discharge date to be extracted in DD/MM/YYYY format. In this case, we augment the description of the key with the formatting instruction:

new_output_template= {   
    … 
    "admitted_date":{ "type": "string",  "description": "Date the patient was admitted to the hospital, this should be formatted in DD/MM/YYYY format." },
    "discharge_date":{ "type": "string",  "description": "Date the patient was discharged from the hospital, this should be formatted in DD/MM/YYYY format." 
    …
}

Refer to the Python notebook in GitHub for a full step-by-step walkthrough and explanation.

Spellchecks and corrections

LLMs have demonstrated remarkable abilities in understanding and generating human-like text. One of the lesser-discussed but immensely useful applications of LLMs is their potential in grammatical checks and sentence correction in documents. Unlike traditional grammar checkers that rely on a set of predefined rules, LLMs use patterns that they have identified from vast amounts of text data to determine what constitutes correct or fluent language. This means they can detect nuances, context, and subtleties that rule-based systems might miss.

Imagine the text extracted from a patient discharge summary that reads “Patient Jon Doe, who was admittd with sever pnemonia, has shown significant improvemnt and can be safely discharged. Followups are scheduled for nex week.” A traditional spellchecker might recognize “admittd,” “sever,” “pnemonia,” “improvemnt,” and “nex” as errors, but without understanding the context, it could offer incorrect or generic suggestions. An LLM, equipped with its extensive training, might suggest: “Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-ups are scheduled for next week.”

The following is a poorly handwritten sample document with the same text as explained previously.

We extract the document with an Amazon Textract document loader and then instruct the LLM, via prompt engineering, to rectify the extracted text and correct any spelling or grammatical mistakes:

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/hand_written_note.pdf")
document = loader.load()

template = """

Given a detailed 'Document', perform spelling and grammatical corrections. Ensure the output is coherent, polished, and free from errors. Skip any preamble text and give the answer.

<document>{doc_text}</document>
<answer>
"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = Bedrock(client=bedrock, model_id="anthropic.claude-v2")
llm_chain = LLMChain(prompt=prompt, llm=llm)

try:
    txt = document[0].page_content
    std_op = llm_chain.run({"doc_text": txt})
    
    print("Extracted text")
    print("==============")
    print(txt)

    print("nCorrected text")
    print("==============")
    print(std_op.strip())
    print("n")
except Exception as e:
    print(str(e))

The output of the preceding code shows the original text extracted by the document loader followed by the corrected text generated by the LLM:

Extracted text
==============
Patient John Doe, who was ad mitta with sever pnequonia, has shown Signif i art improumet & can be safely discharged. Follow w/s are scheduled for nen week.

Corrected text
==============
Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-up appointments are scheduled for next week.

Keep in mind that as powerful as LLMs are, it’s essential to view their suggestions as just that—suggestions. Although they capture the intricacies of language impressively well, they aren’t infallible. Some suggestions might change the intended meaning or tone of the original text. Therefore, it’s crucial for human reviewers to use LLM-generated corrections as a guide, not an absolute. The collaboration of human intuition with LLM capabilities promises a future where our written communication is not just error-free, but also richer and more nuanced.

Conclusion

Generative AI is changing how you can process documents with IDP to derive insights. In the post Enhancing AWS intelligent document processing with generative AI, we discussed the various stages of the pipeline and how AWS customer Ricoh is enhancing their IDP pipeline with LLMs. In this post, we discussed various mechanisms of augmenting the IDP workflow with LLMs via Amazon Bedrock, Amazon Textract, and the popular LangChain framework. You can get started with the new Amazon Textract document loader with LangChain today using the sample notebooks available in our GitHub repository. For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.


About the Authors

Sonali Sahu is leading intelligent document processing with the AI/ML services team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.


T-Mobile US, Inc. uses artificial intelligence through Amazon Transcribe and Amazon Translate to deliver voicemail in the language of their customers’ choice

This post is co-authored by Dhurjati Brahma, Senior Systems Architect at T-Mobile US, Inc.; Jim Chao, Principal Engineer/Architect at T-Mobile US, Inc.; and Nicholas Zellerhoff, Associate Systems Architect at T-Mobile US, Inc.

T-Mobile US, Inc. provides a Voicemail to Text service to its customers, which allows customers to quickly read through their voicemails and respond to and manage messages in any order without having to dial into their voicemailbox. This service is delivered by the T-Mobile Voicemail system and uses Amazon Transcribe to convert voicemail messages to text. In 2023, T-Mobile launched the Voicemail to Text Translate feature. Powered by Amazon Translate, this feature lets customers request voicemail transcriptions in their language of choice from the native Visual Voicemail application available on major Android manufacturers’ devices, beginning with flagship devices and all future devices available from major device partners.

Native Visual Voicemail dialer featuring the Voicemail to Text Translate feature on all models (voicemail settings, voicemail translation options, and visual voicemail views)

The history

Two years ago, in 2021, T-Mobile engineering teams partnered with AWS to launch a new AI-powered feature called Voicemail to Text with automatic language detection and to improve quality and performance for customers. Voicemail to Text provided customers with the additional benefit of receiving voicemail transcriptions at no extra cost. Voicemail to Text is available in 39 different spoken languages and dialects and uses auto-language detection provided by Amazon Transcribe. These languages include English and Spanish, as well as many European, Middle Eastern, Asian, and African languages. The full list of supported languages can be found in the Amazon Transcribe documentation. Since its introduction, usage of the Voicemail to Text service has grown, and as of July 2023 it transcribes 126 million voicemail messages per month. T-Mobile has partnered with AWS to analyze the key application metrics of this service, such as language distribution, the number of messages per language, the daily total and unique active customers, and so on. This data helped in scaling the service and making key business decisions to improve the Voicemail to Text customer experience.

The challenge

Upon analysis of the weekly language distribution of voicemail transcriptions, T-Mobile observed that approximately 10 percent of voicemail transcriptions were received in a language other than US English, with Spanish being the predominant language.

Weekly language distribution for all languages

Weekly language distribution excluding US English and US Spanish

In addition, market research based on U.S. Census Bureau data showed that approximately 22 percent of the US population spoke a language other than English. This showed a need for a Voicemail to Text translation feature that could bridge this language gap and drove T-Mobile and AWS teams to brainstorm a solution. The idea was to give T-Mobile’s Voicemail to Text customers an option to receive voicemail transcriptions in the language of their choice delivered through SMS, email, or through the Visual Voicemail (VVM) application.

The solution

T-Mobile decided to use Amazon Translate, a neural machine translation (MT) service, to augment their existing Voicemail to Text service with real-time text translation between supported languages, because Amazon Translate provided the high-quality translation services required in the industry. T-Mobile already had their voicemail system connected to AWS through a private AWS Direct Connect link and was using the Amazon Transcribe API to get transcriptions. Following the same design pattern, T-Mobile added an integration with the Amazon Translate API to translate voicemail transcripts from the source language detected by Amazon Transcribe to the customers’ preferred language.
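
As an illustration (not T-Mobile’s production code), the following minimal sketch shows how a voicemail transcript can be translated with the Amazon Translate API using Boto3, passing the source language detected by Amazon Transcribe and the customer’s preferred target language:

import boto3

translate = boto3.client("translate", region_name="us-west-2")

transcript = "Hola, te llamo para confirmar nuestra cita de mañana."
result = translate.translate_text(
    Text=transcript,
    SourceLanguageCode="es",  # language code detected by Amazon Transcribe
    TargetLanguageCode="en",  # customer's preferred language
)
print(result["TranslatedText"])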

Here is a high-level architecture diagram that illustrates the T-Mobile Voicemail to Text Translate solution.

Scope of solution

From a customer perspective, to enable the Visual Voicemail Translate feature, a customer needs a Voicemail to Text feature service operator code (SOC) enabled on their mobile plan and should have one of the supported major Android manufacturer devices with the Translate feature API enabled. The customer can then visit the Visual Voicemail settings page to select a language from a list of 75 different languages and dialects supported by Amazon Translate. This will enable customers to receive voicemail transcription in the supported language of their choice.

The results

With Amazon Translate, T-Mobile was able to deliver a delightful new customer experience that accommodates the language preference of its customers and makes voicemail more accessible to people who speak various languages. This new capability helps to break language barriers by making it easier for speakers of different languages to communicate.

Conclusion

By using Amazon Transcribe and Amazon Translate language AI services, T-Mobile was able to enhance its voicemail service by delivering message transcriptions in a language that customers can understand. By choosing to use AWS managed AI services, T-Mobile was able to expedite the delivery of this new customer experience and avoid operational burdens of maintaining additional servers and software in their data centers. With Amazon Transcribe and Amazon Translate, Voicemail to Text and Voicemail to Text Translate services are delivered with low latency and high accuracy.

For more information, check out Getting started with Amazon Translate and Getting started with Amazon Transcribe to explore how you can use these services with your applications. Follow the Artificial Intelligence category on AWS Machine Learning Blog to stay up to date with new capabilities and use cases for various AWS AI services.


About the Authors

Dhurjati Brahma is a Senior Systems Architect at T-Mobile US, Inc. He has over 15 years of experience in designing, building, and managing robust and scalable virtualized messaging solutions within T-Mobile’s network. He is passionate about collaborating with various cross-functional teams and vendors to securely integrate T-Mobile’s messaging systems with the public cloud to launch exciting new products and services for T-Mobile’s customers. He holds a Master’s degree in Electrical Engineering from the University of Alabama at Birmingham. Outside of work, he enjoys going on hikes, listening to classical music, practicing meditation, and spending time with his family and friends.

Jim Chao is a Principal Engineer/Architect for Core Messaging Service Design at T-Mobile US, Inc. He has more than two decades of experience in the design and architecture of mobile messaging service systems and platforms. Lately, he has been dedicating his time to the next generation of messaging services using machine learning and generative AI. He holds a Master’s degree in Computer Information Systems. Outside of work, he spends time with family, engages in religious study and practice, and travels to religious sites above 5,000 meters in the mountains of Tibet.

Nicholas Zellerhoff is an Associate Systems Architect for T-Mobile as part of the Service Technology Development team functioning as lead development engineer for Native Visual Voicemail services. When not in office, Nick enjoys everything outdoors, from backyard BBQs with friends to backcountry hiking in the North Cascades.

Alex Bulatkin is a solutions architect at AWS. He enjoys helping communication service providers build innovative solutions in AWS that are redefining the telecom industry. He is passionate about working with customers on bringing the power of AWS AI services into their applications. Alex is based in the Denver metropolitan area and likes to hike, ski, and snowboard.

Prabhakaran Balasubramaniam is a Principal Customer Solutions Manager at AWS. He loves to help telecom customers leverage new technologies to solve their problems. Prabhakaran is based in the Dallas-Fort Worth area and likes sports.


From text to dream job: Building an NLP-based job recommender at Talent.com with Amazon SageMaker

This post is co-authored by Anatoly Khomenko, Machine Learning Engineer, and Abdenour Bezzouh, Chief Technology Officer at Talent.com.

Founded in 2011, Talent.com is one of the world’s largest sources of employment. The company combines paid job listings from their clients with public job listings into a single searchable platform. With over 30 million jobs listed in more than 75 countries, Talent.com serves jobs across many languages, industries, and distribution channels. The result is a platform that matches millions of job seekers with available jobs.

Talent.com’s mission is to centralize all jobs available on the web to help job seekers find their best match while providing them with the best search experience. Its focus is on relevancy, because the order of the recommended jobs is vitally important to show the jobs most pertinent to users’ interests. The performance of Talent.com’s matching algorithm is paramount to the success of the business and a key contributor to their users’ experience. It’s challenging to predict which jobs are pertinent to a job seeker based on the limited amount of information provided, usually contained to a few keywords and a location.

Given this mission, Talent.com and AWS joined forces to create a job recommendation engine using state-of-the-art natural language processing (NLP) and deep learning model training techniques with Amazon SageMaker to provide an unrivaled experience for job seekers. This post shows our joint approach to designing a job recommendation system, including feature engineering, deep learning model architecture design, hyperparameter optimization, and model evaluation that ensures the reliability and effectiveness of our solution for both job seekers and employers. The system is developed by a team of dedicated applied machine learning (ML) scientists, ML engineers, and subject matter experts in collaboration between AWS and Talent.com.

The recommendation system has driven an 8.6% increase in clickthrough rate (CTR) in online A/B testing against a previous XGBoost-based solution, helping connect millions of Talent.com’s users to better jobs.

Overview of solution

An overview of the system is illustrated in the following figure. The system takes a user’s search query as input and outputs a ranked list of jobs in order of pertinence. Job pertinence is measured by the click probability (the probability of a job seeker clicking on a job for more information).

The system includes four main components:

  • Model architecture – The core of this job recommendation engine is a deep learning-based Triple Tower Pointwise model, which includes a query encoder that encodes user search queries, a document encoder that encodes the job descriptions, and an interaction encoder that processes the past user-job interaction features. The outputs of the three towers are concatenated and passed through a classification head to predict the job’s click probabilities. By training this model on search queries, job specifics, and historical user interaction data from Talent.com, this system provides personalized and highly relevant job recommendations to job seekers.
  • Feature engineering – We perform two sets of feature engineering to extract valuable information from input data and feed it into the corresponding towers in the model. The two sets are standard feature engineering and fine-tuned Sentence-BERT (SBERT) embeddings. We use the standard engineered features as input into the interaction encoder and feed the SBERT derived embedding into the query encoder and document encoder.
  • Model optimization and tuning – We utilize advanced training methodologies to train, test, and deploy the system with SageMaker. This includes SageMaker Distributed Data Parallel (DDP) training, SageMaker Automatic Model Tuning (AMT), learning rate scheduling, and early stopping to improve model performance and training speed. Using the DDP training framework helped speed up our model training to approximately eight times faster.
  • Model evaluation – We conduct both offline and online evaluation. We evaluate the model performance with Area Under the Curve (AUC) and Mean Average Precision at K (mAP@K) in offline evaluation. During online A/B testing, we evaluate the CTR improvements.

In the following sections, we present the details of these four components.

Deep learning model architecture design

We design a Triple Tower Deep Pointwise (TTDP) model using a triple-tower deep learning architecture and the pointwise pair modeling approach. The triple-tower architecture provides three parallel deep neural networks, with each tower processing a set of features independently. This design pattern allows the model to learn distinct representations from different sources of information. After the representations from all three towers are obtained, they are concatenated and passed through a classification head to make the final prediction (0–1) on the click probability (a pointwise modeling setup).

The three towers are named based on the information they process: the query encoder processes the user search query, the document encoder processes the candidate job’s documentational contents including the job title and company name, and the interaction encoder uses relevant features extracted from past user interactions and history (discussed more in the next section).

Each of these towers plays a crucial role in learning how to recommend jobs:

  • Query encoder – The query encoder takes in the SBERT embeddings derived from the user’s job search query. We enhance the embeddings through an SBERT model we fine-tuned. This encoder processes and understands the user’s job search intent, including details and nuances captured by our domain-specific embeddings.
  • Document encoder – The document encoder processes the information of each job listing. Specifically, it takes the SBERT embeddings of the concatenated text from the job title and company. The intuition is that users will be more interested in candidate jobs that are more relevant to the search query. By mapping the jobs and the search queries to the same vector space (defined by SBERT), the model can learn to predict the probability of the potential jobs a job seeker will click.
  • Interaction encoder – The interaction encoder deals with the user’s past interactions with job listings. The features are produced via a standard feature engineering step, which includes calculating popularity metrics for job roles and companies, establishing context similarity scores, and extracting interaction parameters from previous user engagements. It also processes the named entities identified in the job title and search queries with a pre-trained named entity recognition (NER) model.

Each tower generates an independent output in parallel, all of which are then concatenated together. This combined feature vector is then passed to predict the click probability of a job listing for a user query. The triple-tower architecture provides flexibility in capturing complex relationships between different inputs or features, allowing the model to take advantage of the strengths of each tower while learning more expressive representations for the given task.

Candidate jobs’ predicted click probabilities are ranked from high to low, generating personalized job recommendations. Through this process, we ensure that each piece of information—whether it’s the user’s search intent, job listing details, or past interactions—is fully captured by a specific tower dedicated to it. The complex relationships between them are also captured through the combination of the tower outputs.
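
To make the triple-tower pointwise design concrete, the following is an illustrative PyTorch sketch (not the production architecture); the embedding and hidden dimensions are arbitrary placeholders:

import torch
import torch.nn as nn

class TripleTowerPointwise(nn.Module):
    def __init__(self, query_dim=768, doc_dim=768, interaction_dim=64, hidden=256):
        super().__init__()
        # One small feed-forward tower per input source
        self.query_tower = nn.Sequential(nn.Linear(query_dim, hidden), nn.ReLU())
        self.doc_tower = nn.Sequential(nn.Linear(doc_dim, hidden), nn.ReLU())
        self.interaction_tower = nn.Sequential(nn.Linear(interaction_dim, hidden), nn.ReLU())
        # Classification head applied to the concatenated tower outputs
        self.head = nn.Sequential(nn.Linear(hidden * 3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, query_emb, doc_emb, interaction_feats):
        q = self.query_tower(query_emb)
        d = self.doc_tower(doc_emb)
        i = self.interaction_tower(interaction_feats)
        concat = torch.cat([q, d, i], dim=-1)
        return torch.sigmoid(self.head(concat))  # predicted click probability

# Example forward pass with random inputs for a batch of 4 query-job pairs
model = TripleTowerPointwise()
p_click = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 64))
print(p_click.shape)  # torch.Size([4, 1])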

Feature engineering

We perform two sets of feature engineering processes to extract valuable information from the raw data and feed it into the corresponding towers in the model: standard feature engineering and fine-tuned SBERT embeddings.

Standard feature engineering

Our data preparation process begins with standard feature engineering. Overall, we define four types of features:

  • Popularity – We calculate popularity scores at the individual job level, occupation level, and company level. This provides a metric of how attractive a particular job or company might be.
  • Textual similarity – To understand the contextual relationship between different textual elements, we compute similarity scores, including string similarity between the search query and the job title. This helps us gauge the relevance of a job opening to a job seeker’s search or application history.
  • Interaction – In addition, we extract interaction features from past user engagements with job listings. A prime example of this is the embedding similarity between past clicked job titles and candidate job titles. This measure helps us understand the similarity between previous jobs a user has shown interest in vs. upcoming job opportunities. This enhances the precision of our job recommendation engine.
  • Profile – Lastly, we extract user-defined job interest information from the user profile and compare it with new job candidates. This helps us understand if a job candidate matches a user’s interest.

A crucial step in our data preparation is the application of a pre-trained NER model. By implementing an NER model, we can identify and label named entities within job titles and search queries. Consequently, this allows us to compute similarity scores between these identified entities, providing a more focused and context-aware measure of relatedness. This methodology reduces the noise in our data and gives us a more nuanced, context-sensitive method of comparing jobs.
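
As a simple illustration of this step, the following sketch uses a generic pre-trained NER pipeline from Hugging Face Transformers (not the model used in the production pipeline) to label entities in a job title and a search query so that entity-level similarity scores can be computed downstream:

from transformers import pipeline

# Load a generic pre-trained NER pipeline; entities are aggregated into whole words
ner = pipeline("ner", aggregation_strategy="simple")

job_title = "Senior Python Developer at Acme Corp in New York"
search_query = "Python Developer in New York"

# Each result contains the entity text, its type (for example ORG or LOC), and a confidence score
print([(e["word"], e["entity_group"]) for e in ner(job_title)])
print([(e["word"], e["entity_group"]) for e in ner(search_query)])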

Fine-tuned SBERT embeddings

To enhance the relevance and accuracy of our job recommendation system, we use the power of SBERT, a powerful transformer-based model, known for its proficiency in capturing semantic meanings and contexts from text. However, generic embeddings like SBERT, although effective, may not fully capture the unique nuances and terminologies inherent in a specific domain such as ours, which centers around employment and job searches. To overcome this, we fine-tune the SBERT embeddings using our domain-specific data. This fine-tuning process optimizes the model to better understand and process the industry-specific language, jargon, and context, making the embeddings more reflective of our specific domain. As a result, the refined embeddings offer improved performance in capturing both semantic and contextual information within our sphere, leading to more accurate and meaningful job recommendations for our users.

The following figure illustrates the SBERT fine-tuning step.

We fine-tune SBERT embeddings using TripletLoss with a cosine distance metric that learns text embeddings where anchor and positive texts have a higher cosine similarity than anchor and negative texts. We use users’ search queries as anchor texts. We combine job titles and employer names as inputs to the positive and negative texts. The positive texts are sampled from job postings that the corresponding user clicked on, whereas the negative texts are sampled from job postings that the user did not click on. The following is a sample implementation of the fine-tuning procedure:

import math
from datetime import datetime

from torch.utils.data import DataLoader
from sentence_transformers import (SentenceTransformer, SentencesDataset,
                                   LoggingHandler, losses)
from sentence_transformers.readers import InputExample

model_name = 'all-mpnet-base-v2'
train_batch_size = 16
num_epochs = 1
model_save_path = (f'output/{model_name}_'+
                   datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))

### load pre-trained SBERT model
model = SentenceTransformer(model_name, device="cuda")

### construct training dataset of triplet texts,
### stored in three lists (anchors, positives, negatives)
train_examples = []
for anchor, positive, negative in zip(anchors, positives, negatives):
    train_examples.append(InputExample(texts=(anchor, positive, negative)))

train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True,
                              batch_size=train_batch_size)

### use TripletLoss with cosine distance metric and margin=0.5
distance_metric = losses.TripletDistanceMetric.COSINE
train_loss = losses.TripletLoss(model=model,
                                distance_metric=distance_metric,
                                triplet_margin=0.5)

### 10% of train data for warm-up
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)

# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
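
After training completes, the fine-tuned model can be loaded from the output path and used to embed queries and job texts. The following brief usage sketch (with illustrative example texts) computes a cosine similarity score between a query and a candidate job:

from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model saved by model.fit() above
fine_tuned_model = SentenceTransformer(model_save_path)

# Illustrative query and job text (job title plus employer name)
query_embedding = fine_tuned_model.encode("machine learning engineer",
                                          convert_to_tensor=True)
job_embedding = fine_tuned_model.encode("Senior ML Engineer - Example Corp",
                                        convert_to_tensor=True)

# Cosine similarity between the query and the candidate job
print(util.cos_sim(query_embedding, job_embedding).item())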

Model training with SageMaker Distributed Data Parallel

We use SageMaker Distributed Data Parallel (SMDDP), a feature of the SageMaker ML platform that is built on top of PyTorch DDP. It provides an optimized environment for running PyTorch DDP training jobs on the SageMaker platform. It’s designed to significantly speed up deep learning model training. It accomplishes this by splitting a large dataset into smaller chunks and distributing them across multiple GPUs. The model is replicated on every GPU. Each GPU processes its assigned data independently, and the results are collated and synchronized across all GPUs. DDP takes care of gradient communication to keep model replicas synchronized and overlaps them with gradient computations to speed up training. SMDDP utilizes an optimized AllReduce algorithm to minimize communication between GPUs, reducing synchronization time and improving overall training speed. The algorithm adapts to different network conditions, making it highly efficient for both on-premises and cloud-based environments. In the SMDDP architecture (as shown in the following figure), distributed training is also scaled using a cluster of many nodes. This means not just multiple GPUs in a computing instance, but many instances with multiple GPUs, which further speeds up training.

For more information about this architecture, refer to Introduction to SageMaker’s Distributed Data Parallel Library.

With SMDDP, we have been able to substantially reduce the training time for our TTDP model, making it eight times faster. Faster training times mean we can iterate and improve our models more quickly, leading to better job recommendations for our users in a shorter amount of time. This efficiency gain is instrumental in maintaining the competitiveness of our job recommendation engine in a fast-evolving job market.

You can adapt your training script to use SMDDP with only three lines of code, as shown in the following code block. Using PyTorch as an example, the only thing you need to do is import the SMDDP library’s PyTorch client (smdistributed.dataparallel.torch.torch_smddp). The client registers smddp as a backend for PyTorch.

import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist

dist.init_process_group(backend='smddp')

After you have a working PyTorch script that is adapted to use the distributed data parallel library, you can launch a distributed training job using the SageMaker Python SDK.
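
The following is a minimal sketch of such a launch using the SageMaker Python SDK. The entry point script, IAM role, instance settings, framework versions, and S3 paths are placeholders for your own values, not the exact configuration used for the TTDP model.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # your DDP-adapted training script
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    framework_version="1.13",        # example framework/Python versions
    py_version="py39",
    instance_count=2,                # multiple instances, each with multiple GPUs
    instance_type="ml.p4d.24xlarge",
    # Enable the SageMaker distributed data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"train": "s3://my-bucket/train/"})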

Evaluating model performance

When evaluating the performance of a recommendation system, it’s crucial to choose metrics that align closely with business goals and provide a clear understanding of the model’s effectiveness. In our case, we use the AUC to evaluate our TTDP model’s job click prediction performance and the mAP@K to assess the quality of the final ranked jobs list.

The AUC refers to the area under the receiver operating characteristic (ROC) curve. It represents the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example. It ranges from 0 to 1, where 1 indicates an ideal classifier and 0.5 represents a random guess. mAP@K is a metric commonly used to assess the quality of information retrieval systems, such as our job recommender engine. It measures the average precision of retrieving the top K relevant items for a given query or user, and ranges from 0 to 1, with 1 indicating optimal ranking and 0 indicating the lowest possible precision at the given K value. We evaluate the AUC, mAP@1, and mAP@3. Collectively, these metrics allow us to gauge the model’s ability to distinguish between positive and negative classes (AUC) and its success at ranking the most relevant items at the top (mAP@K).
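
As a rough illustration of how these metrics can be computed offline (the exact evaluation code isn’t shown in this post), the following sketch uses scikit-learn for the AUC and a simple implementation of AP@K, applied to toy labels and rankings:

import numpy as np
from sklearn.metrics import roc_auc_score

def average_precision_at_k(relevant: set, ranked: list, k: int) -> float:
    """AP@K for one user: precision accumulated at each relevant hit in the top K."""
    hits, score = 0, 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision_at_k(relevant_per_user, ranked_per_user, k):
    return np.mean([average_precision_at_k(rel, ranked, k)
                    for rel, ranked in zip(relevant_per_user, ranked_per_user)])

# Toy example: click labels and predicted scores for AUC,
# plus ranked job lists and clicked jobs for mAP@K
auc = roc_auc_score([1, 0, 1, 0], [0.9, 0.3, 0.6, 0.4])
map_at_3 = mean_average_precision_at_k([{"job_a"}, {"job_c"}],
                                       [["job_a", "job_b", "job_c"],
                                        ["job_d", "job_c", "job_a"]], k=3)
print(auc, map_at_3)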

Based on our offline evaluation, the TTDP model outperformed the baseline model—the existing XGBoost-based production model—by 16.65% for AUC, 20% for mAP@1, and 11.82% for mAP@3.

Furthermore, we designed an online A/B test to evaluate the proposed system and ran the test on a percentage of the US email population for 6 weeks. In total, approximately 22 million emails were sent using the jobs recommended by the new system. The resulting uplift in clicks compared to the previous production model was 8.6%. Talent.com is gradually increasing the rollout percentage to bring the new system to its complete population and channels.

Conclusion

Creating a job recommendation system is a complex endeavor. Each job seeker has unique needs, preferences, and professional experiences that can’t be inferred from a short search query. In this post, Talent.com collaborated with AWS to develop an end-to-end deep learning-based job recommender solution that ranks lists of jobs to recommend to users. The Talent.com team truly enjoyed collaborating with the AWS team throughout the process of solving this problem. This marks a significant milestone in Talent.com’s transformative journey, as the team takes advantage of the power of deep learning to empower its business.

For this project, we fine-tuned SBERT to generate text embeddings. At the time of writing, AWS introduced Amazon Titan Embeddings as part of the foundation models (FMs) offered through Amazon Bedrock, a fully managed service providing a selection of high-performing foundation models from leading AI companies. We encourage readers to explore the machine learning techniques presented in this post and to use the capabilities provided by AWS, such as SMDDP, along with Amazon Bedrock foundation models, to build their own search functionalities.

About the authors

Yi Xiang is an Applied Scientist II at the Amazon Machine Learning Solutions Lab, where she helps AWS customers across different industries accelerate their AI and cloud adoption.

Tong Wang is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Dmitriy Bespalov is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Anatoly Khomenko is a Senior Machine Learning Engineer at Talent.com with a passion for natural language processing and for matching good people to good jobs.

Abdenour Bezzouh is an executive with more than 25 years of experience building and delivering technology solutions that scale to millions of customers. Abdenour held the position of Chief Technology Officer (CTO) at Talent.com when the AWS team designed and executed this particular solution for Talent.com.

Dale Jacques is a Senior AI Strategist within the Generative AI Innovation Center where he helps AWS customers translate business problems into AI solutions.

Yanjun Qi is a Senior Applied Science Manager at the Amazon Machine Learning Solutions Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.

Read More

Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker

Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker

Customers of every size and industry are innovating on AWS by infusing machine learning (ML) into their products and services. Recent developments in generative AI models have further accelerated the need for ML adoption across industries. However, implementing security, data privacy, and governance controls remains a key challenge for customers when implementing ML workloads at scale. Addressing those challenges builds the framework and foundations for mitigating risk and enabling responsible use of ML-driven products. Although generative AI may need additional controls in place, such as removing toxicity and preventing jailbreaking and hallucinations, it shares the same foundational components for security and governance as traditional ML.

We hear from customers that they require specialized knowledge and investment of up to 12 months for building out their customized Amazon SageMaker ML platform implementation to ensure scalable, reliable, secure, and governed ML environments for their lines of business (LOBs) or ML teams. If you lack a framework for governing the ML lifecycle at scale, you may run into challenges such as team-level resource isolation, scaling experimentation resources, operationalizing ML workflows, scaling model governance, and managing security and compliance of ML workloads.

Governing the ML lifecycle at scale is a framework to help you build an ML platform with embedded security and governance controls based on industry best practices and enterprise standards. This framework addresses those challenges by providing prescriptive guidance through a modular approach that extends an AWS Control Tower multi-account AWS environment and the approach discussed in the post Setting up secure, well-governed machine learning environments on AWS.

It provides prescriptive guidance for the following ML platform functions:

  • Multi-account, security, and networking foundations – This function uses AWS Control Tower and well-architected principles for setting up and operating the multi-account environment and the security and networking services.
  • Data and governance foundations – This function uses a data mesh architecture for setting up and operating the data lake, central feature store, and data governance foundations to enable fine-grained data access.
  • ML platform shared and governance services – This function enables setting up and operating common services such as CI/CD, AWS Service Catalog for provisioning environments, and a central model registry for model promotion and lineage.
  • ML team environments – This function enables setting up and operating environments for ML teams for model development, testing, and deploying their use cases for embedding security and governance controls.
  • ML platform observability – This function helps with troubleshooting and identifying the root cause for problems in ML models through centralization of logs and providing tools for log analysis visualization. It also provides guidance for generating cost and usage reports for ML use cases.

Although this framework can provide benefits to all customers, it’s most beneficial for large, mature, regulated, or global enterprise customers that want to scale their ML strategies in a controlled, compliant, and coordinated way across the organization. It helps enable ML adoption while mitigating risks. This framework is useful for the following customers:

  • Large enterprise customers that have many LOBs or departments interested in using ML. This framework allows different teams to build and deploy ML models independently while providing central governance.
  • Enterprise customers with a moderate to high maturity in ML. They have already deployed some initial ML models and are looking to scale their ML efforts. This framework can help accelerate ML adoption across the organization. These companies also recognize the need for governance to manage things like access control, data usage, model performance, and unfair bias.
  • Companies in regulated industries such as financial services, healthcare, chemistry, and the private sector. These companies need strong governance and auditability for any ML models used in their business processes. Adopting this framework can help facilitate compliance while still allowing for local model development.
  • Global organizations that need to balance centralized and local control. This framework’s federated approach allows the central platform engineering team to set some high-level policies and standards, but also gives LOB teams flexibility to adapt based on local needs.

In the first part of this series, we walk through the reference architecture for setting up the ML platform. In a later post, we will provide prescriptive guidance for how to implement the various modules in the reference architecture in your organization.

The capabilities of the ML platform are grouped into four categories, as shown in the following figure. These capabilities form the foundation of the reference architecture discussed later in this post:

  • Build ML foundations
  • Scale ML operations
  • Observable ML
  • Secure ML

Solution overview

The framework for governing the ML lifecycle at scale enables organizations to embed security and governance controls throughout the ML lifecycle, which in turn helps organizations reduce risk and accelerate infusing ML into their products and services. The framework helps optimize the setup and governance of secure, scalable, and reliable ML environments that can scale to support an increasing number of models and projects. The framework enables the following features:

  • Account and infrastructure provisioning with organization policy compliant infrastructure resources
  • Self-service deployment of data science environments and end-to-end ML operations (MLOps) templates for ML use cases
  • LOB-level or team-level isolation of resources for security and privacy compliance
  • Governed access to production-grade data for experimentation and production-ready workflows
  • Management and governance for code repositories, code pipelines, deployed models, and data features
  • A model registry and feature store (local and central components) for improving governance
  • Security and governance controls for the end-to-end model development and deployment process

In this section, we provide an overview of prescriptive guidance to help you build this ML platform on AWS with embedded security and governance controls.

The functional architecture associated with the ML platform is shown in the following diagram. The architecture maps the different capabilities of the ML platform to AWS accounts.

The functional architecture with different capabilities is implemented using a number of AWS services, including AWS Organizations, SageMaker, AWS DevOps services, and a data lake. The reference architecture for the ML platform with various AWS services is shown in the following diagram.

This framework considers multiple personas and services to govern the ML lifecycle at scale. We recommend the following steps to organize your teams and services:

  1. Using AWS Control Tower and automation tooling, your cloud administrator sets up the multi-account foundations such as Organizations and AWS IAM Identity Center (successor to AWS Single Sign-On) and security and governance services such as AWS Key Management Service (AWS KMS) and Service Catalog. In addition, the administrator sets up a variety of organizational units (OUs) and initial accounts to support your ML and analytics workflows.
  2. Data lake administrators set up your data lake and data catalog, and set up the central feature store working with the ML platform admin.
  3. The ML platform admin provisions ML shared services such as AWS CodeCommit, AWS CodePipeline, Amazon Elastic Container Registry (Amazon ECR), a central model registry, SageMaker Model Cards, SageMaker Model Dashboard, and Service Catalog products for ML teams.
  4. The ML team lead federates via IAM Identity Center, uses Service Catalog products, and provisions resources in the ML team’s development environment.
  5. Data scientists from ML teams across different business units federate into their team’s development environment to build the model pipeline.
  6. Data scientists search and pull features from the central feature store catalog, build models through experiments, and select the best model for promotion.
  7. Data scientists create and share new features into the central feature store catalog for reuse.
  8. An ML engineer deploys the model pipeline into the ML team test environment using a shared services CI/CD process.
  9. After stakeholder validation, the ML model is deployed to the team’s production environment.
  10. Security and governance controls are embedded into every layer of this architecture using services such as AWS Security Hub, Amazon GuardDuty, Amazon Macie, and more.
  11. Security controls are centrally managed from the security tooling account using Security Hub.
  12. ML platform governance capabilities such as SageMaker Model Cards and SageMaker Model Dashboard are centrally managed from the governance services account.
  13. Amazon CloudWatch and AWS CloudTrail logs from each member account are made accessible centrally from an observability account using AWS native services.

Next, we dive deep into the modules of the reference architecture for this framework.

Reference architecture modules

The reference architecture comprises eight modules, each designed to solve a specific set of problems. Collectively, these modules address governance across various dimensions, such as infrastructure, data, model, and cost. Each module offers a distinct set of functions and interoperates with other modules to provide an integrated end-to-end ML platform with embedded security and governance controls. In this section, we present a short summary of each module’s capabilities.

Multi-account foundations

This module helps cloud administrators build an AWS Control Tower landing zone as a foundational framework. This includes building a multi-account structure, authentication and authorization via IAM Identity Center, a network hub-and-spoke design, centralized logging services, and new AWS member accounts with standardized security and governance baselines.

In addition, this module gives best practice guidance on OU and account structures that are appropriate for supporting your ML and analytics workflows. Cloud administrators will understand the purpose of the required accounts and OUs, how to deploy them, and key security and compliance services they should use to centrally govern their ML and analytics workloads.

A framework for vending new accounts is also covered, which uses automation for baselining new accounts when they are provisioned. By having an automated account provisioning process set up, cloud administrators can provide ML and analytics teams the accounts they need to perform their work more quickly, without sacrificing on a strong foundation for governance.

Data lake foundations

This module helps data lake admins set up a data lake to ingest data, curate datasets, and use the AWS Lake Formation governance model for managing fine-grained data access across accounts and users using a centralized data catalog, data access policies, and tag-based access controls. You can start small with one account for your data platform foundations for a proof of concept or a few small workloads. For medium-to-large-scale production workload implementation, we recommend adopting a multi-account strategy. In such a setting, LOBs can assume the role of data producers and data consumers using different AWS accounts, and the data lake governance is operated from a central shared AWS account. The data producer collects, processes, and stores data from their data domain, in addition to monitoring and ensuring the quality of their data assets. Data consumers consume the data from the data producer after the centralized catalog shares it using Lake Formation. The centralized catalog stores and manages the shared data catalog for the data producer accounts.
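
For example, after a table has been shared with a consumer account, fine-grained access can be granted programmatically through Lake Formation. The following minimal sketch grants SELECT on a shared table; the account IDs, database, and table names are placeholders.

import boto3

lakeformation = boto3.client("lakeformation")

# Illustrative grant: give a consumer account SELECT access to a table
# registered in the central data governance account's catalog
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "222222222222"},  # consumer account
    Resource={
        "Table": {
            "CatalogId": "111111111111",      # central data governance account
            "DatabaseName": "sales_data_domain",
            "Name": "orders_curated",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)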

ML platform services

This module helps the ML platform engineering team set up shared services that are used by the data science teams on their team accounts. The services include a Service Catalog portfolio with products for SageMaker domain deployment, SageMaker domain user profile deployment, and data science model templates for model building and deployment. This module also provides functionality for a centralized model registry, model cards, a model dashboard, and the CI/CD pipelines used to orchestrate and automate model development and deployment workflows.

In addition, this module details how to implement the controls and governance required to enable persona-based self-service capabilities, allowing data science teams to independently deploy their required cloud infrastructure and ML templates.
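
As a minimal sketch of the central model registry described above, the following code creates a model package group in the shared services account and attaches a resource policy so that an LOB account can register and read model versions. The names, account IDs, and Region are placeholders, and your actual policy should reflect your organization’s controls.

import json
import boto3

sagemaker_client = boto3.client("sagemaker")

group_name = "central-model-group"  # placeholder name

sagemaker_client.create_model_package_group(
    ModelPackageGroupName=group_name,
    ModelPackageGroupDescription="Central registry group shared with LOB accounts",
)

# Resource policy allowing an LOB account to register and describe model versions
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowLobAccount",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
        "Action": [
            "sagemaker:DescribeModelPackageGroup",
            "sagemaker:DescribeModelPackage",
            "sagemaker:ListModelPackages",
            "sagemaker:CreateModelPackage",
        ],
        "Resource": [
            f"arn:aws:sagemaker:us-east-1:111111111111:model-package-group/{group_name}",
            f"arn:aws:sagemaker:us-east-1:111111111111:model-package/{group_name}/*",
        ],
    }],
}

sagemaker_client.put_model_package_group_policy(
    ModelPackageGroupName=group_name,
    ResourcePolicy=json.dumps(policy),
)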

ML use case development

This module helps LOBs and data scientists access their team’s SageMaker domain in a development environment and instantiate a model building template to develop their models. In this module, data scientists work on a dev account instance of the template to interact with the data available on the centralized data lake, reuse and share features from a central feature store, create and run ML experiments, build and test their ML workflows, and register their models to a dev account model registry in their development environments.

Capabilities such as experiment tracking, model explainability reports, data and model bias monitoring, and model registry are also implemented in the templates, allowing for rapid adaptation of the solutions to the data scientists’ developed models.

ML operations

This module helps LOBs and ML engineers work on their dev instances of the model deployment template. After the candidate model is registered and approved, they set up CI/CD pipelines and run ML workflows in the team’s test environment, which registers the model into the central model registry running in a platform shared services account. When a model is approved in the central model registry, this triggers a CI/CD pipeline to deploy the model into the team’s production environment.
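
One way to wire up the "model approved triggers deployment" step is an Amazon EventBridge rule on SageMaker model package state changes. The following sketch is illustrative; the rule name, target pipeline ARN, and role ARN are placeholders, and your CI/CD tooling may differ.

import json
import boto3

events = boto3.client("events")

# Rule that fires when a model package is approved in the registry
events.put_rule(
    Name="central-registry-model-approved",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {"ModelApprovalStatus": ["Approved"]},
    }),
    State="ENABLED",
)

# Start the deployment pipeline as the rule's target
events.put_targets(
    Rule="central-registry-model-approved",
    Targets=[{
        "Id": "deploy-pipeline",
        "Arn": "arn:aws:codepipeline:us-east-1:111111111111:ml-deploy-pipeline",
        "RoleArn": "arn:aws:iam::111111111111:role/EventBridgeInvokePipelineRole",
    }],
)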

Centralized feature store

After the first models are deployed to production and multiple use cases start to share features created from the same data, a feature store becomes essential to ensure collaboration across use cases and reduce duplicate work. This module helps the ML platform engineering team set up a centralized feature store to provide storage and governance for ML features created by the ML use cases, enabling feature reuse across projects.

Logging and observability

This module helps LOBs and ML practitioners gain visibility into the state of ML workloads across ML environments through centralization of log activity such as CloudTrail, CloudWatch, VPC flow logs, and ML workload logs. Teams can filter, query, and visualize logs for analysis, which can help enhance security posture as well.

Cost and reporting

This module helps various stakeholders (cloud admin, platform admin, cloud business office) to generate reports and dashboards to break down costs at ML user, ML team, and ML product levels, and track usage such as number of users, instance types, and endpoints.

Customers have asked us to provide guidance on how many accounts to create and how to structure those accounts. In the next section, we provide guidance on that account structure as reference that you can modify to suit your needs according to your enterprise governance requirements.

Reference account structure

In this section, we discuss our recommendation for organizing your account structure. We share a baseline reference account structure; however, we recommend ML and data admins work closely with their cloud admin to customize this account structure based on their organization controls.

We recommend organizing accounts by OU for security, infrastructure, workloads, and deployments. Furthermore, within each OU, organize accounts into non-production and production OUs because the accounts and workloads deployed under them have different controls. Next, we briefly discuss those OUs.

Security OU

The accounts in this OU are managed by the organization’s cloud admin or security team for monitoring, identifying, protecting against, detecting, and responding to security events.

Infrastructure OU

The accounts in this OU are managed by the organization’s cloud admin or network team for managing enterprise-level infrastructure shared resources and networks.

We recommend having the following accounts under the infrastructure OU:

  • Network – Set up a centralized networking infrastructure such as AWS Transit Gateway
  • Shared services – Set up centralized AD services and VPC endpoints

Workloads OU

The accounts in this OU are managed by the organization’s platform team admins. If you need different controls implemented for each platform team, you can nest other levels of OU for that purpose, such as an ML workloads OU, data workloads OU, and so on.

We recommend the following accounts under the workloads OU:

  • Team-level ML dev, test, and prod accounts – Set this up based on your workload isolation requirements
  • Data lake accounts – Partition accounts by your data domain
  • Central data governance account – Centralize your data access policies
  • Central feature store account – Centralize features for sharing across teams

Deployments OU

The accounts in this OU are managed by the organization’s platform team admins for deploying workloads and observability.

We recommend the following accounts under the deployments OU because the ML platform team can set up different sets of controls at this OU level to manage and govern deployments:

  • ML shared services accounts for test and prod – Hosts platform shared services CI/CD and model registry
  • ML observability accounts for test and prod – Hosts CloudWatch logs, CloudTrail logs, and other logs as needed

Next, we briefly discuss organization controls that need to be considered for embedding into member accounts for monitoring the infrastructure resources.

AWS environment controls

A control is a high-level rule that provides ongoing governance for your overall AWS environment. It’s expressed in plain language. In this framework, we use AWS Control Tower to implement the following controls that help you govern your resources and monitor compliance across groups of AWS accounts:

  • Preventive controls – A preventive control helps ensure that your accounts maintain compliance because it disallows actions that lead to policy violations, and it is implemented using a Service Control Policy (SCP). For example, you can set a preventive control that ensures that CloudTrail is not deleted or stopped in AWS accounts or Regions (see the sketch after this list).
  • Detective controls – A detective control detects noncompliance of resources within your accounts, such as policy violations, provides alerts through the dashboard, and is implemented using AWS Config rules. For example, you can create a detective control to detect whether public read access is enabled for the Amazon Simple Storage Service (Amazon S3) buckets in the log archive shared account.
  • Proactive controls – A proactive control scans your resources before they are provisioned, makes sure that the resources are compliant with that control, and is implemented using AWS CloudFormation hooks. Resources that aren’t compliant will not be provisioned. For example, you can set a proactive control that checks that direct internet access is not allowed for a SageMaker notebook instance.
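
As an illustration of the preventive control example above, the following sketch creates and attaches an SCP that denies stopping or deleting CloudTrail trails using the AWS Organizations API. The OU ID is a placeholder, and AWS Control Tower offers this control as a managed guardrail, so you typically don’t need to author it yourself.

import json
import boto3

organizations = boto3.client("organizations")

# Illustrative SCP that denies stopping or deleting CloudTrail trails
scp_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
        "Resource": "*",
    }],
}

policy = organizations.create_policy(
    Name="DenyCloudTrailDisable",
    Description="Preventive control: CloudTrail cannot be stopped or deleted",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)

# Attach the SCP to an OU (placeholder ID) so it applies to all accounts under it
organizations.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid-exampleouid",
)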

Interactions between ML platform services, ML use cases, and ML operations

Different personas, such as the head of data science (lead data scientist), data scientist, and ML engineer, operate modules 2–6 as shown in the following diagram for different stages of ML platform services, ML use case development, and ML operations along with data lake foundations and the central feature store.

The following list summarizes the ops flow activities and the corresponding setup flow steps for different personas. After a persona initiates an ML activity as part of the ops flow, the services run as described in the setup flow steps.

Lead Data Scientist or ML Team Lead

  • Ops flow activity 1 – Uses Service Catalog in the ML platform services account and deploys the ML infrastructure, SageMaker projects, and the SageMaker model registry
    • Setup flow step 1-A – Sets up the dev, test, and prod environments for LOBs and sets up SageMaker Studio in the ML platform services account
    • Setup flow step 1-B – Sets up SageMaker Studio with the required configuration

Data Scientist

  • Ops flow activity 2 – Conducts and tracks ML experiments in SageMaker notebooks
    • Setup flow step 2-A – Uses data from Lake Formation and saves features in the central feature store
  • Ops flow activity 3 – Automates successful ML experiments with SageMaker projects and pipelines
    • Setup flow step 3-A – Initiates SageMaker pipelines (preprocess, train, evaluate) in the dev account and initiates the build CI/CD process with CodePipeline in the dev account
    • Setup flow step 3-B – After the SageMaker pipelines run, saves the model in the local (dev) model registry

Lead Data Scientist or ML Team Lead

  • Ops flow activity 4 – Approves the model in the local (dev) model registry
    • Setup flow step 4-A – Writes the model metadata and model package from the local (dev) model registry to the central model registry
  • Ops flow activity 5 – Approves the model in the central model registry
    • Setup flow step 5-A – Initiates the deployment CI/CD process to create SageMaker endpoints in the test environment
    • Setup flow step 5-B – Writes the model information and metadata to the ML governance module (model card, model dashboard) in the ML platform services account from the local (dev) account

ML Engineer

  • Ops flow activity 6 – Tests and monitors the SageMaker endpoint in the test environment after CI/CD
  • Ops flow activity 7 – Approves deployment for SageMaker endpoints in the prod environment
    • Setup flow step 7-A – Initiates the deployment CI/CD process to create SageMaker endpoints in the prod environment
  • Ops flow activity 8 – Tests and monitors the SageMaker endpoint in the prod environment after CI/CD

Personas and interactions with different modules of the ML platform

Each module caters to particular target personas within specific divisions that utilize the module most often, granting them primary access. Secondary access is then permitted to other divisions that require occasional use of the modules. The modules are tailored towards the needs of particular job roles or personas to optimize functionality.

We discuss the following teams:

  • Central cloud engineering – This team operates at the enterprise cloud level across all workloads for setting up common cloud infrastructure services, such as setting up enterprise-level networking, identity, permissions, and account management
  • Data platform engineering – This team manages enterprise data lakes, data collection, data curation, and data governance
  • ML platform engineering – This team operates at the ML platform level across LOBs to provide shared ML infrastructure services such as ML infrastructure provisioning, experiment tracking, model governance, deployment, and observability

The following list details which divisions have primary and secondary access for each module, according to the module’s target personas:

  • Module 1: Multi-account foundations – Primary access: Central cloud engineering; Secondary access: Individual LOBs; Target personas: Cloud admin, cloud engineers; Number of accounts: Few
  • Module 2: Data lake foundations – Primary access: Central cloud or data platform engineering; Secondary access: Individual LOBs; Target personas: Data lake admin, data engineers; Number of accounts: Multiple
  • Module 3: ML platform services – Primary access: Central cloud or ML platform engineering; Secondary access: Individual LOBs; Target personas: ML platform admin, ML team lead, ML engineers, ML governance lead; Number of accounts: One
  • Module 4: ML use case development – Primary access: Individual LOBs; Secondary access: Central cloud or ML platform engineering; Target personas: Data scientists, data engineers, ML team lead, ML engineers; Number of accounts: Multiple
  • Module 5: ML operations – Primary access: Central cloud or ML engineering; Secondary access: Individual LOBs; Target personas: ML engineers, ML team leads, data scientists; Number of accounts: Multiple
  • Module 6: Centralized feature store – Primary access: Central cloud or data engineering; Secondary access: Individual LOBs; Target personas: Data engineers, data scientists; Number of accounts: One
  • Module 7: Logging and observability – Primary access: Central cloud engineering; Secondary access: Individual LOBs; Target personas: Cloud admin, IT auditors; Number of accounts: One
  • Module 8: Cost and reporting – Primary access: Individual LOBs; Secondary access: Central platform engineering; Target personas: LOB executives, ML managers; Number of accounts: One

Conclusion

In this post, we introduced a framework for governing the ML lifecycle at scale that helps you implement well-architected ML workloads embedding security and governance controls. We discussed how this framework takes a holistic approach for building an ML platform considering data governance, model governance, and enterprise-level controls. We encourage you to experiment with the framework and concepts introduced in this post and share your feedback.


About the authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He has double master’s degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Maira Ladeira Tanke is a Senior Data Specialist at AWS. As a technical lead, she helps customers accelerate their achievement of business value through emerging technology and innovative solutions. Maira has been with AWS since January 2020. Prior to that, she worked as a data scientist in multiple industries focusing on achieving business value from data. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Ryan Lempka is a Senior Solutions Architect at Amazon Web Services, where he helps his customers work backwards from business objectives to develop solutions on AWS. He has deep experience in business strategy, IT systems management, and data science. Ryan is dedicated to being a lifelong learner, and enjoys challenging himself every day to learn something new.

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Read More

How Meesho built a generalized feed ranker using Amazon SageMaker inference

How Meesho built a generalized feed ranker using Amazon SageMaker inference

This is a guest post co-written by Rama Badrinath, Divay Jindal, and Utkarsh Agrawal at Meesho.


Meesho is India’s fastest growing ecommerce company with a mission to democratize internet commerce for everyone and make it accessible to the next billion users of India. Meesho was founded in 2015 and today focuses on buyers and sellers across India. The Meesho marketplace provides micro, small, and medium businesses and individual entrepreneurs access to millions of customers, a selection from over 30 categories and more than 900 sub-categories, pan-India logistics, payment services, and customer support capabilities to efficiently run their businesses on the Meesho ecosystem.

As an ecommerce platform, Meesho aims to improve the user experience by offering personalized and relevant product recommendations. We wanted to create a generalized feed ranker that considers individual preferences and historical behavior to effectively display products in each user’s feed. Through this, we wanted to boost user engagement, conversion rates, and overall business growth by tailoring the shopping experience to each customer’s unique requirements and providing the best value for their money.

We used AWS machine learning (ML) services like Amazon SageMaker to develop a powerful generalized feed ranker (GFR). In this post, we discuss the key components of the GFR and how this ML-driven solution streamlined the ML lifecycle, ensuring efficient infra management, scalability, and reliability within the ecosystem.

Solution overview

To personalize users’ feeds, we analyzed extensive historical data, extracting insights into features that include browsing patterns and interests. These valuable features are used to construct ranking models. The GFR personalizes each user’s feed in real time, considering various factors like geography, prior shopping pattern, acquisition channels, and more. Several interaction-based features are also used to capture the affinity of the user towards an item, item category, or item properties like price, rating, or discount.

Several user-agnostic features and scores at the item level are used as well. These include an item popularity score and an item propensity-to-buy score. All these features go as input to the Learning to Rank (LTR) model, which predicts the Probability of Click (PCTR) and the Probability of Purchase (PCVR).

For diverse and relevant recommendations, the GFR sources candidate products from multiple channels, including exploit (known user preferences), explore (novel and potentially interesting products), popularity (trending items), and recent (latest additions).

The following diagram illustrates the GFR architecture.

The architecture can be divided into two different components: model training and model deployment. In the following sections, we discuss each component and the AWS services used in more detail.

Model training

Meesho used Amazon EMR with Apache Spark to process hundreds of millions of data points, depending on the model’s complexity. One of the major challenges was to run distributed training at scale. We used Dask—a distributed data science computing framework that natively integrates with Python libraries—on Amazon EMR to scale out the training jobs across the cluster. The distributed training of the model helped cut down training time from days to hours and allowed us to schedule Spark jobs efficiently and cost-effectively. We used an offline feature store to maintain a historical record of all feature values that will be used for model training. Model artifacts from training are stored in Amazon Simple Storage Service (Amazon S3), providing convenient access and version management.

We used a time sampling strategy to create training, validation, and test datasets for model training. We kept track of various metrics to evaluate the performance of the model—the most important ones being area under the ROC curve and area under the precision recall curve. We also tracked calibration of the model to prevent overconfidence and underconfidence issues while predicting the probability scores.

Model deployment

Meesho used SageMaker inference endpoints with auto scaling enabled for deploying the trained model. SageMaker offered ease of deployment with support for various ML frameworks, allowing models to be served with low latency. Although AWS offers standard inference images suitable for most use cases, we built a custom inference image that caters specifically to our needs and pushed it to Amazon Elastic Container Registry (Amazon ECR).
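
The following is a minimal sketch of how auto scaling can be enabled for an existing endpoint variant with Application Auto Scaling. The endpoint name, variant name, capacity limits, and invocation target are placeholders rather than Meesho’s actual settings.

import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "gfr-ranker-endpoint"   # placeholder endpoint name
variant_name = "AllTraffic"             # placeholder production variant
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Scale on invocations per instance using a target tracking policy
autoscaling.put_scaling_policy(
    PolicyName="gfr-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)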

We built an in-house A/B testing platform that facilitated live monitoring of A/B metrics, enabling us to make data-driven decisions promptly. We also used the A/B testing feature of SageMaker to deploy multiple production variants on an endpoint. Through A/B experiments, we observed an approximate 3.5% enhancement in the platform’s conversion rate and an increase in app open frequency of the users, highlighting the effectiveness of this approach.
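
With the SageMaker A/B testing feature mentioned above, multiple production variants can be attached to a single endpoint configuration with traffic weights. The following sketch is illustrative; the model names, instance types, and weights are placeholders.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Endpoint configuration with two production variants splitting traffic 90/10
sagemaker_client.create_endpoint_config(
    EndpointConfigName="gfr-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "gfr-model-a",
            "InstanceType": "ml.c5.2xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.9,
        },
        {
            "VariantName": "model-b",
            "ModelName": "gfr-model-b",
            "InstanceType": "ml.c5.2xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.1,
        },
    ],
)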

We kept track of various drifts such as feature drift and prior drift multiple times a day after model deployment to prevent the model performance from deteriorating.

We used AWS Lambda to set up various automations and triggers that are required during model retraining, endpoint updates, and monitoring processes.

The recommendation workflow after model deployment works as follows (as noted in the solution architecture diagram):

  1. The input requests with user context and interaction features are received at the application layer from Meesho’s mobile and web app.
  2. The application layer fetches additional features like historical data of the user from the online feature store and appends these to the input requests.
  3. The appended features are sent to the real-time endpoints for generating recommendations.
  4. The model predictions are sent back to the application layer.
  5. The application layer uses these predictions to personalize the user feeds on the mobile or web application.

Conclusion

Meesho successfully implemented a generalized feed ranker using SageMaker, which resulted in highly personalized product recommendations for each customer based on their preferences and historical behavior. This approach significantly improved user engagement and led to higher conversion rates, contributing to the company’s overall business growth. As a result of utilizing AWS services, our ML lifecycle runtime reduced significantly, from taking months to just weeks, leading to increased efficiency and productivity for our team.

With this advanced feed ranker, Meesho continues to deliver tailored shopping experiences, adding more value to its customers and fulfilling its mission to democratize ecommerce for everyone.

The team is grateful for the continuous support and guidance from Ravindra Yadav, Director of Data Science at Meesho, and Debdoot Mukherjee, Head of AI at Meesho, who played a key role in enabling this success.

To learn more about SageMaker, refer to the Amazon SageMaker Developer Guide.


About the Authors

Utkarsh Agrawal is currently working as a Senior Data Scientist at Meesho. He previously worked with Fractal Analytics and Trell on various domains, including recommender systems, time series, NLP, and more. He holds a master’s degree in Mathematics and Computing from Indian Institute of Technology Kharagpur (IIT), India.

Rama Badrinath is currently working as a Principal Data Scientist at Meesho. He previously worked with Microsoft and ShareChat on various domains, including recommender systems, image AI, NLP, and more. He holds a master’s degree in Machine Learning from Indian Institute of Science (IISc), India. He has also published papers in renowned conferences such as KDD and ECIR.

Divay Jindal is currently working as a Lead Data Scientist at Meesho. He previously worked with Bookmyshow on various domains, including recommender systems and dynamic pricing.

Venugopal Pai is a Solutions Architect at AWS. He lives in Bengaluru, India, and helps digital-native customers scale and optimize their applications on AWS.

Read More

Announcing Rekognition Custom Moderation: Enhance accuracy of pre-trained Rekognition moderation models with your data

Announcing Rekognition Custom Moderation: Enhance accuracy of pre-trained Rekognition moderation models with your data

Companies increasingly rely on user-generated images and videos for engagement. From ecommerce platforms encouraging customers to share product images to social media companies promoting user-generated videos and images, using user content for engagement is a powerful strategy. However, it can be challenging to ensure that this user-generated content is consistent with your policies and fosters a safe online community for your users.

Many companies currently depend on human moderators or respond reactively to user complaints to manage inappropriate user-generated content. These approaches don’t scale to effectively moderate millions of images and videos at sufficient quality or speed, which leads to a poor user experience, high costs to achieve scale, or even potential harm to brand reputation.

In this post, we discuss how to use the Custom Moderation feature in Amazon Rekognition to enhance the accuracy of your pre-trained content moderation API.

Content moderation in Amazon Rekognition

Amazon Rekognition is a managed artificial intelligence (AI) service that offers pre-trained and customizable computer vision capabilities to extract information and insights from images and videos. One such capability is Amazon Rekognition Content Moderation, which detects inappropriate or unwanted content in images and videos. Amazon Rekognition uses a hierarchical taxonomy to label inappropriate or unwanted content with 10 top-level moderation categories (such as violence, explicit, alcohol, or drugs) and 35 second-level categories. Customers across industries such as ecommerce, social media, and gaming can use content moderation in Amazon Rekognition to protect their brand reputation and foster safe user communities.

By using Amazon Rekognition for image and video moderation, human moderators have to review a much smaller set of content, typically 1–5% of the total volume, already flagged by the content moderation model. This enables companies to focus on more valuable activities and still achieve comprehensive moderation coverage at a fraction of their existing cost.

Introducing Amazon Rekognition Custom Moderation

You can now enhance the accuracy of the Rekognition moderation model for your business-specific data with the Custom Moderation feature. You can train a custom adapter with as few as 20 annotated images in less than 1 hour. These adapters extend the capabilities of the moderation model to detect images used for training with higher accuracy. For this post, we use a sample dataset containing both safe images and images with alcoholic beverages (considered unsafe) to enhance the accuracy of the alcohol moderation label.

The unique ID of the trained adapter can be provided to the existing DetectModerationLabels API operation to process images using this adapter. Each adapter can only be used by the AWS account that was used for training the adapter, ensuring that the data used for training remains safe and secure in that AWS account. With the Custom Moderation feature, you can tailor the Rekognition pre-trained moderation model for improved performance on your specific moderation use case, without any machine learning (ML) expertise. You can continue to enjoy the benefits of a fully managed moderation service with a pay-per-use pricing model for Custom Moderation.

Solution overview

Training a custom moderation adapter involves five steps that you can complete using the AWS Management Console or the API interface:

  1. Create a project
  2. Upload the training data
  3. Assign ground truth labels to images
  4. Train the adapter
  5. Use the adapter

workflow diagram

Let’s walk through these steps in more detail using the console.

Create a project

A project is a container to store your adapters. You can train multiple adapters within a project with different training datasets to assess which adapter performs best for your specific use case. To create your project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Moderation in the navigation pane.
  2. Choose Create project.

screenshot - list of tasks

  3. For Project name, enter a name for your project.
  4. For Adapter name, enter a name for your adapter.
  5. Optionally, enter a description for your adapter.

screenshot - create task

Upload training data

You can begin with as few as 20 sample images to adapt the moderation model to detect fewer false positives (images that are appropriate for your business but are flagged by the model with a moderation label). To reduce false negatives (images that are inappropriate for your business but don’t get flagged with a moderation label), you are required to start with 50 sample images.

You can select from the following options to provide the image datasets for adapter training:

Complete the following steps:

  1. For this post, select Import images from S3 bucket and enter your S3 URI.

screenshot - provide dataset

Like any ML training process, training a Custom Moderation adapter in Amazon Rekognition requires two separate datasets: one for training the adapter and another for evaluating the adapter. You can either upload a separate test dataset or choose to automatically split your training dataset for training and testing.

  2. For this post, select Autosplit.
  3. Select Enable auto-update to ensure that the system automatically retrains the adapter when a new version of the content moderation model is launched.
  4. Choose Create project.

screenshot - create project

Assign ground truth labels to images

If you uploaded unannotated images, you can use the Amazon Rekognition console to provide image labels as per the moderation taxonomy. In the following example, we train an adapter to detect hidden alcohol with higher accuracy, and label all such images with the label alcohol. Images not considered inappropriate can be labeled as Safe.

screenshot - label images

Train the adapter

After you label all the images, choose Start training to initiate the training process. Amazon Rekognition will use the uploaded image datasets to train an adapter model for enhanced accuracy on the specific type of images provided for training.

After the custom moderation adapter is trained, you can view all the adapter details (adapter ID, and the test and training manifest files) in the Adapter performance section.

The Adapter performance section displays improvements in false positives and false negatives when compared to the pre-trained moderation model. The adapter we trained to enhance the detection of the alcohol label reduces the false negative rate in test images by 73%. In other words, the adapter now accurately predicts the alcohol moderation label for 73% more images compared to the pre-trained moderation model. However, no improvement is observed in false positives, as no false positive samples were used for training.

screenshot - accuracy

Use the adapter

You can perform inference using the newly trained adapter to achieve enhanced accuracy. To do this, call the Amazon Rekognition DetectModerationLabels API with an additional parameter, ProjectVersion, which is the unique AdapterID of the trained adapter. The following is a sample command using the AWS Command Line Interface (AWS CLI):

aws rekognition detect-moderation-labels \
  --image 'S3Object={Bucket="<bucket>",Name="<key>"}' \
  --project-version <ARN of the Adapter> \
  --region us-east-1

The following is a sample code snippet using the Python Boto3 library:

import boto3
client = boto3.client('rekognition')
response = client.detect_moderation_labels(
    Image={
        "S3Object":{
            "Bucket":"<bucket>",
            "Name":"<key>"
        }
    }, 
    ProjectVersion="<ARN of the Adapter>"
)
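
The response has the same structure as the standard DetectModerationLabels response, so the returned labels can be processed as usual. For example (the confidence threshold here is illustrative):

# Print moderation labels detected above an illustrative confidence threshold
for label in response["ModerationLabels"]:
    if label["Confidence"] >= 60:
        print(f'{label["ParentName"]}/{label["Name"]}: {label["Confidence"]:.1f}%')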

Best practices for training

To maximize the performance of your adapter, the following best practices are recommended for training the adapter:

  • The sample image data should capture the representative errors that you want to improve the moderation model accuracy for
  • Instead of only bringing in error images for false positives and false negatives, you can also provide true positives and true negatives for improved performance
  • Supply as many annotated images as possible for training

Conclusion

In this post, we presented an in-depth overview of the new Amazon Rekognition Custom Moderation feature. Furthermore, we detailed the steps for performing training using the console, including best practices for optimal results. For additional information, visit the Amazon Rekognition console and explore the Custom Moderation feature.

Amazon Rekognition Custom Moderation is now generally available in all AWS Regions where Amazon Rekognition is available.

Learn more about content moderation on AWS. Take the first step towards streamlining your content moderation operations with AWS.


About the Authors

Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.

Aakash Deep is a Software Development Engineering Manager based in Seattle. He enjoys working on computer vision, AI, and distributed systems. His mission is to enable customers to address complex problems and create value with Amazon Rekognition. Outside of work, he enjoys hiking and traveling.

Lana Zhang is a Senior Solutions Architect on the AWS WWSO AI Services team, specializing in AI and ML for content moderation, computer vision, natural language processing, and generative AI. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, and advertising & marketing.

Read More

Defect detection in high-resolution imagery using two-stage Amazon Rekognition Custom Labels models

Defect detection in high-resolution imagery using two-stage Amazon Rekognition Custom Labels models

High-resolution imagery is very prevalent in today’s world, from satellite imagery to drones and DSLR cameras. From this imagery, we can capture damage due to natural disasters, anomalies in manufacturing equipment, or very small defects such as defects on printed circuit boards (PCBs) or semiconductors. Building anomaly detection models using high-resolution imagery can be challenging because modern computer vision models typically resize images to a lower resolution to fit into memory for training and running inference. Reducing the image resolution significantly means that visual information relating to the defect is degraded or completely lost.

One approach to overcome these challenges is to build two-stage models. Stage 1 models detect a region of interest, and Stage 2 models detect defects on the cropped region of interest, thereby maintaining sufficient resolution for small defects.

In this post, we go over how to build an effective two-stage defect detection system using Amazon Rekognition Custom Labels and compare results for this specific use case with one-stage models. Note that several one-stage models are effective even at lower or resized image resolutions, and others may accommodate large images in smaller batches.

Solution overview

For our use case, we use a dataset of images of PCBs with synthetically generated missing hole pins, as shown in the following example.

We use this dataset to demonstrate that a one-stage approach using object detection results in subpar detection performance for the missing hole pin defects. A two-stage model is preferred, in which we use Rekognition Custom Labels first for object detection to identify the pins and then a second-stage model to classify cropped images of the pins into pins with missing holes or normal pins.
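
To make the two-stage approach concrete, the following sketch shows how inference might be chained at prediction time, assuming both Rekognition Custom Labels models have already been trained and started. The project version ARNs, bucket, label names, and confidence threshold are placeholders.

import io
import boto3
from PIL import Image

rekognition = boto3.client("rekognition")
s3 = boto3.client("s3")

# Placeholder ARNs for the two trained Rekognition Custom Labels models
PIN_DETECTOR_ARN = "arn:aws:rekognition:us-east-1:111122223333:project/pcb-pins/version/pcb-pins.1/1"
DEFECT_CLASSIFIER_ARN = "arn:aws:rekognition:us-east-1:111122223333:project/missing-hole/version/missing-hole.1/1"

def detect_missing_holes(bucket: str, key: str, min_confidence: float = 50.0):
    # Stage 1: detect pin bounding boxes on the (cropped) PCB image
    stage1 = rekognition.detect_custom_labels(
        ProjectVersionArn=PIN_DETECTOR_ARN,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )

    # Download the image once so each detected pin can be cropped locally
    image_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(io.BytesIO(image_bytes))
    width, height = image.size

    defects = []
    for label in stage1["CustomLabels"]:
        box = label.get("Geometry", {}).get("BoundingBox")
        if not box:
            continue
        left, top = int(box["Left"] * width), int(box["Top"] * height)
        right = left + int(box["Width"] * width)
        bottom = top + int(box["Height"] * height)
        pin_crop = image.crop((left, top, right, bottom)).convert("RGB")

        # Stage 2: classify the zoomed-in pin crop as missing_hole vs. pin
        buffer = io.BytesIO()
        pin_crop.save(buffer, format="JPEG")
        stage2 = rekognition.detect_custom_labels(
            ProjectVersionArn=DEFECT_CLASSIFIER_ARN,
            Image={"Bytes": buffer.getvalue()},
            MinConfidence=min_confidence,
        )
        if any(result["Name"] == "missing_hole" for result in stage2["CustomLabels"]):
            defects.append((left, top, right, bottom))
    return defects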

The training process for a Rekognition Custom Labels model consists of several steps, as illustrated in the following diagram.

First, we use Amazon Simple Storage Service (Amazon S3) to store the image data. The data is ingested in Amazon SageMaker Jupyter notebooks, where typically a data scientist will inspect the images and preprocess them, removing any images that are of poor quality such as blurred images or poor lighting conditions, and resize or crop the images. Then the data is split into training and test sets, and Amazon SageMaker Ground Truth labeling jobs are run to label the sets of images and output a train and test manifest file. The manifest files are used by Rekognition Custom Labels for training.

One-stage model approach

The first approach we take to identifying missing holes on the PCB is to label the missing holes and train an object detection model to identify the missing holes. The following is an image example from the dataset.

We train a model on a dataset with 95 training images and 20 test images. The following table summarizes our results.

Evaluation Results
  F1 Score: 0.468 | Average Precision: 0.750 | Overall Recall: 0.340
  Training Time: 1.791 hours | Training Dataset: 1 label, 95 images | Testing Dataset: 1 label, 20 images
Per Label Performance
  missing_hole: F1 Score 0.468 | Test Images 20 | Precision 0.750 | Recall 0.340 | Assumed Threshold 0.053

The resulting model has high precision but low recall, meaning that when we localize a region for a missing hole, we’re usually correct, but we miss many of the missing holes present on the PCB. To build an effective defect detection system, we need to improve recall. The low performance of this model may be because the defects appear very small in the high-resolution image of the PCB and the model has no reference for what a healthy pin looks like.
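
As a quick sanity check on the reported metrics, the F1 score is the harmonic mean of precision and recall:

F1 = 2 × 0.750 × 0.340 / (0.750 + 0.340) ≈ 0.468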

Next, we explore splitting the image into four or six crops depending on the PCB size and labeling both healthy and missing holes. The following is an example of the resulting cropped image.
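
Splitting a board image into a fixed grid of crops can be done with a small helper. The following Pillow-based sketch assumes a simple non-overlapping 2×2 or 2×3 grid; the exact cropping used in our experiments may differ:

from PIL import Image

def tile_image(path, rows=2, cols=2):
    """Split a high-resolution image into rows x cols non-overlapping crops."""
    img = Image.open(path)
    width, height = img.size
    tile_w, tile_h = width // cols, height // rows
    crops = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            crops.append(img.crop(box))
    return crops

# Four crops for a roughly square board, six (2x3) for a wider one
crops = tile_image('pcb_board.png', rows=2, cols=2)
for i, crop in enumerate(crops):
    crop.save(f'pcb_board_crop_{i}.png')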

We train a model with 524 images used for training and 106 images used for testing. We keep the same PCBs in the train and test splits as in the full-board model. The results for cropped healthy pins vs. missing holes are shown in the following table.

Evaluation Results
  F1 Score: 0.967 | Average Precision: 0.989 | Overall Recall: 0.945
  Training Time: 2.118 hours | Training Dataset: 2 labels, 524 images | Testing Dataset: 2 labels, 106 images
Per Label Performance
  missing_hole: F1 Score 0.949 | Test Images 42 | Precision 0.980 | Recall 0.920 | Assumed Threshold 0.536
  pin: F1 Score 0.984 | Test Images 106 | Precision 0.998 | Recall 0.970 | Assumed Threshold 0.696

Both precision and recall have improved significantly. Training the model on zoomed-in cropped images, with healthy pins included as a reference class, helped. However, recall for missing holes is still 92%, meaning we would miss 8% of them and let those defects go by unnoticed.

Next, we explore a two-stage model approach in which we can improve the model performance further.

Two-stage model approach

For the two-stage model, we train two models: one for detecting pins and one for detecting if the pin is missing or not on zoomed-in cropped images of the pin. The following is an image from the pin detection dataset.

The data is similar to our previous experiment, in which we cropped the PCB into four or six cropped images. This time, we label all pins and don’t make any distinctions if the pin has a missing hole or not. We train this model with 522 images and test with 108 images, maintaining the same train/test split as previous experiments. The results are shown in the following table.

Evaluation Results
  F1 Score: 1.000 | Average Precision: 0.999 | Overall Recall: 1.000
  Training Time: 1.581 hours | Training Dataset: 1 label, 522 images | Testing Dataset: 1 label, 108 images
Per Label Performance
  pin: F1 Score 1.000 | Test Images 108 | Precision 0.999 | Recall 1.000 | Assumed Threshold 0.617

The model detects the pins perfectly on this synthetic dataset.

Next, we build the model to make the distinction for missing holes. We use cropped images of the holes to train the second stage of the model, as shown in the following examples. This model is separate from the previous models because it’s a classification model and will be focused on the narrow task of determining if the pin has a missing hole.

We train this second-stage model on 16,624 images and test on 3,266 images, maintaining the same train/test splits as the previous experiments. The following table summarizes our results.

Evaluation Results
  F1 Score: 1.000 | Average Precision: 1.000 | Overall Recall: 1.000
  Training Time: 6.660 hours | Training Dataset: 2 labels, 16,624 images | Testing Dataset: 2 labels, 3,266 images
Per Label Performance
  anomaly: F1 Score 1.000 | Test Images 88 | Precision 1.000 | Recall 1.000 | Assumed Threshold 0.960
  normal: F1 Score 1.000 | Test Images 3,178 | Precision 1.000 | Recall 1.000 | Assumed Threshold 0.996

Again, we receive perfect precision and recall on this synthetic dataset. Combining the previous pin detection model with this second-stage missing hole classification model, we can build a model that outperforms any single-stage model.

The following table summarizes the experiments we conducted.

Experiment 1, one-stage model: Object detection model to detect missing holes on full images | F1 Score 0.468 | Precision 0.75 | Recall 0.34
Experiment 2, one-stage model: Object detection model to detect healthy pins and missing holes on cropped images | F1 Score 0.967 | Precision 0.989 | Recall 0.945
Experiment 3, two-stage model:
  Stage 1: Object detection on all pins | F1 Score 1.000 | Precision 0.999 | Recall 1.000
  Stage 2: Image classification of healthy pin or missing holes | F1 Score 1.000 | Precision 1.000 | Recall 1.000
  End-to-end average | F1 Score 1.000 | Precision 0.9995 | Recall 1.000

Inference pipeline

You can use the following architecture to deploy the one-stage and two-stage models that we described in this post. The main components involved are Amazon API Gateway, AWS Lambda, and the trained Rekognition Custom Labels model endpoints.

For one-stage models, you can send an input image to the API Gateway endpoint, followed by Lambda for any basic image preprocessing, and route to the Rekognition Custom Labels trained model endpoint. In our experiments, we explored one-stage models that detect only missing holes, as well as models that detect both missing holes and healthy pins.

For two-stage models, you can similarly send an image to the API Gateway endpoint, followed by Lambda. Lambda acts as an orchestrator that first calls the object detection model (trained using Rekognition Custom Labels), which generates the region of interest. The original image is then cropped in the Lambda function, and sent to another Rekognition Custom Labels classification model for detecting defects in each cropped image.
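
The following is a hedged sketch of what such a Lambda orchestrator might look like. The project version ARNs, the event format, and the confidence threshold are placeholders, and Pillow would need to be packaged with the function (for example, as a Lambda layer):

import base64
import io

import boto3
from PIL import Image

rekognition = boto3.client('rekognition')

# Placeholder ARNs for the two trained Rekognition Custom Labels model versions
PIN_DETECTOR_ARN = 'arn:aws:rekognition:region:account:project/pins/version/v1/...'
DEFECT_CLASSIFIER_ARN = 'arn:aws:rekognition:region:account:project/defects/version/v1/...'

def handler(event, context):
    """Stage 1: detect pins. Stage 2: classify each cropped pin."""
    image_bytes = base64.b64decode(event['image'])  # assumes a base64-encoded image in the request
    img = Image.open(io.BytesIO(image_bytes))
    width, height = img.size

    pins = rekognition.detect_custom_labels(
        ProjectVersionArn=PIN_DETECTOR_ARN,
        Image={'Bytes': image_bytes},
        MinConfidence=50,
    )['CustomLabels']

    results = []
    for pin in pins:
        box = pin['Geometry']['BoundingBox']
        left = int(box['Left'] * width)
        top = int(box['Top'] * height)
        crop = img.crop((left, top,
                         left + int(box['Width'] * width),
                         top + int(box['Height'] * height)))

        buf = io.BytesIO()
        crop.save(buf, format='PNG')

        # Classify the cropped pin as anomaly (missing hole) or normal
        labels = rekognition.detect_custom_labels(
            ProjectVersionArn=DEFECT_CLASSIFIER_ARN,
            Image={'Bytes': buf.getvalue()},
        )['CustomLabels']
        best = max(labels, key=lambda l: l['Confidence']) if labels else None
        results.append({'box': box, 'label': best['Name'] if best else 'unknown'})

    return {'pins': results}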

Conclusion

In this post, we trained one- and two-stage models to detect missing holes in PCBs using Rekognition Custom Labels. We reported results for various models; in our case, two-stage models outperformed other variants. We encourage customers with high-resolution imagery from other domains to test model performance with one- and two-stage models. Additionally, consider the following ways to expand the solution:

  • Sliding window crops for your actual datasets
  • Reusing your object detection models in the same pipeline
  • Pre-labeling workflows using bounding box predictions

About the authors

Andreas Karagounis is a Data Science Manager at Accenture. He holds a masters in Computer Science from Brown University. He has a background in computer vision and works with customers to solve their business challenges using data science and machine learning.

Yogesh Chaturvedi is a Principal Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Shreyas Subramanian is a Principal Data Scientist who helps customers solve their business challenges with machine learning on the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning for accelerating optimization tasks.

Selimcan “Can” Sakar is a cloud-first developer and Solutions Architect at AWS Accenture Business Group with a focus on emerging technologies such as GenAI, ML, and blockchain. When he isn’t watching models converge, he can be seen biking or playing the clarinet.

Read More

Automatically redact PII for machine learning using Amazon SageMaker Data Wrangler

Automatically redact PII for machine learning using Amazon SageMaker Data Wrangler

Customers increasingly want to use deep learning approaches such as large language models (LLMs) to automate the extraction of data and insights. For many industries, data that is useful for machine learning (ML) may contain personally identifiable information (PII). To ensure customer privacy and maintain regulatory compliance while training, fine-tuning, and using deep learning models, it’s often necessary to first redact PII from source data.

This post demonstrates how to use Amazon SageMaker Data Wrangler and Amazon Comprehend to automatically redact PII from tabular data as part of your machine learning operations (ML Ops) workflow.

Problem: ML data that contains PII

PII is defined as any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. PII is information that either directly identifies an individual (name, address, social security number or other identifying number or code, telephone number, email address, and so on) or information that an agency intends to use to identify specific individuals in conjunction with other data elements, namely, indirect identification.

Customers in business domains such as financial, retail, legal, and government deal with PII data on a regular basis. Due to various government regulations and rules, customers have to find a mechanism to handle this sensitive data with appropriate security measures to avoid regulatory fines, possible fraud, and defamation. PII redaction is the process of masking or removing sensitive information from a document so it can be used and distributed, while still protecting confidential information.

Businesses need to deliver delightful customer experiences and better business outcomes by using ML. Redaction of PII data is often a key first step to unlock the larger and richer data streams needed to use or fine-tune generative AI models, without worrying about whether their enterprise data (or that of their customers) will be compromised.

Solution overview

This solution uses Amazon Comprehend and SageMaker Data Wrangler to automatically redact PII data from a sample dataset.

Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover insights and relationships in unstructured data, with no infrastructure to manage and no ML experience required. It provides functionality to locate various PII entity types within text, for example names or credit card numbers. Although the latest generative AI models have demonstrated some PII redaction capability, they generally don’t provide a confidence score for PII identification or structured data describing what was redacted. The PII functionality of Amazon Comprehend returns both, enabling you to create redaction workflows that are fully auditable at scale. Additionally, using Amazon Comprehend with AWS PrivateLink means that customer data never leaves the AWS network and is continuously secured with the same data access and privacy controls as the rest of your applications.
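
For reference, a single synchronous call to this capability looks like the following. The request uses the real DetectPiiEntities API; the commented response values are illustrative only:

import boto3

comprehend = boto3.client('comprehend')

response = comprehend.detect_pii_entities(
    Text='Please email jane.doe@example.com about invoice 1234.',
    LanguageCode='en',
)

# Each detected entity includes a type, a confidence score, and character offsets,
# for example: {'Type': 'EMAIL', 'Score': 0.99, 'BeginOffset': 13, 'EndOffset': 33}
for entity in response['Entities']:
    print(entity['Type'], entity['Score'], entity['BeginOffset'], entity['EndOffset'])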

Similar to Amazon Comprehend, Amazon Macie uses a rules-based engine to identify sensitive data (including PII) stored in Amazon Simple Storage Service (Amazon S3). However, its rules-based approach relies on having specific keywords that indicate sensitive data located close to that data (within 30 characters). In contrast, the NLP-based ML approach of Amazon Comprehend uses semantic understanding of longer chunks of text to identify PII, making it more useful for finding PII within unstructured data.

Additionally, for tabular data such as CSV or plain text files, Macie returns less detailed location information than Amazon Comprehend (either a row/column indicator or a line number, respectively, but not start and end character offsets). This makes Amazon Comprehend particularly helpful for redacting PII from unstructured text that may contain a mix of PII and non-PII words (for example, support tickets or LLM prompts) that is stored in a tabular format.

Amazon SageMaker provides purpose-built tools for ML teams to automate and standardize processes across the ML lifecycle. With SageMaker MLOps tools, teams can easily prepare, train, test, troubleshoot, deploy, and govern ML models at scale, boosting productivity of data scientists and ML engineers while maintaining model performance in production. The following diagram illustrates the SageMaker MLOps workflow.

SageMaker Pipelines

SageMaker Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze datasets stored in locations such as Amazon S3 or Amazon Athena, a common first step in the ML lifecycle. You can use SageMaker Data Wrangler to simplify and streamline dataset preprocessing and feature engineering by either using built-in, no-code transformations or customizing with your own Python scripts.

Using Amazon Comprehend to redact PII as part of a SageMaker Data Wrangler data preparation workflow keeps all downstream uses of the data, such as model training or inference, in alignment with your organization’s PII requirements. You can integrate SageMaker Data Wrangler with Amazon SageMaker Pipelines to automate end-to-end ML operations, including data preparation and PII redaction. For more details, refer to Integrating SageMaker Data Wrangler with SageMaker Pipelines. The rest of this post demonstrates a SageMaker Data Wrangler flow that uses Amazon Comprehend to redact PII from text stored in tabular data format.

This solution uses a public synthetic dataset along with a custom SageMaker Data Wrangler flow, available as a file in GitHub. The steps to use the SageMaker Data Wrangler flow to redact PII are as follows:

  1. Open SageMaker Studio.
  2. Download the SageMaker Data Wrangler flow.
  3. Review the SageMaker Data Wrangler flow.
  4. Add a destination node.
  5. Create a SageMaker Data Wrangler export job.

This walkthrough, including running the export job, should take 20–25 minutes to complete.

Prerequisites

For this walkthrough, you should have the following:

Open SageMaker Studio

To open SageMaker Studio, complete the following steps:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. Choose the domain and user profile.
  3. Choose Open Studio.

To get started with the new capabilities of SageMaker Data Wrangler, we recommend upgrading to the latest release.

Download the SageMaker Data Wrangler flow

You first need to retrieve the SageMaker Data Wrangler flow file from GitHub and upload it to SageMaker Studio. Complete the following steps:

  1. Navigate to the SageMaker Data Wrangler redact-pii.flow file on GitHub.
  2. On GitHub, choose the download icon to download the flow file to your local computer.
  3. In SageMaker Studio, choose the file icon in the navigation pane.
  4. Choose the upload icon, then choose redact-pii.flow.

Upload Data Wrangler Flow

Review the SageMaker Data Wrangler flow

In SageMaker Studio, open redact-pii.flow. After a few minutes, the flow will finish loading and show the flow diagram (see the following screenshot). The flow contains six steps: an S3 Source step followed by five transformation steps.

Data Wrangler Flow steps

On the flow diagram, choose the last step, Redact PII. The All Steps pane opens on the right and shows a list of the steps in the flow. You can expand each step to view details, change parameters, and potentially add custom code.

Data Wrangler Flow step details

Let’s walk through each step in the flow.

Steps 1 (S3 Source) and 2 (Data types) are added by SageMaker Data Wrangler whenever data is imported for a new flow. In S3 Source, the S3 URI field points to the sample dataset, which is a CSV file stored in Amazon S3. The file contains roughly 116,000 rows, and the flow sets the value of the Sampling field to 1,000, which means that SageMaker Data Wrangler will sample 1,000 rows to display in the user interface. Data types sets the data type for each column of imported data.

Step 3 (Sampling) sets the number of rows SageMaker Data Wrangler will sample for an export job to 5,000, via the Approximate sample size field. Note that this is different from the number of rows sampled to display in the user interface (Step 1). To export data with more rows, you can increase this number or remove Step 3.

Steps 4, 5, and 6 use SageMaker Data Wrangler custom transforms. Custom transforms allow you to run your own Python or SQL code within a Data Wrangler flow. The custom code can be written in four ways:

  • In SQL, using PySpark SQL to modify the dataset
  • In Python, using a PySpark data frame and libraries to modify the dataset
  • In Python, using a pandas data frame and libraries to modify the dataset
  • In Python, using a user-defined function to modify a column of the dataset

The Python (pandas) approach requires your dataset to fit into memory and can only be run on a single instance, limiting its ability to scale efficiently. When working in Python with larger datasets, we recommend using either the Python (PySpark) or Python (user-defined function) approach. SageMaker Data Wrangler optimizes Python user-defined functions to provide performance similar to an Apache Spark plugin, without needing to know PySpark or Pandas. To make this solution as accessible as possible, this post uses a Python user-defined function written in pure Python.
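
For orientation, the Python (user-defined function) option gives you a custom_func that receives a pandas Series and must return a Series of the same length (as described for Step 6 below). The following is a trivial sketch of that shape, with an illustrative transformation:

import pandas as pd

def custom_func(series: pd.Series) -> pd.Series:
    # Illustration only: trim whitespace and collapse repeated internal spaces
    return series.fillna('').str.strip().str.replace(r'\s+', ' ', regex=True)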

Expand Step 4 (Make PII column) to see its details. This step combines different types of PII data from multiple columns into a single phrase that is saved in a new column, pii_col. The following table shows an example row containing data.

  customer_name: Katie | customer_job: Journalist | billing_address: 19009 Vang Squares Suite 805 | customer_email: hboyd@gmail.com

This is combined into the phrase “Katie is a Journalist who lives at 19009 Vang Squares Suite 805 and can be emailed at hboyd@gmail.com”. The phrase is saved in pii_col, which this post uses as the target column to redact.
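
A sketch of the kind of transformation Step 4 performs, assuming the dataset is exposed as a pandas DataFrame named df (as in a Python (pandas) custom transform); the actual flow step may be implemented differently:

# Combine the PII-bearing columns into a single phrase in a new column
df['pii_col'] = (
    df['customer_name'] + ' is a ' + df['customer_job']
    + ' who lives at ' + df['billing_address']
    + ' and can be emailed at ' + df['customer_email']
)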

Step 5 (Prep for redaction) takes a column to redact (pii_col) and creates a new column (pii_col_prep) that is ready for efficient redaction using Amazon Comprehend. To redact PII from a different column, you can change the Input column field of this step.

There are two factors to consider to efficiently redact data using Amazon Comprehend:

  • The cost to detect PII is defined on a per-unit basis, where 1 unit = 100 characters, with a 3-unit minimum charge for each document. Because tabular data often contains small amounts of text per cell, it’s generally more time- and cost-efficient to combine text from multiple cells into a single document to send to Amazon Comprehend (see the worked example after this list). Doing this avoids the accumulation of overhead from many repeated function calls and ensures that the data sent is always greater than the 3-unit minimum.
  • Because we’re doing redaction as one step of a SageMaker Data Wrangler flow, we will be calling Amazon Comprehend synchronously. Amazon Comprehend sets a 100 KB (100,000 character) limit per synchronous function call, so we need to ensure that any text we send is under that limit.
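
To make the cost point concrete with illustrative numbers: redacting 10 cells of 60 characters each as separate documents is billed at the 3-unit minimum per document, or 30 units in total, whereas combining them into a single document of roughly 600 characters (plus a few delimiter characters) is billed at about 6 units.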

Given these factors, Step 5 prepares the data to send to Amazon Comprehend by appending a delimiter string to the end of the text in each cell. For the delimiter, you can use any string that doesn’t occur in the column being redacted (ideally, one that is as few characters as possible, because they’re included in the Amazon Comprehend character total). Adding this cell delimiter allows us to optimize the call to Amazon Comprehend, and will be discussed further in Step 6.

Note that if the text in any individual cell is longer than the Amazon Comprehend limit, the code in this step truncates it to 100,000 characters (roughly equivalent to 15,000 words or 30 single-spaced pages). Although this amount of text is unlikely to be stored in a single cell, you can modify the transformation code to handle this edge case another way if needed.
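
A hedged sketch of the preparation in Step 5, written as a user-defined function; the delimiter value and the exact truncation behavior in the downloadable flow may differ:

import pandas as pd

CELL_DELIM = '<<CELL>>'          # assumed delimiter; must not occur in the column being redacted
COMPREHEND_MAX_CHARS = 100000    # synchronous Amazon Comprehend limit per call

def custom_func(series: pd.Series) -> pd.Series:
    # Truncate oversized cells, then append the delimiter so cells can be packed
    # into larger documents and split apart again after redaction
    prepped = series.fillna('').str.slice(0, COMPREHEND_MAX_CHARS - len(CELL_DELIM))
    return prepped + CELL_DELIM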

Step 6 (Redact PII) takes a column name to redact as input (pii_col_prep) and saves the redacted text to a new column (pii_redacted). When you use a Python custom function transform, SageMaker Data Wrangler defines an empty custom_func that takes a pandas series (a column of text) as input and returns a modified pandas series of the same length. The following screenshot shows part of the Redact PII step.

Data Wrangler custom function redaction code

The function custom_func contains two helper (inner) functions:

  • make_text_chunks – This function does the work of concatenating text from individual cells in the series (including their delimiters) into longer strings (chunks) to send to Amazon Comprehend.
  • redact_pii – This function takes text as input, calls Amazon Comprehend to detect PII, redacts any that is found, and returns the redacted text. Redaction is done by replacing any PII text with the type of PII found in square brackets; for example, John Smith would be replaced with [NAME]. You can modify this function to replace PII with any string, including the empty string (“”) to remove it. You could also modify the function to check the confidence score of each PII entity and only redact if it’s above a specific threshold. (A sketch of both helper functions follows this list.)
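
The exact implementation ships in the downloadable flow file; the following is a hedged sketch of what the two helpers might look like, shown at module level for readability and assuming the CELL_DELIM and COMPREHEND_MAX_CHARS conventions from Step 5 and the DetectPiiEntities API:

import boto3

comprehend = boto3.client('comprehend')

def make_text_chunks(series, max_chars):
    """Greedily pack the delimiter-terminated cell values into chunks under max_chars."""
    chunks, current = [], ''
    for cell in series:
        if current and len(current) + len(cell) > max_chars:
            chunks.append(current)
            current = ''
        current += cell
    if current:
        chunks.append(current)
    return chunks

def redact_pii(text):
    """Replace each detected PII span with its entity type in square brackets."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode='en')['Entities']
    # Apply replacements from the end of the string so earlier offsets stay valid
    for entity in sorted(entities, key=lambda e: e['BeginOffset'], reverse=True):
        text = text[:entity['BeginOffset']] + '[' + entity['Type'] + ']' + text[entity['EndOffset']:]
    return text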

After the inner functions are defined, custom_func uses them to do the redaction, as shown in the following code excerpt. When the redaction is complete, it converts the chunks back into original cells, which it saves in the pii_redacted column.

# concatenate text from cells into longer chunks
chunks = make_text_chunks(series, COMPREHEND_MAX_CHARS)

redacted_chunks = []
# call Comprehend once for each chunk, and redact
for text in chunks:
  redacted_text = redact_pii(text)
  redacted_chunks.append(redacted_text)
  
# join all redacted chunks into one text string
redacted_text = ''.join(redacted_chunks)

# split back to list of the original rows
redacted_rows = redacted_text.split(CELL_DELIM)

Add a destination node

To see the result of your transformations, SageMaker Data Wrangler supports exporting to Amazon S3, SageMaker Pipelines, Amazon SageMaker Feature Store, and Python code. To export the redacted data to Amazon S3, we first need to create a destination node:

  1. In the SageMaker Data Wrangler flow diagram, choose the plus sign next to the Redact PII step.
  2. Choose Add destination, then choose Amazon S3.
  3. Provide an output name for your transformed dataset.
  4. Browse or enter the S3 location to store the redacted data file.
  5. Choose Add destination.

You should now see the destination node at the end of your data flow.

Create a SageMaker Data Wrangler export job

Now that the destination node has been added, we can create the export job to process the dataset:

  1. In SageMaker Data Wrangler, choose Create job.
  2. The destination node you just added should already be selected. Choose Next.
  3. Accept the defaults for all other options, then choose Run.

This creates a SageMaker Processing job. To view the status of the job, navigate to the SageMaker console. In the navigation pane, expand the Processing section and choose Processing jobs. Redacting all 116,000 cells in the target column using the default export job settings (two ml.m5.4xlarge instances) takes roughly 8 minutes and costs approximately $0.25. When the job is complete, download the output file with the redacted column from Amazon S3.
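
If you prefer to monitor the job programmatically instead of in the console, the following is a minimal boto3 sketch (the job name is whatever SageMaker assigned to your export job):

import boto3

sagemaker = boto3.client('sagemaker')

# List the most recent processing jobs and print their status
jobs = sagemaker.list_processing_jobs(
    SortBy='CreationTime', SortOrder='Descending', MaxResults=5
)
for job in jobs['ProcessingJobSummaries']:
    print(job['ProcessingJobName'], job['ProcessingJobStatus'])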

Clean up

The SageMaker Data Wrangler application runs on an ml.m5.4xlarge instance. To shut it down, in SageMaker Studio, choose Running Terminals and Kernels in the navigation pane. In the RUNNING INSTANCES section, find the instance labeled Data Wrangler and choose the shutdown icon next to it. This shuts down the SageMaker Data Wrangler application running on the instance.

Conclusion

In this post, we discussed how to use custom transformations in SageMaker Data Wrangler and Amazon Comprehend to redact PII data from your ML dataset. You can download the SageMaker Data Wrangler flow and start redacting PII from your tabular data today.

For other ways to enhance your MLOps workflow using SageMaker Data Wrangler custom transformations, check out Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy. For more data preparation options, check out the blog post series that explains how to use Amazon Comprehend to redact, translate, and analyze text from either Amazon Athena or Amazon Redshift.


About the Authors

Tricia Jamison is a Senior Prototyping Architect on the AWS Prototyping and Cloud Acceleration (PACE) Team, where she helps AWS customers implement innovative solutions to challenging problems with machine learning, internet of things (IoT), and serverless technologies. She lives in New York City and enjoys basketball, long distance treks, and staying one step ahead of her children.

Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

Read More