Build an intelligent search solution with automated content enrichment

Unstructured data belonging to the enterprise continues to grow, making it a challenge for customers and employees to get the information they need. Amazon Kendra is a highly accurate intelligent search service powered by machine learning (ML). It helps you easily find the content you’re looking for, even when it’s scattered across multiple locations and content repositories.

Amazon Kendra leverages deep learning and reading comprehension to deliver precise answers. It offers natural language search for a user experience that’s like interacting with a human expert. When documents don’t have a clear answer or if the question is ambiguous, Amazon Kendra returns a list of the most relevant documents for the user to choose from.

To help narrow down a list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities, for an experience similar to the Amazon.com retail site where you’re presented with filtering options on the left side of the webpage. But what if the original documents have no metadata, or users have a preference for how this information is categorized? You can automatically generate metadata using ML in order to enrich the content and make it easier to search and discover.

This post outlines how you can automate and simplify metadata generation using Amazon Comprehend Medical, a natural language processing (NLP) service that uses ML to find insights related to healthcare and life sciences (HCLS) such as medical entities and relationships in unstructured medical text. The metadata generated is then ingested as custom attributes alongside documents into an Amazon Kendra index. For repositories with documents containing generic information or information related to domains other than HCLS, you can use a similar approach with Amazon Comprehend to automate metadata generation.

To demonstrate an intelligent search solution with enriched data, we use Wikipedia pages of the medicines listed in the World Health Organization (WHO) Model List of Essential Medicines. We combine this content with metadata automatically generated using Amazon Comprehend Medical, into a unified Amazon Kendra index to make it searchable. You can visit the search application and try asking it some questions of your own, such as “What is the recommended paracetamol dose for an adult?” The following screenshot shows the results.

Solution overview

We take a two-step approach to custom content enrichment during the content ingestion process:

Identify the metadata for each document using Amazon Comprehend Medical.
Ingest the document along with the metadata in the search solution based on an Amazon Kendra index.

Amazon Comprehend Medical uses NLP to extract medical insights about the content of documents by extracting medical entities such as medication, medical condition, anatomical location, the relationships between entities such as route and medication, and traits such as negation. In this example, for the Wikipedia page of each medicine from the WHO Model List of Essential Medicines, we use the DetectEntitiesV2 operation of Amazon Comprehend Medical to detect the entities in the categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. We use these entities to generate the document metadata.
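As a rough illustration of what the DetectEntitiesV2 output looks like, the following Python sketch calls the operation through boto3 and groups the returned entity text by category. The helper names, the default Region, and the sample entities (including their scores) are hypothetical and are not taken from the meds.py script in the package.

```python
from collections import defaultdict


def detect_entities(text, region_name="us-east-1"):
    """Call DetectEntitiesV2 on one piece of text (Region is an assumption)."""
    import boto3  # imported lazily so the helper below runs without AWS access

    client = boto3.client("comprehendmedical", region_name=region_name)
    return client.detect_entities_v2(Text=text)["Entities"]


def group_by_category(entities):
    """Group the text of each detected entity under its Category."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["Category"]].append(entity["Text"])
    return dict(grouped)


# Hypothetical sample shaped like the "Entities" list in the response.
sample_entities = [
    {"Text": "paracetamol", "Category": "MEDICATION", "Type": "GENERIC_NAME", "Score": 0.99},
    {"Text": "fever", "Category": "MEDICAL_CONDITION", "Type": "DX_NAME", "Score": 0.98},
    {"Text": "liver", "Category": "ANATOMY", "Type": "SYSTEM_ORGAN_SITE", "Score": 0.95},
]
grouped = group_by_category(sample_entities)
print(grouped)
```

With real data, you would pass the result of detect_entities(page_text) to group_by_category instead of the sample list.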

We prepare the Amazon Kendra index by defining custom attributes of type STRING_LIST corresponding to the entity categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. For each document, the DetectEntitiesV2 operation of Amazon Comprehend Medical returns a categorized list of entities. Each entity from this list with a sufficiently high confidence score (for this use case, greater than 0.97) is added to the custom attribute corresponding to its category. After all the detected entities are processed in this way, the populated attributes are used to generate the metadata JSON file corresponding to that document. Amazon Kendra has an upper limit of 10 strings for an attribute of STRING_LIST type. In this example, we take the top 10 entities with the highest frequency of occurrence in the processed document.
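The filtering and ranking described above can be sketched in Python as follows. This is a minimal sketch, not the code from meds.py: the function names are hypothetical, lowercasing entity text before counting is an assumption, and the output assumes the Amazon Kendra S3 metadata layout with a top-level Attributes object.

```python
import json
from collections import Counter, defaultdict

CATEGORIES = [
    "ANATOMY", "MEDICAL_CONDITION", "MEDICATION",
    "PROTECTED_HEALTH_INFORMATION", "TEST_TREATMENT_PROCEDURE", "TIME_EXPRESSION",
]
SCORE_THRESHOLD = 0.97  # confidence threshold used in this post
MAX_VALUES = 10         # STRING_LIST limit mentioned in this post


def build_attributes(entities, threshold=SCORE_THRESHOLD, max_values=MAX_VALUES):
    """Keep high-confidence entities and pick the most frequent ones per category."""
    counts = defaultdict(Counter)
    for entity in entities:
        if entity["Category"] in CATEGORIES and entity["Score"] > threshold:
            # Assumption: normalize case so "Fever" and "fever" count together.
            counts[entity["Category"]][entity["Text"].lower()] += 1
    return {
        category: [text for text, _ in counter.most_common(max_values)]
        for category, counter in counts.items()
    }


def build_metadata(attributes, title=None):
    """Serialize the attributes as a metadata JSON document."""
    metadata = {"Attributes": attributes}
    if title:
        metadata["Title"] = title
    return json.dumps(metadata, indent=2)


# Hypothetical entities for illustration only.
entities = [
    {"Text": "fever", "Category": "MEDICAL_CONDITION", "Score": 0.99},
    {"Text": "Fever", "Category": "MEDICAL_CONDITION", "Score": 0.98},
    {"Text": "pain", "Category": "MEDICAL_CONDITION", "Score": 0.98},
    {"Text": "cough", "Category": "MEDICAL_CONDITION", "Score": 0.50},
]
attributes = build_attributes(entities)
print(build_metadata(attributes, title="Paracetamol"))
```

In this sketch, "cough" is dropped because its score is below the threshold, and "fever" is listed first because it occurs most often.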

After the metadata JSON files for all the documents are created, they’re copied to the Amazon Simple Storage Service (Amazon S3) bucket configured as a data source to the Amazon Kendra index, and a data source sync is performed to ingest the documents in the index along with the metadata.

Prerequisites

To deploy and work with the solution in this post, make sure you have the following:

An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Basic knowledge of AWS and the AWS Command Line Interface (AWS CLI). For more information about the AWS CLI, see AWS CLI Command Reference.
An S3 bucket to store the documents and metadata. For more information, see Creating a bucket and What is Amazon S3?
Access to AWS CloudShell, Amazon Kendra, and Amazon Comprehend Medical.

Architecture

We use the AWS CloudFormation template medkendratemplate.yaml to deploy an Amazon Kendra index with the custom attributes of type STRING_LIST corresponding to the entity categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION.
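The template itself isn't reproduced in this post. As a hedged illustration of what such custom attributes look like, the following Python sketch builds equivalent attribute definitions and applies them with the Amazon Kendra UpdateIndex API through boto3. The function names and default Region are hypothetical, and the facetable settings simply mirror what the facet definition page shows later in this post.

```python
# Categories that are not facetable by default in this walkthrough.
NOT_FACETABLE = {"PROTECTED_HEALTH_INFORMATION", "TIME_EXPRESSION"}

CATEGORIES = [
    "ANATOMY", "MEDICAL_CONDITION", "MEDICATION",
    "PROTECTED_HEALTH_INFORMATION", "TEST_TREATMENT_PROCEDURE", "TIME_EXPRESSION",
]


def attribute_config(name):
    """Describe one custom attribute of type STRING_LIST."""
    return {
        "Name": name,
        "Type": "STRING_LIST_VALUE",
        "Search": {
            "Facetable": name not in NOT_FACETABLE,
            "Displayable": True,
        },
    }


def build_configs(categories=CATEGORIES):
    return [attribute_config(name) for name in categories]


def apply_configs(index_id, region_name="us-east-1"):
    """Apply the definitions to an existing index (requires AWS credentials)."""
    import boto3  # imported lazily so build_configs runs without AWS access

    kendra = boto3.client("kendra", region_name=region_name)
    kendra.update_index(
        Id=index_id,
        DocumentMetadataConfigurationUpdates=build_configs(),
    )


configs = build_configs()
print([c["Name"] for c in configs if c["Search"]["Facetable"]])
```

You don't need to run this for the walkthrough, because the CloudFormation stack creates the attributes for you.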

The following diagram illustrates our solution architecture.

Based on this architecture, the steps to build and use the solution are as follows:

On CloudShell, a Bash script called getpages.sh downloads Wikipedia pages of the medicines and stores them as text files.
A Python script called meds.py, which contains the core logic of the automation of the metadata generation, makes the detect_entities_v2 API call to Amazon Comprehend Medical to detect entities for each of the Wikipedia pages and generate metadata based on the entities returned. The steps used in this script are as follows:
1. Split the Wikipedia page text into chunks smaller than the maximum text size allowed by the detect_entities_v2 API call.
2. Make the detect_entities_v2 call.
3. Filter the entities detected by the detect_entities_v2 call using a threshold confidence score (0.97 for this example).
4. Keep track of each unique entity corresponding to its category and the frequency of occurrence of that entity.
5. For each entity category, sort the entities in that category from highest to lowest frequency of occurrence and select the top 10 entities.
6. Create a metadata object based on the selected entities and output it in JSON format.
We use the AWS CLI to copy the text data and the metadata to the S3 bucket that is configured as a data source to the Amazon Kendra index using the S3 connector.
We perform a data source sync using the Amazon Kendra console to ingest the contents of the documents along with the metadata in the Amazon Kendra index.
Finally, we use the Amazon Kendra search console to make queries to the index.
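The first step in the meds.py outline above, splitting a page into chunks below the API size limit, can be sketched in Python as follows. The function name is hypothetical, splitting on whitespace is one possible strategy rather than the one the script necessarily uses, and the default limit of 20,000 bytes is an assumption that you should check against the current Amazon Comprehend Medical quotas.

```python
def split_into_chunks(text, max_bytes=20000):
    """Split text on whitespace into chunks smaller than max_bytes (UTF-8).

    A single word longer than max_bytes still ends up in its own oversized
    chunk; this sketch does not handle that edge case.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + word_size > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


# Small limit used only to make the behavior visible.
chunks = split_into_chunks("alpha beta gamma delta", max_bytes=12)
print(chunks)
```

Each chunk can then be passed to detect_entities_v2 in turn, and the entities from all chunks of a page combined before filtering and counting.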

Create an Amazon S3 bucket to be used as a data source

Create an Amazon S3 bucket that you will use as a data source for the Amazon Kendra index.

Deploy the infrastructure as a CloudFormation stack

To deploy the infrastructure and resources for this solution, complete the following steps:

In a separate browser tab, open the AWS Management Console, and make sure that you’re logged in to your AWS account. Click the following button to launch the CloudFormation stack to deploy the infrastructure.

After that you should see a page similar to the following image:

For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.

Stack creation can take 30–45 minutes to complete. You can monitor the stack creation status on the Stack info tab. You can also look at the different tabs, such as Events, Resources, and Template. While the stack is being created, you can work on getting the data and generating the metadata in the next few steps.

Get the data and generate the metadata

To fetch your data and start generating metadata, complete the following steps:

On the AWS Management Console, choose the icon circled in red in the following picture to start AWS CloudShell.

Copy the file code-data.tgz and extract its contents by using the following commands on AWS CloudShell:

aws s3 cp s3://aws-ml-blog/artifacts/build-an-intelligent-search-solution-with-automated-content-enrichment/code-data.tgz .
tar zxvf code-data.tgz

Change the working directory to code-data:

cd code-data

At this point, you have two options. You can run the end-to-end workflow: get the data, create the metadata using Amazon Comprehend Medical (which takes about 35–40 minutes), and then ingest the data along with the metadata in the Amazon Kendra index. Alternatively, you can complete only the last step and ingest the data with the metadata that has already been generated using Amazon Comprehend Medical and is supplied in the package for convenience.

To use the metadata supplied in the package, enter the following code and then jump to Step 6:

tar zxvf med-data-meta.tgz

Perform this step to get a hands-on experience of building the end-to-end solution. The following command runs a bash script called main.sh, which calls the following scripts:
1. prereq.sh to install prerequisites and create subdirectories to store data and metadata
2. getpages.sh to get the Wikipedia pages of medicines in the list
3. getmetapar.sh to call the meds.py Python script for each document

./main.sh

The Python script meds.py contains the logic to make the detect_entities_v2 call to Amazon Comprehend Medical and then process the output to produce the JSON metadata file. It takes about 30–40 minutes for this to complete.

While performing Step 5, if CloudShell times out, security tokens get refreshed, or the script stops before all the data is processed, start the CloudShell session again and run getmetapar.sh, which starts the data processing from the point it was stopped:

./getmetapar.sh

Upload the data and metadata to the S3 bucket being used as the data source for the Amazon Kendra index using the following AWS CLI commands:

aws s3 cp Data/ s3://<REPLACE-WITH-NAME-OF-YOUR-S3-BUCKET>/Data/ --recursive
aws s3 cp Meta/ s3://<REPLACE-WITH-NAME-OF-YOUR-S3-BUCKET>/Meta/ --recursive

Review Amazon Kendra configuration and start the data source sync

Before starting this step, make sure that the CloudFormation stack creation is complete. In the following steps, we start the data source sync to begin crawling and indexing documents.

On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.

In the navigation pane, choose Data sources.
On the Settings tab, you can see the data source bucket being configured.
Choose the data source and choose Sync now.

The data source sync can take 10–15 minutes to complete.
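If you prefer to start the sync programmatically instead of from the console, the following Python sketch shows one way to do it with the Amazon Kendra StartDataSourceSyncJob and ListDataSourceSyncJobs APIs. The function name and polling interval are hypothetical, the index and data source IDs are placeholders that you must supply, and the sketch reads only the first page of the sync history.

```python
import time


def sync_and_wait(kendra, index_id, data_source_id, poll_seconds=30):
    """Start a data source sync and poll until it leaves the syncing states.

    `kendra` is expected to be a boto3 Kendra client, for example
    boto3.client("kendra").
    """
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]
    while True:
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )["History"]
        # Treat a job that is not listed yet as still syncing.
        status = next(
            (job["Status"] for job in history if job["ExecutionId"] == execution_id),
            "SYNCING",
        )
        if status not in ("SYNCING", "SYNCING_INDEXING"):
            return status
        time.sleep(poll_seconds)
```

A call such as sync_and_wait(boto3.client("kendra"), index_id, data_source_id) returns the final status, for example SUCCEEDED or FAILED.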

Observe Amazon Kendra index facet definition

In the navigation pane, choose Facet definition. The following screenshot shows the entries for ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. These are the categories of the entities detected by Amazon Comprehend Medical, and they are defined as custom attributes in the CloudFormation template that we used to create the Amazon Kendra index. The facetable check boxes for PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION aren’t selected, so these two categories aren’t shown in the facets of the search user interface.

Query the repository of WHO Model List of Essential Medicines

We’re now ready to make queries to our search solution.

On the Amazon Kendra console, navigate to your index and choose Search console.
In the search field, enter What is the treatment for diabetes?

The following screenshot shows the results.

Choose Filter search results to see the facets.

The headings MEDICATION, ANATOMY, MEDICAL_CONDITION, and TEST_TREATMENT_PROCEDURE are the categories defined as Amazon Kendra facets, and the items listed underneath them are the entities of these categories as detected by Amazon Comprehend Medical in the documents being searched. PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION are not shown because they aren’t facetable.

Under MEDICAL_CONDITION, select pregnancy to refine the search results.
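Selecting a facet value in the search console corresponds roughly to passing an attribute filter to the Amazon Kendra Query API. The following Python sketch shows what such a filtered query could look like with boto3. The function names and default Region are hypothetical, the index ID is a placeholder, and the console may build its filter differently.

```python
def facet_filter(key, values):
    """Build an AttributeFilter matching documents whose STRING_LIST
    attribute `key` contains any of the given values."""
    return {
        "ContainsAny": {
            "Key": key,
            "Value": {"StringListValue": list(values)},
        }
    }


def search(query_text, index_id, attribute_filter=None, region_name="us-east-1"):
    """Query the index and return the titles of the result items."""
    import boto3  # imported lazily so facet_filter runs without AWS access

    kendra = boto3.client("kendra", region_name=region_name)
    kwargs = {"IndexId": index_id, "QueryText": query_text}
    if attribute_filter:
        kwargs["AttributeFilter"] = attribute_filter
    response = kendra.query(**kwargs)
    return [
        item.get("DocumentTitle", {}).get("Text", "")
        for item in response["ResultItems"]
    ]


pregnancy_filter = facet_filter("MEDICAL_CONDITION", ["pregnancy"])
print(pregnancy_filter)
```

For example, search("What is the treatment for diabetes?", index_id, pregnancy_filter) would restrict results to documents whose MEDICAL_CONDITION attribute includes pregnancy.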

You can go back to the Facet definition page and make PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION facetable and save the configuration. Go back to the search console, make a new query, and observe the facets again. Experiment with these facets to see what suits your needs best.

Make additional queries and use the facets to refine the search results. You can use the following queries to get started, but you can also experiment with your own:

What is a common painkiller?
Is paracetamol safe for children?
How to manage high blood pressure?
When should BCG vaccine be administered?

You can observe how domain-specific facets improve the search experience.

Infrastructure cleanup

To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.

When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking on the Amazon Kendra console that the index is deleted.

You must delete your data source bucket separately because it wasn’t created as part of the CloudFormation stack.

Conclusion

In this post, we demonstrated how to automate the process to enrich the content by generating domain-specific metadata for an Amazon Kendra index using Amazon Comprehend or Amazon Comprehend Medical, thereby improving the user experience for the search solution.

This example used the entities detected by Amazon Comprehend Medical to generate the Amazon Kendra metadata. Depending on the domain of the content repository, you can use a similar approach with the pretrained model or custom trained models of Amazon Comprehend. You can further enhance the metadata by using other elements such as protected health information (PHI) for Amazon Comprehend Medical and events, key phrases, personally identifiable information (PII), dominant language, sentiment, and syntax for Amazon Comprehend. Try out our solution and let us know what you think!

About the Authors

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.

Udi Hershkovich has been a Principal WW AI/ML Service Specialist at AWS since 2018. Prior to AWS, Udi held multiple leadership positions with AI startups and Enterprise initiatives including co-founder and CEO at LeanFM Technologies, offering ML-powered predictive maintenance in facilities management, CEO of Safaba Translation Solutions, a machine translation startup acquired by Amazon in 2015, and Head of Professional Services for Contact Center Intelligence at Amdocs. Udi holds Law and Business degrees from the Interdisciplinary Center in Herzliya, Israel, and lives in Pittsburgh, Pennsylvania, USA.
