Search for answers accurately using Amazon Kendra S3 Connector with VPC support

Amazon Kendra is an easy-to-use intelligent search service that allows you to integrate search capabilities with your applications so users can find information stored across data sources like Amazon Simple Storage Service (Amazon S3), OneDrive, and Google Drive; applications such as Salesforce, SharePoint, and ServiceNow; and relational databases like Amazon Relational Database Service (Amazon RDS). Amazon Kendra connectors enable you to synchronize data from multiple content repositories with your Amazon Kendra index. When end-users ask natural language questions, Amazon Kendra uses machine learning (ML) algorithms to understand the context and return the most relevant answers.

The Amazon Kendra S3 connector supports indexing documents and their associated metadata stored in an S3 bucket. You often want applications running inside a VPC to have access only to specific S3 buckets, and in many cases the connection must not traverse the internet to reach public endpoints. Many customers, however, own multiple S3 buckets, some of which are accessible only through VPC endpoints for Amazon S3. In this post, we describe how to use the updated Amazon Kendra S3 connector with VPC support to reach such buckets through VPC endpoints.

This post provides the steps to help you create an enterprise search engine on AWS using Amazon Kendra by connecting documents stored in an S3 bucket that is only accessible from within a VPC. For more information, see enhancing enterprise search with Amazon Kendra. The post also demonstrates how to configure your connector for Amazon S3 and how your index syncs with your data source when your data source content changes.

Overview of solution

There are three main improvements to the Amazon Kendra S3 connector:

VPC support – The connector now supports using your Amazon Virtual Private Cloud (Amazon VPC) networks. You can now securely connect to Amazon S3 using VPC endpoints for Amazon S3 by specifying the VPC connection, subnet and security groups.
Two sync modes – When you schedule a sync of a data source in Amazon S3 to an Amazon Kendra index, you can now choose to run in Full sync mode or New, modified and deleted document sync mode. In Full sync mode, every time the synchronization runs, it scans objects in every folder under the root path it was configured to crawl and re-ingests all documents. The full refresh enables you to reset the index without the need to delete and create a new data source. In New, modified and deleted document sync mode, every time the sync job runs, it processes only objects that were added, modified, or deleted since the last crawl. Incremental crawls can reduce runtime and cost when used with datasets that append new objects to existing data sources on a regular basis.
Additional inclusion and exclusion patterns for documents – In addition to prefixes, we’re introducing patterns for including documents in your index or excluding them from it. Two pattern types are supported: Unix-style glob patterns and file types. You can now add a pattern to include specific folders or to exclude folders, file types, or specific files from your data source. This can be useful for shared data repositories that contain content belonging to different categories, classifications, and file types.
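
To make the inclusion and exclusion patterns more concrete, the following is a minimal sketch of how glob patterns could select S3 object keys. It uses Python's standard `fnmatch` module as a stand-in; the connector's own matcher may treat `*` and `/` differently, and the object keys, the `select_keys` helper, and the rule that an exclusion wins over an inclusion are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def select_keys(keys, include=(), exclude=()):
    """Return keys that match any include glob (or all keys if none is given)
    and that match no exclude glob. Hypothetical helper, not a Kendra API."""
    selected = []
    for key in keys:
        included = not include or any(fnmatchcase(key, p) for p in include)
        excluded = any(fnmatchcase(key, p) for p in exclude)
        # Assumption: an exclusion pattern takes precedence over an inclusion.
        if included and not excluded:
            selected.append(key)
    return selected


# Hypothetical object keys, loosely modeled on the folders used in this post.
keys = [
    "Best Practices/backup.pdf",
    "Databases/rds-overview.pdf",
    "Databases/notes.txt",
    "General/readme.md",
]

print(select_keys(keys, include=["Databases/*"], exclude=["*.txt"]))
```

Here the include pattern keeps only the Databases folder, and the exclude pattern then drops the text file from it.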

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
Basic knowledge of AWS.
An S3 bucket for your documents. For more information, see Create a Bucket and the Amazon S3 User Guide.
Amazon S3 access with VPC endpoints. For details, see Gateway endpoints for Amazon S3.
Subnets and security groups.
An Amazon Kendra index. For instructions, see Creating an Index.
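
For readers who want the bucket to be reachable only through a VPC endpoint, one common approach is a bucket policy that denies requests not arriving through that endpoint, using the `aws:SourceVpce` condition key. The sketch below only builds and prints such a policy; the bucket name, account ID, endpoint ID, and the `vpce_only_policy` helper are placeholders, not values from this post.

```python
import json


def vpce_only_policy(bucket_name, vpc_endpoint_id):
    """Build a bucket policy that denies S3 access unless the request
    comes through the given VPC endpoint. Hypothetical helper."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            }
        ],
    }


# Placeholder bucket name and endpoint ID; substitute your own.
policy = vpce_only_policy("kendrapost-111122223333", "vpce-1a2b3c4d")
print(json.dumps(policy, indent=2))
```

A policy like this also blocks requests made from the Amazon S3 console outside the endpoint, so if you follow the console upload steps in this post, apply it only after the documents are in place.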

Create and configure your document repository

Before you can index documents with Amazon Kendra, you need to load them into an S3 bucket. This section contains instructions to create an S3 bucket, get the files, and load them into the bucket. After completing all the steps in this section, you have a data source that Amazon Kendra can use.

On the AWS Management Console, in the Region list, choose US East (N. Virginia) or any Region of your choice that Amazon Kendra is available in.
Choose Services.
Under Storage, choose S3.
On the Amazon S3 console, choose Create bucket.
Under General configuration, provide the following information:
- For Bucket name, enter kendrapost-{your account id}.
- For Region, choose the same Region that you use to deploy your Amazon Kendra index (this post uses us-east-1).
- Under Bucket settings, for Block Public Access, leave everything with the default values.
Under Advanced settings, leave everything with the default values.
Choose Create bucket.
Download AWS_Whitepapers.zip and unzip the files.
On the Amazon S3 console, select the bucket that you just created and choose Upload.
Upload the folders Best Practices, Databases, General, and Machine Learning from the unzipped file.

Inside your bucket, you should now see four folders.
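
If you prefer to script the upload instead of using the console, the steps above can be sketched with the AWS SDK for Python (boto3). This is an illustration under assumptions: `plan_uploads` and `upload` are hypothetical helpers, the local directory layout is assumed to match the unzipped archive, and the short demo at the end uses a temporary directory with an empty placeholder file rather than the real whitepapers.

```python
import tempfile
from pathlib import Path

# The four folders uploaded in this section.
FOLDERS = ["Best Practices", "Databases", "General", "Machine Learning"]


def plan_uploads(root, folders=FOLDERS):
    """Return (local_path, s3_key) pairs for every file under the folders.
    Keys keep the folder name so the bucket mirrors the local layout."""
    root = Path(root)
    pairs = []
    for folder in folders:
        for path in sorted((root / folder).rglob("*")):
            if path.is_file():
                pairs.append((path, path.relative_to(root).as_posix()))
    return pairs


def upload(root, bucket_name):
    """Upload the planned files. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the planning step runs without the SDK

    s3 = boto3.client("s3")
    for path, key in plan_uploads(root):
        s3.upload_file(str(path), bucket_name, key)


# Dry run against a throwaway directory containing one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Databases").mkdir()
    (Path(tmp) / "Databases" / "example.pdf").write_bytes(b"")
    demo_keys = [key for _, key in plan_uploads(tmp)]

print(demo_keys)
```

To run it for real, call `upload` with the path to the unzipped files and the name of the bucket you created.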

Add a data source

A data source is a location that stores the documents for indexing. You can synchronize data sources automatically with an Amazon Kendra index to make sure that searches correctly reflect new, updated, or deleted documents in the source repositories.

After completing all the steps in this section, you’ll have a data source linked to Amazon Kendra. For more information, see Adding documents from a data source.

Before continuing, make sure that the index creation is complete and the index shows as Active. For more information, see Creating an Index.

On the Amazon Kendra console, navigate to your index (for this post, kendra-blog-index).
On the kendra-blog-index page, choose Add data sources.
Under Amazon S3, choose Add connector.

For more information about the different data sources that Amazon Kendra supports, see Adding documents from a data source.

In the Specify data source details section, for Data source name, enter aws_white_paper.
For Description, enter AWS White Paper documentation.
Choose Next.

Now you create an AWS Identity and Access Management (IAM) role for Amazon Kendra.

On the Define access and security page, in the IAM role section, choose Create a new role.
For Role name, enter source-role (your role name is prefixed with AmazonKendra-).
In the Configure VPC and security section, choose your VPC, and enter your Subnets and VPC security groups.

For more information on connecting Amazon Kendra to your VPC, see Configuring Amazon Kendra to use a VPC.

Choose Next.
On the Configure sync settings page, for Enter the data source location, enter the S3 bucket you created: kendrapost-{your account id}.
Leave Metadata files prefix folder location blank.

By default, metadata files are stored in the same directory as the documents. If you want to place these files in a different folder, you can add a prefix. For more information, see Amazon S3 document metadata.

For Select decryption key, leave it deselected.
For Additional configuration, you can add a pattern to include or exclude certain folders or files. For this post, keep the default values.
For Sync mode, choose New, modified, or deleted documents sync.
For Frequency, choose Run on demand.

This step defines the frequency with which the data source is synchronized with the Amazon Kendra index.

Choose Next.
On the Set field mappings page, keep the default values.
Choose Next.
On the Review and create page, choose Add data source.
Navigate back to your Kendra index.
Choose your Data Source, then choose Sync now to synchronize the documents with the Amazon Kendra index.
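
The console steps above have a rough programmatic counterpart in the boto3 `create_data_source` call. The following sketch builds the request separately from sending it, so you can inspect it first. It assumes that the top-level `VpcConfiguration` parameter is what carries the subnets and security groups for the S3 connector, and it does not set a sync mode, because this post only covers choosing the sync mode in the console. The helper names and the placeholder IDs are hypothetical.

```python
def build_s3_data_source_request(index_id, bucket_name, role_arn,
                                 subnet_ids, security_group_ids):
    """Build keyword arguments for kendra.create_data_source.
    No Schedule is set, which corresponds to running the sync on demand."""
    return {
        "IndexId": index_id,
        "Name": "aws_white_paper",
        "Type": "S3",
        "Description": "AWS White Paper documentation",
        "RoleArn": role_arn,
        "Configuration": {"S3Configuration": {"BucketName": bucket_name}},
        # Assumption: the VPC settings from the console map to this parameter.
        "VpcConfiguration": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }


def create_data_source(request):
    """Send the request. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request can be built without the SDK

    kendra = boto3.client("kendra")
    return kendra.create_data_source(**request)["Id"]


# Placeholder identifiers; substitute the values from your own account.
request = build_s3_data_source_request(
    index_id="your-index-id",
    bucket_name="kendrapost-111122223333",
    role_arn="arn:aws:iam::111122223333:role/AmazonKendra-source-role",
    subnet_ids=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)
print(request["Configuration"])
```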

The duration of this process depends on the number of documents that you index. For this use case, it may take 15 minutes, after which you should see a message that the sync was successful. In the Sync run history section, you can see that 40 documents were synchronized.

Your Amazon Kendra index is now ready for natural language queries. When you search your index, Amazon Kendra uses all the data and metadata provided to return the most accurate answers to your search query. On the Amazon Kendra console, choose Search indexed content. In the query field, start with a query such as “Which AWS service has 11 nines of durability?”

For more information about querying the index, see Querying an Index.
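
The same query can also be sent from code with the boto3 `query` call. The sketch below separates the API call from a small helper that trims the response down to the fields most people look at first. The `top_results` and `search` helpers are hypothetical, and the sample response is invented purely to show the response shape; it is not output from the index built in this post.

```python
def top_results(response, limit=3):
    """Extract type, title, and excerpt from a Kendra query response."""
    results = []
    for item in response.get("ResultItems", [])[:limit]:
        results.append({
            "type": item.get("Type"),
            "title": item.get("DocumentTitle", {}).get("Text", ""),
            "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
        })
    return results


def search(index_id, question):
    """Query the index. Requires boto3 and AWS credentials."""
    import boto3  # imported here so top_results can be used without the SDK

    kendra = boto3.client("kendra")
    return top_results(kendra.query(IndexId=index_id, QueryText=question))


# Invented response fragment, used only to exercise top_results.
sample_response = {
    "ResultItems": [
        {
            "Type": "ANSWER",
            "DocumentTitle": {"Text": "Example whitepaper"},
            "DocumentExcerpt": {"Text": "Example excerpt text."},
        }
    ]
}
print(top_results(sample_response))
```

For example, `search("your-index-id", "Which AWS service has 11 nines of durability?")` would run the query suggested above against your own index.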

Synchronize data source changes to search the index

Your data source is set up to sync any new, modified or deleted data. Before you can synchronize your data source incrementally with an index in Amazon Kendra, you need to load new documents into an S3 bucket.

On the Amazon S3 console, select the bucket that you just created and choose Upload.
Upload the folders Security and Well_Architected from the unzipped file.

Now you can synchronize the new documents added to the S3 bucket:

On the Amazon Kendra console, choose Data sources and then select your S3 data source.
Choose Sync Now.

The duration of this process depends on the number of documents that you index. For this use case, it may take 15 minutes, after which you should see a message that the sync was successful.

In the Sync run history section, you can see that 20 documents were synchronized.
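
The Sync now step and the Sync run history check can also be scripted. The following sketch starts a sync job with `start_data_source_sync_job`, polls `list_data_source_sync_jobs` until the job leaves its in-progress states, and summarizes the document counts. It assumes the job appears on the first page of the history, and the helper names and the sample job at the end are illustrative; the sample simply mirrors the 20 documents mentioned above.

```python
import time

IN_PROGRESS = ("SYNCING", "SYNCING_INDEXING", "STOPPING")


def summarize_sync(job):
    """Return (status, counts) for one entry of the sync job history.
    The API reports metric values as strings, so they are converted to int."""
    metrics = job.get("Metrics", {})
    names = ("DocumentsAdded", "DocumentsModified",
             "DocumentsDeleted", "DocumentsFailed")
    counts = {name: int(metrics.get(name) or 0) for name in names}
    return job.get("Status"), counts


def sync_and_report(index_id, data_source_id, poll_seconds=60):
    """Start a sync and wait for it. Requires boto3 and AWS credentials."""
    import boto3  # imported here so summarize_sync works without the SDK

    kendra = boto3.client("kendra")
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id)["ExecutionId"]
    while True:
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id)["History"]
        job = next((j for j in history
                    if j.get("ExecutionId") == execution_id), None)
        if job and job.get("Status") not in IN_PROGRESS:
            return summarize_sync(job)
        time.sleep(poll_seconds)


# Illustrative history entry, not real output.
sample_job = {
    "ExecutionId": "example-execution-id",
    "Status": "SUCCEEDED",
    "Metrics": {"DocumentsAdded": "20", "DocumentsModified": "0",
                "DocumentsDeleted": "0", "DocumentsFailed": "0"},
}
print(summarize_sync(sample_job))
```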

Re-index the data source

In a scenario where the data source has stale information, you can now re-index the data source without having to delete and create a new data source. To modify the sync mode and re-index the data source, complete the following steps:

On the Amazon Kendra console, choose Data sources and then select your S3 data source.
On the Actions menu, choose Edit.
Choose Next until you reach the Configure sync settings page (Step 3).
For Sync mode, select Full Sync.
For Frequency, choose Run on demand.
Choose Next.
On the Set field mappings page, keep the default values.
Choose Next.
On the Review and create page, choose Update.

Now you can re-index all the documents in the S3 bucket.

On the Amazon Kendra console, choose Data sources and then select your S3 data source.
Choose Sync Now.

In the Sync run history section, you can see that all documents were synchronized irrespective of the previous sync status under the modified column.
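
The difference between the two sync modes can be illustrated with a toy model. This is not how the connector is implemented; it is a sketch that assumes each document carries a change marker (such as an S3 ETag) and that both modes remove documents that no longer exist in the bucket. The `plan_sync` function and the file names are hypothetical.

```python
def plan_sync(previous, current, mode):
    """Decide which keys to ingest and delete for one sync run.

    previous/current map object key -> change marker (for example an ETag)
    as seen at the last crawl and at this crawl.
    """
    deleted = sorted(set(previous) - set(current))
    if mode == "full":
        # Full sync: re-ingest everything, whether or not it changed.
        ingest = sorted(current)
    elif mode == "incremental":
        # New, modified and deleted sync: only new or changed objects.
        ingest = sorted(key for key, marker in current.items()
                        if previous.get(key) != marker)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"ingest": ingest, "delete": deleted}


previous = {"a.pdf": "v1", "b.pdf": "v1", "c.pdf": "v1"}
current = {"a.pdf": "v1", "b.pdf": "v2", "d.pdf": "v1"}

print(plan_sync(previous, current, "incremental"))
print(plan_sync(previous, current, "full"))
```

In the incremental run only `b.pdf` (modified) and `d.pdf` (new) are ingested, while the full run also re-ingests the unchanged `a.pdf`; both runs drop `c.pdf`.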

Clean up

To avoid incurring future charges and to clean out unused roles and policies, delete the resources you created:

On the Amazon Kendra console, choose Indexes in the navigation pane.
Select the index you created and on the Actions menu, choose Delete.
To confirm deletion, enter Delete when prompted and choose Delete.

Wait until you get the confirmation message; the process can take up to 15 minutes.

On the Amazon S3 console, delete the S3 bucket.
On the IAM console, delete the corresponding IAM roles.

Conclusion

In this post, you learned how to use Amazon Kendra to deploy an enterprise search service using a secure connection to Amazon S3 that doesn’t require an internet gateway or Network Address Translation (NAT) device. You can also shorten syncs by choosing the New, modified and deleted document sync mode, which processes only the documents that changed since the last sync.

There are many additional features that we didn’t cover. For example:

You can enable user-based access control for your Amazon Kendra index, and restrict access to documents based on the access controls you have already configured.
You can map object attributes to Amazon Kendra index attributes, and enable them for faceting, search, and display in the search results.
You can quickly find information from webpages (HTML tables) using Amazon Kendra tabular search.

To learn more about Amazon Kendra, refer to the Amazon Kendra Developer Guide.

About the Authors

Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel.

Arjun Agrawal is a Software Engineer at AWS, currently working with the Amazon Kendra team on an enterprise search engine. He is passionate about new technology and solving real-world problems. Outside of work, he loves to hike and travel.
