From edge computing and causal reasoning to differential privacy and visual-field mapping, the top blog posts of the year display the range of scientific research at Amazon.
The 10 most viewed publications of 2022
From a look back at Amazon Redshift to personalized complementary product recommendation, these are the most viewed publications authored by Amazon scientists and collaborators in 2022.
Improving automatic discrimination of logos with similar texts
Combining contrastive training and selection of hard negative examples establishes new benchmarks.
Recent honors and awards for Amazon scientists
Researchers honored for their contributions to the scientific community.
How to redact PII data in conversation transcripts
Customer service interactions often contain personally identifiable information (PII) such as names, phone numbers, and dates of birth. As organizations incorporate machine learning (ML) and analytics into their applications, using this data can provide insights on how to create more seamless customer experiences. However, the presence of PII often restricts the use of this data. In this blog post, we will review a solution to automatically redact PII from a customer service conversation transcript.
Let’s take an example conversation between a customer and a call center agent.
Agent: Hi, thank you for calling us today. Whom do I have the pleasure of speaking with today?
Caller: Hello, my name is John Stiles.
Agent: Hi John, how may I help you?
Caller: I haven’t received my W2 statement yet and wanted to check on its status.
Agent: Sure, I can help you with that. Can you please confirm the last four digits of your Social Security number?
Caller: Yes, it’s 1111.
Agent: Ok. I’m pulling up the status now. I see that it was sent out yesterday, and the estimated arrival is early next week. Would you like me to turn on automated alerts so you can be notified of any delays?
Caller: Yes, please.
Agent: The number we have on file for you is 555-456-7890. Is that still correct?
Caller: Yes, it is.
Agent: Great. I have turned on automated notifications. Is there anything else I can assist you with John?
Caller: No, that’s all. Thank you.
Agent: Thank you, John. Have a great day.
In this brief interaction, there are several pieces of data that would generally be considered PII, including the caller’s name, the last four digits of their Social Security number, and the phone number. Let’s review how we can redact this PII data in the transcript.
Solution overview
We will create an AWS Step Functions state machine, which orchestrates an Amazon Comprehend PII redaction job. Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover valuable insights and connections in text, including the ability to detect and redact PII data.
You will provide the transcripts in the input Amazon S3 bucket. The transcripts are in the format used by Contact Lens for Amazon Connect. You will also specify an output S3 bucket, which stores the redaction output as well as intermediate data. The intermediate data are micro-batched versions of the input data. For example, if there are 10,000 conversations to be redacted, the workflow will split them into 10 batches of 1,000 conversations each. Each batch is stored under a unique prefix, which is then used as the input source for Comprehend. The Step Functions Map state runs these redaction jobs in parallel by calling the StartPiiEntitiesDetectionJob API, so multiple jobs run concurrently rather than one after another. Because the job is implemented as a Step Functions state machine, it can be triggered manually or automatically as part of a daily process.
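The micro-batching step described above can be sketched in Python as follows. This is a minimal illustration, not the code deployed by the sample solution: the helper names (`chunk`, `plan_batches`, `build_redaction_job_request`, `start_job`), the `batch-0000/`-style prefix naming, and the choice to redact all entity types are assumptions made for the example. The request fields follow the Amazon Comprehend `StartPiiEntitiesDetectionJob` API as exposed by boto3.

```python
from typing import Dict, Iterator, List, Tuple


def chunk(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def plan_batches(keys: List[str], batch_size: int = 1000) -> List[Tuple[str, List[str]]]:
    """Assign each micro-batch a unique prefix (hypothetical naming scheme)."""
    return [
        (f"batch-{index:04d}/", batch)
        for index, batch in enumerate(chunk(keys, batch_size))
    ]


def build_redaction_job_request(
    input_s3_uri: str, output_s3_uri: str, role_arn: str, job_name: str
) -> Dict:
    """Build the arguments for one Comprehend PII redaction job."""
    return {
        "JobName": job_name,
        "LanguageCode": "en",
        "Mode": "ONLY_REDACTION",
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "RedactionConfig": {
            # Assumption: redact every supported entity type and replace
            # each match with its type name, e.g. [NAME].
            "PiiEntityTypes": ["ALL"],
            "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
        },
    }


def start_job(request: Dict) -> str:
    """Submit one job; needs boto3 and AWS credentials, so the import is deferred."""
    import boto3

    comprehend = boto3.client("comprehend")
    return comprehend.start_pii_entities_detection_job(**request)["JobId"]
```

In the deployed workflow, the Map state plays the role of the loop that would call `start_job` once per planned batch.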
You can learn more about how Comprehend detects and redacts PII data in this blog post.
Deploy the sample solution
First, sign in to the AWS Management Console in your AWS account.
You will need an S3 bucket with some sample transcript data to redact and another bucket for output. If you don’t have existing sample transcript data, follow these steps:
- Navigate to the Amazon S3 console.
- Choose Create bucket.
- Enter a bucket name, such as text-redaction-data-<your-account-number>.
- Accept the defaults, and choose Create bucket.
- Open the bucket you created, and choose Create folder.
- Enter a folder name, such as “sample-data” and choose Create folder.
- Click on your new folder name to open it.
- Download the SampleData.zip file.
- Open the .zip file on your local computer and then drag the folder to the S3 bucket you created.
- Choose Upload.
Now click the following link to deploy the sample solution to US East (N. Virginia):
This will create a new AWS CloudFormation stack.
Enter the Stack name (e.g., pii-redaction-workflow), the name of the S3 input bucket containing the input transcript data, and the name of the S3 output bucket. Choose Next and add any tags that you want for your stack (optional). Choose Next again and review the stack details. Select the checkbox to acknowledge that AWS Identity and Access Management (IAM) resources will be created, and then choose Create stack.
The CloudFormation stack will create an IAM role with the ability to list and read the objects from the bucket. You can further customize the role per your requirements. It will also create a Step Functions state machine, several AWS Lambda functions used by the state machine, and an S3 bucket for storing the redacted output versions of the transcripts.
After a few minutes, your stack will be complete, and then you can examine the Step Functions state machine that was created as part of the CloudFormation template.
Run a redaction job
To run a job, navigate to Step Functions in the AWS console, select the state machine, and choose Start execution.
Next, provide the input arguments to run the job. For the job input, provide the name of your input S3 bucket as the S3InputDataBucket value, the folder name as the S3InputDataPrefix value, the name of your output S3 bucket as the S3OutputDataBucket value, and the folder in which to store the results as the S3OutputDataPrefix value. Then choose Start execution.
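As a rough sketch, the execution input described above could be assembled and submitted programmatically as shown below. The four key names come from this post; the helper names, the bucket and folder values, and the state machine ARN are placeholders assumed for the example. The call uses the boto3 Step Functions `start_execution` operation.

```python
import json


def build_execution_input(
    input_bucket: str, input_prefix: str, output_bucket: str, output_prefix: str
) -> str:
    """Serialize the four job arguments into the JSON input for the state machine."""
    return json.dumps(
        {
            "S3InputDataBucket": input_bucket,
            "S3InputDataPrefix": input_prefix,
            "S3OutputDataBucket": output_bucket,
            "S3OutputDataPrefix": output_prefix,
        }
    )


def start_redaction_execution(state_machine_arn: str, execution_input: str) -> str:
    """Start one execution; needs boto3 and AWS credentials, so the import is deferred."""
    import boto3

    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn, input=execution_input
    )
    return response["executionArn"]


if __name__ == "__main__":
    # Placeholder values; replace them with your own bucket and folder names.
    print(
        build_execution_input(
            "text-redaction-data-111122223333",
            "sample-data",
            "text-redaction-output-111122223333",
            "redacted",
        )
    )
```

The same JSON string can be pasted into the input field of the Start execution dialog in the console.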
As the job executes, you can monitor its status in the Step Functions graph view. It will take a few minutes to run the job. Once the job is complete, you will see the output for each of the jobs in the Execution input and output section of the console. You can use the output URI to retrieve the output of a job. If multiple jobs were executed, you can copy the results of all jobs to a destination bucket for further analysis.
Let’s take a look at the redacted version of the conversation that we started with.
Agent: Hi, thank you for calling us today. Whom do I have the pleasure of speaking with today?
Caller: Hello, my name is [NAME].
Agent: Hi [NAME], how may I help you?
Caller: I haven’t received my W2 statement yet and wanted to check on its status.
Agent: Sure, I can help you with that. Can you please confirm the last four digits of your Social Security number?
Caller: Yes, it’s [SSN].
Agent: Ok. I’m pulling up the status now. I see that it was sent out yesterday, and the estimated arrival is early next week. Would you like me to turn on automated alerts so you can be notified of any delays?
Caller: Yes, please.
Agent: The number we have on file for you is [PHONE]. Is that still correct?
Caller: Yes, it is.
Agent: Great. I have turned on automated notifications. Is there anything else I can assist you with, [NAME]?
Caller: No, that’s all. Thank you.
Agent: Thank you, [NAME]. Have a great day.
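To make the replacement behavior concrete, here is a small illustrative sketch of how entity offsets map to the bracketed placeholders shown above. The redaction job performs this step for you; the `redact` helper is hypothetical and only mirrors the shape of the entities that Comprehend PII detection returns (`Type`, `BeginOffset`, `EndOffset`).

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with its entity type in square brackets."""
    redacted = text
    # Work from the end of the string backwards so that earlier offsets
    # stay valid while later spans are being replaced.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        placeholder = f"[{entity['Type']}]"
        redacted = (
            redacted[: entity["BeginOffset"]]
            + placeholder
            + redacted[entity["EndOffset"]:]
        )
    return redacted


if __name__ == "__main__":
    line = "Hello, my name is John Stiles."
    start = line.index("John Stiles")
    entities = [
        {"Type": "NAME", "BeginOffset": start, "EndOffset": start + len("John Stiles")}
    ]
    print(redact(line, entities))
```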
Clean up
To avoid ongoing charges, you may want to clean up the resources created by the CloudFormation template after you are finished. To do so, delete the deployed CloudFormation stack, and delete the S3 bucket with the sample transcript data if you created one.
Conclusion
With customers demanding seamless experiences across channels and also expecting security to be embedded at every point, the use of Step Functions and Amazon Comprehend to redact PII data in text conversation transcripts is a powerful tool at your disposal. Organizations can speed time to value by using the redacted transcripts to analyze customer service interactions and glean insights to improve the customer experience.
Try using this workflow to redact your data and leave us a comment!
About the author
Alex Emilcar is a Senior Solutions Architect in the Amazon Machine Learning Solutions Lab, where he helps customers build digital experiences with AWS AI technologies. Alex has over 10 years of technology experience working in different capacities, including developer, infrastructure engineer, and solutions architect. In his spare time, Alex likes to spend time reading and doing yard work.
AmazonNext program hosts final project presentations at Virginia HQ2
Program focuses on diversifying tech-industry talent.
Auto Machine Translation and Synchronization for “Dive into Deep Learning”
A system built on Amazon Translate reduces the workload of human translators.
Get to production-grade data faster by using new built-in interfaces with Amazon SageMaker Ground Truth Plus
Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks labels your data. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus.
Today, we are excited to announce the launch of new built-in interfaces on Ground Truth Plus. With this new capability, multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through self-serve interfaces. This enables you to accelerate the development of high-quality training datasets by reducing project set up time. Additionally, you can control fine-grained access to your data by scoping your AWS Identity and Access Management (IAM) role permissions to match your individual level of Amazon Simple Storage Service (Amazon S3) access, and you always have the option to revoke access to certain buckets.
Until now, you had to reach out to your Ground Truth Plus operations program manager (OPM) to create new data labeling projects and batches. This process had some restrictions because it allowed only one user to request a new project and batch: if multiple users within the organization were using the same AWS account, only one of them could request a new data labeling project and batch using the Ground Truth Plus console. Additionally, the process delayed the start of labeling because of multiple manual touchpoints and the troubleshooting required when issues arose. Separately, all projects used the same IAM role for accessing data. Therefore, to run projects and batches that needed access to different data sources, such as different Amazon S3 buckets, you had to rely on your Ground Truth Plus OPM to provide account-specific S3 policies, which you then had to apply manually to your S3 buckets. The entire operation was manual and added operational overhead.
This post walks you through steps to create a new project and batch, share data, and receive data using the new self-serve interfaces to efficiently kickstart the labeling process. This post assumes that you are familiar with Ground Truth Plus. For more information, see Amazon SageMaker Ground Truth Plus – Create Training Datasets Without Code or In-house Resources.
Solution overview
We demonstrate how to do the following:
- Update existing projects
- Request a new project
- Set up a project team
- Create a batch
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account
- An IAM user with access to create IAM roles
- The Amazon S3 URI of the bucket where your labeling objects are stored
Update existing projects
If you created a Ground Truth Plus project before the launch (December 9, 2022) of the new features described in this post, you need to create and share an IAM role so that you can use these features with your existing project. If you’re a new user of Ground Truth Plus, you can skip this section.
To create an IAM role, complete the following steps:
- On the IAM console, choose Create role.
- Select Custom trust policy.
- Specify the following trust relationship for the role:
- Choose Next.
- Choose Create policy.
- On the JSON tab, specify the following policy. Update the Resource property by specifying two entries for each bucket: one with just the bucket ARN, and another with the bucket ARN followed by /*. For example, replace <your-input-s3-arn> with arn:aws:s3:::my-bucket/myprefix/ and <your-input-s3-arn>/* with arn:aws:s3:::my-bucket/myprefix/*.
- Choose Next: Tags and Next: Review.
- Enter the name of the policy and an optional description.
- Choose Create policy.
- Close this tab and go back to the previous tab to create your role.
On the Add permissions tab, you should see the new policy you created (refresh the page if you don’t see it).
- Select the newly created policy and choose Next.
- Enter a name (for example, GTPlusExecutionRole) and optionally a description of the role.
- Choose Create role.
- Provide the role ARN to your Ground Truth Plus OPM, who will then update your existing project with this newly created role.
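The shape of the two policy documents used in these steps can be sketched in Python as follows. This is an assumption-laden illustration rather than the exact policy text from the console: the helper names are hypothetical, the trust policy uses the `sagemaker-ground-truth-plus.amazonaws.com` service principal named later in this post, and the S3 action list in the usage example is a guess at a typical read/write set, not the authoritative list. What the sketch does show faithfully is the two-entries-per-location rule for the Resource property.

```python
import json
from typing import Dict, List

# Service principal that this post says identifies Ground Truth Plus.
GT_PLUS_PRINCIPAL = "sagemaker-ground-truth-plus.amazonaws.com"


def build_trust_policy() -> Dict:
    """Trust policy letting the Ground Truth Plus service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": GT_PLUS_PRINCIPAL},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_s3_policy(s3_arns: List[str], actions: List[str]) -> Dict:
    """Permission policy with two Resource entries per S3 location."""
    resources: List[str] = []
    for arn in s3_arns:
        # First entry: the ARN itself. Second entry: the ARN followed by a
        # wildcard, matching the my-bucket/myprefix/ example above.
        wildcard = arn + "*" if arn.endswith("/") else arn + "/*"
        resources.extend([arn, wildcard])
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": resources}],
    }


if __name__ == "__main__":
    # Assumed action list for illustration only; use the actions from the
    # policy that Ground Truth Plus asks you to specify.
    example_actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    policy = build_s3_policy(["arn:aws:s3:::my-bucket/myprefix/"], example_actions)
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(policy, indent=2))
```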
Request a new project
To request a new project, complete the following steps:
- On the Ground Truth Plus console, navigate to the Projects section.
This is where all your projects are listed.
- Choose Request project.
The Request project page is your opportunity to provide details that will help us schedule an initial consultation call and set up your project.
- In addition to specifying general information like the project name and description, you must specify the project’s task type and whether it contains personally identifiable information (PII).
To label your data, Ground Truth Plus needs temporary access to your raw data in an S3 bucket. When the labeling process is complete, Ground Truth Plus delivers the labeling output back to your S3 bucket. This is done through an IAM role. You can either create a new role with the built-in tool, or navigate to the IAM console and create the role yourself (refer to the previous section for instructions).
- If you created the role yourself on the IAM console, choose Enter a custom IAM role ARN and enter your IAM role ARN, which is in the format arn:aws:iam::<YourAccountNumber>:role/<RoleName>.
- To use the built-in tool, on the drop-down menu under IAM Role, choose Create a new role.
- Specify the bucket location of your labeling data. If you don’t know the location of your labeling data or if you don’t have any labeling data uploaded, select Any S3 bucket, which will give Ground Truth Plus access to all your account’s buckets.
- Choose Create to create the role.
Your IAM role will allow Ground Truth Plus, identified as sagemaker-ground-truth-plus.amazonaws.com in the role’s trust policy, to run the following actions on your S3 buckets:
- Choose Request project to complete the request.
A Ground Truth Plus OPM will schedule an initial consultation call with you to discuss your data labeling project requirements and pricing.
Set up a project team
After you request a project, you need to create a project team to log in to your project portal. A project team provides access to the members from your organization or team to track projects, view metrics, and review labels. You can use the option Invite new members by email or Import members from existing Amazon Cognito user groups. In this post, we show how to import members from existing Amazon Cognito user groups to add users to your project team.
- On the Ground Truth Plus console, navigate to the Project team section.
- Choose Create project team.
- Choose Import members from existing Amazon Cognito user groups.
- Choose an Amazon Cognito user pool.
User pools require a domain and an existing user group.
- Choose an app client.
We recommend using a client generated by Amazon SageMaker.
- Choose a user group from your pool to import members.
- Choose Create project team.
You can add more team members after creating the project team by choosing Invite new members on the Members page of the Ground Truth Plus console.
Create a batch
After you have successfully submitted the project request and created a project team, you can access the Ground Truth Plus project portal by clicking Open project portal on the Ground Truth Plus console.
You can use the project portal to create batches for a project, but only after the project’s status has changed to Request approved.
- View a project’s details and batches by choosing the project name.
A page titled with the project name opens.
- In the Batches section, choose Create batch.
- Enter a batch name and optional description.
- Enter the S3 locations of the input and output datasets.
To ensure the batch is created successfully, you must meet the following requirements:
- The S3 bucket and prefix should exist, and the total number of files should be greater than 0
- The total number of objects should be less than 10,000
- The size of each object should be less than 2 GB
- The total size of all objects combined should be less than 100 GB
- The IAM role provided to create a project has permission to access the input bucket, output bucket, and S3 files that are used to create the batch
- The files under the provided S3 location for the input datasets should not be encrypted by AWS Key Management Service (AWS KMS)
- Choose Submit.
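The count and size requirements listed above lend themselves to a quick pre-flight check before you submit a batch. The sketch below is a hypothetical helper, not part of Ground Truth Plus: it takes the object sizes in bytes (for example, gathered from an S3 listing) and reports which limits are violated. It assumes 1 GB means 1024³ bytes, and it does not cover the IAM permission or AWS KMS encryption requirements.

```python
from typing import List

MAX_OBJECTS = 10_000                # object count must be below this
MAX_OBJECT_BYTES = 2 * 1024**3      # each object must be below 2 GB (assumed GiB)
MAX_TOTAL_BYTES = 100 * 1024**3     # combined size must be below 100 GB (assumed GiB)


def check_batch(object_sizes: List[int]) -> List[str]:
    """Return a list of violated requirements; an empty list means all checks pass."""
    problems: List[str] = []
    if len(object_sizes) == 0:
        problems.append("the input location contains no files")
    if len(object_sizes) >= MAX_OBJECTS:
        problems.append("the number of objects is not below 10,000")
    if any(size >= MAX_OBJECT_BYTES for size in object_sizes):
        problems.append("at least one object is not below 2 GB")
    if sum(object_sizes) >= MAX_TOTAL_BYTES:
        problems.append("the combined size is not below 100 GB")
    return problems


if __name__ == "__main__":
    # Ten small files pass every check, so this prints an empty list.
    print(check_batch([1024] * 10))
```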
Your batch status will show as Request submitted. After Ground Truth Plus has temporary access to your data, AWS experts will set up data labeling workflows and operate them on your behalf, which will change the batch status to In-progress. When the labeling is complete, the batch status changes from In-progress to Ready for review. If you want to review your labels before receiving them, choose Review batch. From there, you can choose Accept batch to receive your labeled data.
Conclusion
This post showed you how multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through new self-serve interfaces. This new capability allows you to kickstart your labeling projects faster and reduces operational overhead. We also demonstrated how you can control fine-grained access to data by scoping your IAM role permissions to match your individual level of access.
We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!
About the authors
Manish Goel is the Product Manager for Amazon SageMaker Ground Truth Plus. He is focused on building products that make it easier for customers to adopt machine learning. In his spare time, he enjoys road trips and reading books.
Karthik Ganduri is a Software Development Engineer at AWS, where he works on building ML tools for customers and internal solutions. Outside of work, he enjoys clicking pictures.
Zhuling Bai is a Software Development Engineer at AWS. She works on developing large-scale distributed systems to solve machine learning problems.
Aatef Baransy is a Frontend Engineer at AWS. He writes fast, reliable, and thoroughly tested software to nurture and grow the industry’s most cutting-edge AI applications.
Mohammad Adnan is a Senior Engineer for AI and ML at AWS. He was part of many AWS service launches, notably Amazon Lookout for Metrics and AWS Panorama. Currently, he is focusing on AWS human-in-the-loop offerings (Amazon SageMaker Ground Truth, Ground Truth Plus, and Augmented AI). He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn, mohammad-adnan-6a99a829.
Popular deep-learning book from Amazon authors gets update
Google JAX Python library implementation and new topics added; volume 1 of book to be published by Cambridge University Press.
Nine teams selected for Alexa Prize SocialBot Grand Challenge
Fifth challenge adds new elements and features four new competitors for the $1 million research grant.