For the World to See: Nonprofit Deploys GPU-Powered Simulators to Train Providers in Sight-Saving Surgery
GPU-powered surgical-simulation devices are helping train more than 2,000 doctors a year in lower-income countries to treat cataract blindness, the world’s leading cause of blindness, thanks to the nonprofit HelpMeSee.
While cataract surgery has a success rate of around 99%, many patients in low- and middle-income countries lack access to the common procedure due to a severe shortage of ophthalmologists. An estimated 90% of the 100 million people affected by cataract-related visual impairment or blindness are in these locations.
By training more healthcare providers — including those without a specialty in ophthalmology — to treat cataracts, HelpMeSee improves the quality of life for patients such as a mother of two young children in Bhiwandi, near Mumbai, India, who was blinded by cataracts in both eyes.
“After the surgery, her vision improved dramatically and she was able to take up a job, changing the course of her entire family,” said Dr. Chetan Ahiwalay, chief instructor and subject-matter expert for HelpMeSee in India. “She and her husband are now happily raising their kids and leading a healthy life. These are the things that keep us going as doctors.”
HelpMeSee’s simulator devices use NVIDIA RTX GPUs to render high-quality visuals, providing a more realistic training environment for doctors to hone their surgical skills. To further improve the trainee experience, NVIDIA experts are working with the HelpMeSee team to improve rendering performance, increase visual realism and augment the simulator with next-generation technologies such as real-time ray tracing and AI.
Tackling Treatable Blindness With Accessible Training
High-income countries have 18x more ophthalmologists per million residents than low-income countries. That coverage gap, which is far wider still in certain countries, makes it harder for those in thinly resourced areas to receive treatment for avoidable blindness.
HelpMeSee’s devices can train doctors on multiple eye procedures using immersive tools inspired by flight simulators used in aviation. The team trains doctors in countries including India, China, Madagascar, Mexico and the U.S., and rolls out multilingual training each year for new procedures.
The eye surgery simulator offers realistic 3D visuals, haptic feedback, performance scores and the opportunity to attempt a step of the procedure multiple times until the trainee achieves proficiency. Qualified instructors like Dr. Ahiwalay travel to rural and urban areas to deliver the training through structured courses — and help surgeons transition from the simulators to live surgeries.

“We’re lowering the barrier for healthcare practitioners to learn these specific skills that can have a profound impact on patients,” said Dr. Bonnie An Henderson, CEO of HelpMeSee, which is based in New York. “Simulation-based training will improve surgical skills while keeping patients safe.”
Looking Ahead to AI, Advanced Rendering
HelpMeSee works with Surgical Science, a supplier of medical virtual-reality simulators, based in Gothenburg, Sweden, to develop the 3D models and real-time rendering for its devices. Other collaborators — Strasbourg, France-based InSimo and Pune, India-based Harman Connected Services — develop the physics-based simulations and user interface, respectively.
“Since there are many crucial visual cues during eye surgery, the simulation requires high fidelity,” said Sebastian Ullrich, senior manager of software development at Surgical Science, who has worked with HelpMeSee for years. “To render a realistic 3D representation of the human eye, we use custom shader materials with high-resolution textures to represent various anatomical components, mimic optical properties such as refraction, use order-independent transparency sorting and employ volume rendering.”
NVIDIA RTX GPUs support 3D volume rendering, stereoscopic rendering and depth sorting algorithms that provide a realistic visual experience for HelpMeSee’s trainees. Working with NVIDIA, the team is investigating AI models that could provide trainees with a real-time analysis of the practice procedure and offer recommendations for improvement.
Watch a demo of HelpMeSee’s cataract surgery training simulation.
Eureka! NVIDIA Research Breakthrough Puts New Spin on Robot Learning
A new AI agent developed by NVIDIA Research that can teach robots complex skills has trained a robotic hand to perform rapid pen-spinning tricks — for the first time as well as a human can.
The stunning prestidigitation, showcased in the video above, is one of nearly 30 tasks that robots have learned to expertly accomplish thanks to Eureka, which autonomously writes reward algorithms to train bots.
Eureka has also taught robots to open drawers and cabinets, toss and catch balls, and manipulate scissors, among other tasks.
The Eureka research, published today, includes a paper and the project’s AI algorithms, which developers can experiment with using NVIDIA Isaac Gym, a physics simulation reference application for reinforcement learning research. Isaac Gym is built on NVIDIA Omniverse, a development platform for building 3D tools and applications based on the OpenUSD framework. Eureka itself is powered by the GPT-4 large language model.
“Reinforcement learning has enabled impressive wins over the last decade, yet many challenges still exist, such as reward design, which remains a trial-and-error process,” said Anima Anandkumar, senior director of AI research at NVIDIA and an author of the Eureka paper. “Eureka is a first step toward developing new algorithms that integrate generative and reinforcement learning methods to solve hard tasks.”
AI Trains Robots
Eureka-generated reward programs — which enable trial-and-error learning for robots — outperform expert human-written ones on more than 80% of tasks, according to the paper. This leads to an average performance improvement of more than 50% for the bots.
Robot arm taught by Eureka to open a drawer.
The AI agent taps the GPT-4 LLM and generative AI to write software code that rewards robots for reinforcement learning. It doesn’t require task-specific prompting or predefined reward templates — and readily incorporates human feedback to modify its rewards for results more accurately aligned with a developer’s vision.
Using GPU-accelerated simulation in Isaac Gym, Eureka can quickly evaluate the quality of large batches of reward candidates for more efficient training.
Eureka then constructs a summary of the key stats from the training results and instructs the LLM to improve its generation of reward functions. In this way, the AI is self-improving. It’s taught all kinds of robots — quadruped, bipedal, quadrotor, dexterous hands, cobot arms and others — to accomplish all kinds of tasks.
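At a high level, that loop can be sketched in a few lines of Python. The sketch below is conceptual rather than NVIDIA's actual implementation; the llm_propose, train_in_sim and summarize callables are hypothetical stand-ins for the GPT-4 call, GPU-accelerated training in Isaac Gym and the summary step described above.

```python
def eureka_reward_search(task_description, env_source_code,
                         llm_propose, train_in_sim, summarize, iterations=5):
    """Conceptual sketch of the Eureka-style reward-design loop (not NVIDIA's code).

    llm_propose(task, env_code, feedback) -> list of candidate reward functions
    train_in_sim(reward_fn)               -> training result with a .task_score attribute
    summarize(results)                    -> text summary of key training statistics
    """
    feedback = ""                          # no summary available on the first round
    best_fn, best_score = None, float("-inf")

    for _ in range(iterations):
        # 1. The LLM writes several candidate reward functions as code.
        candidates = llm_propose(task_description, env_source_code, feedback)

        # 2. Each candidate is evaluated by training a policy with it in
        #    GPU-accelerated simulation (done in parallel in practice).
        results = [train_in_sim(fn) for fn in candidates]

        # 3. Keep the best-performing reward function found so far.
        for fn, result in zip(candidates, results):
            if result.task_score > best_score:
                best_fn, best_score = fn, result.task_score

        # 4. Summarize key training statistics and feed them back to the LLM
        #    so the next batch of reward functions can improve on this one.
        feedback = summarize(results)

    return best_fn
```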
The research paper provides in-depth evaluations of 20 Eureka-trained tasks, based on open-source dexterity benchmarks that require robotic hands to demonstrate a wide range of complex manipulation skills.
The results from nine Isaac Gym environments are showcased in visualizations generated using NVIDIA Omniverse.
Humanoid robot learns a running gait via Eureka.
“Eureka is a unique combination of large language models and NVIDIA GPU-accelerated simulation technologies,” said Linxi “Jim” Fan, senior research scientist at NVIDIA, who’s one of the project’s contributors. “We believe that Eureka will enable dexterous robot control and provide a new way to produce physically realistic animations for artists.”
It’s breakthrough work bound to get developers’ minds spinning with possibilities, adding to recent NVIDIA Research advancements like Voyager, an AI agent built with GPT-4 that can autonomously play Minecraft.
NVIDIA Research comprises hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.
Learn more about Eureka and NVIDIA Research.
Announcing Rekognition Custom Moderation: Enhance accuracy of pre-trained Rekognition moderation models with your data
Companies increasingly rely on user-generated images and videos for engagement. From ecommerce platforms encouraging customers to share product images to social media companies promoting user-generated videos and images, using user content for engagement is a powerful strategy. However, it can be challenging to ensure that this user-generated content is consistent with your policies and fosters a safe online community for your users.
Many companies currently depend on human moderators or respond reactively to user complaints to manage inappropriate user-generated content. These approaches don’t scale to effectively moderate millions of images and videos at sufficient quality or speed, which leads to a poor user experience, high costs to achieve scale, or even potential harm to brand reputation.
In this post, we discuss how to use the Custom Moderation feature in Amazon Rekognition to enhance the accuracy of your pre-trained content moderation API.
Content moderation in Amazon Rekognition
Amazon Rekognition is a managed artificial intelligence (AI) service that offers pre-trained and customizable computer vision capabilities to extract information and insights from images and videos. One such capability is Amazon Rekognition Content Moderation, which detects inappropriate or unwanted content in images and videos. Amazon Rekognition uses a hierarchical taxonomy to label inappropriate or unwanted content with 10 top-level moderation categories (such as violence, explicit, alcohol, or drugs) and 35 second-level categories. Customers across industries such as ecommerce, social media, and gaming can use content moderation in Amazon Rekognition to protect their brand reputation and foster safe user communities.
By using Amazon Rekognition for image and video moderation, human moderators have to review a much smaller set of content, typically 1–5% of the total volume, already flagged by the content moderation model. This enables companies to focus on more valuable activities and still achieve comprehensive moderation coverage at a fraction of their existing cost.
Introducing Amazon Rekognition Custom Moderation
You can now enhance the accuracy of the Rekognition moderation model for your business-specific data with the Custom Moderation feature. You can train a custom adapter with as few as 20 annotated images in less than 1 hour. These adapters extend the capabilities of the moderation model so it detects the kinds of images used for training with higher accuracy. For this post, we use a sample dataset containing both safe images and images with alcoholic beverages (considered unsafe) to enhance the accuracy of the alcohol moderation label.
The unique ID of the trained adapter can be provided to the existing DetectModerationLabels API operation to process images using this adapter. Each adapter can only be used by the AWS account that was used for training the adapter, ensuring that the data used for training remains safe and secure in that AWS account. With the Custom Moderation feature, you can tailor the Rekognition pre-trained moderation model for improved performance on your specific moderation use case, without any machine learning (ML) expertise. You can continue to enjoy the benefits of a fully managed moderation service with a pay-per-use pricing model for Custom Moderation.
Solution overview
Training a custom moderation adapter involves five steps that you can complete using the AWS Management Console or the API interface:
- Create a project
- Upload the training data
- Assign ground truth labels to images
- Train the adapter
- Use the adapter
Let’s walk through these steps in more detail using the console.
Create a project
A project is a container to store your adapters. You can train multiple adapters within a project with different training datasets to assess which adapter performs best for your specific use case. To create your project, complete the following steps:
- On the Amazon Rekognition console, choose Custom Moderation in the navigation pane.
- Choose Create project.
- For Project name, enter a name for your project.
- For Adapter name, enter a name for your adapter.
- Optionally, enter a description for your adapter.
Upload training data
You can begin with as few as 20 sample images to adapt the moderation model to detect fewer false positives (images that are appropriate for your business but are flagged by the model with a moderation label). To reduce false negatives (images that are inappropriate for your business but don’t get flagged with a moderation label), you are required to start with 50 sample images.
You can select from the following options to provide the image datasets for adapter training:
- Import a manifest file with labeled images as per the Amazon Rekognition content moderation taxonomy.
- Import images from an Amazon Simple Storage Service (Amazon S3) bucket and provide labels. Make sure that the AWS Identity and Access Management (IAM) user or role has the appropriate access permissions to the specified S3 bucket folder.
- Upload images from your computer and provide labels.
Complete the following steps:
- For this post, select Import images from S3 bucket and enter your S3 URI.
Like any ML training process, training a Custom Moderation adapter in Amazon Rekognition requires two separate datasets: one for training the adapter and another for evaluating the adapter. You can either upload a separate test dataset or choose to automatically split your training dataset for training and testing.
- For this post, select Autosplit.
- Select Enable auto-update to ensure that the system automatically retrains the adapter when a new version of the content moderation model is launched.
- Choose Create project.
Assign ground truth labels to images
If you uploaded unannotated images, you can use the Amazon Rekognition console to provide image labels as per the moderation taxonomy. In the following example, we train an adapter to detect hidden alcohol with higher accuracy, and label all such images with the label alcohol. Images not considered inappropriate can be labeled as Safe.
Train the adapter
After you label all the images, choose Start training to initiate the training process. Amazon Rekognition will use the uploaded image datasets to train an adapter model for enhanced accuracy on the specific type of images provided for training.
After the custom moderation adapter is trained, you can view all the adapter details (adapterID, test and training manifest files) in the Adapter performance section.
The Adapter performance section displays improvements in false positives and false negatives when compared to the pre-trained moderation model. The adapter we trained to enhance the detection of the alcohol label reduces the false negative rate in test images by 73%. In other words, the adapter now accurately predicts the alcohol moderation label for 73% more images compared to the pre-trained moderation model. However, no improvement is observed in false positives, as no false positive samples were used for training.
Use the adapter
You can perform inference using the newly trained adapter to achieve enhanced accuracy. To do this, call the Amazon Rekognition DetectModerationLabels API with an additional parameter, ProjectVersion, which is the unique AdapterID of the adapter. You can make this call from the AWS Command Line Interface (AWS CLI) or from an AWS SDK.
The following is a sample code snippet using the Python Boto3 library:
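The snippet below is a minimal sketch; the S3 bucket, image key and ProjectVersion value are placeholders to replace with your own adapter details.

```python
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_moderation_labels(
    # Placeholder bucket and image key.
    Image={"S3Object": {"Bucket": "YOUR_BUCKET", "Name": "images/sample.jpg"}},
    ProjectVersion="YOUR_ADAPTER_ID",  # unique ID of the trained adapter (placeholder)
    MinConfidence=50,
)

for label in response["ModerationLabels"]:
    print(label["Name"], label["ParentName"], round(label["Confidence"], 2))
```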
Best practices for training
To maximize the performance of your adapter, the following best practices are recommended for training the adapter:
- The sample image data should capture the representative errors that you want to improve the moderation model accuracy for
- Instead of only bringing in error images for false positives and false negatives, you can also provide true positives and true negatives for improved performance
- Supply as many annotated images as possible for training
Conclusion
In this post, we presented an in-depth overview of the new Amazon Rekognition Custom Moderation feature. Furthermore, we detailed the steps for performing training using the console, including best practices for optimal results. For additional information, visit the Amazon Rekognition console and explore the Custom Moderation feature.
Amazon Rekognition Custom Moderation is now generally available in all AWS Regions where Amazon Rekognition is available.
Learn more about content moderation on AWS. Take the first step towards streamlining your content moderation operations with AWS.
About the Authors
Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.
Aakash Deep is a Software Development Engineering Manager based in Seattle. He enjoys working on computer vision, AI, and distributed systems. His mission is to enable customers to address complex problems and create value with AWS Rekognition. Outside of work, he enjoys hiking and traveling.
Lana Zhang is a Senior Solutions Architect on the AWS WWSO AI Services team, specializing in AI and ML for content moderation, computer vision, natural language processing, and generative AI. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, and advertising and marketing.
Next-Level Computing: NVIDIA and AMD Deliver Powerful Workstations to Accelerate AI, Rendering and Simulation
To enable professionals worldwide to build and run AI applications right from their desktops, NVIDIA and AMD are powering a new line of workstations equipped with NVIDIA RTX Ada Generation GPUs and AMD Ryzen Threadripper PRO 7000 WX-Series CPUs.
Bringing together the highest levels of AI computing, rendering and simulation capabilities, these new platforms enable professionals to efficiently tackle the most resource-intensive, large-scale AI workflows locally.
Bringing AI Innovation to the Desktop
Advanced AI tasks typically require data-center-level performance. Training a large language model with a trillion parameters, for example, takes thousands of GPUs running for weeks, though research is underway to reduce model size and enable model training on smaller systems while still maintaining high levels of AI model accuracy.
The new NVIDIA RTX GPU and AMD CPU-powered AI workstations provide the power and performance required for training such smaller models and for local fine-tuning, helping to offload data center and cloud resources for AI development tasks. The devices let users select single- or multi-GPU configurations as required for their workloads.
Smaller trained AI models also provide the opportunity to use workstations for local inferencing. RTX GPU and AMD CPU-powered workstations can be configured to run these smaller AI models for inference serving for small workgroups or departments.
With up to 48GB of memory in a single NVIDIA RTX GPU, these workstations offer a cost-effective way to reduce compute load on data centers. And when professionals do need to scale training and deployment from these workstations to data centers or the cloud, the NVIDIA AI Enterprise software platform enables seamless portability of workflows and toolchains.
RTX GPU and AMD CPU-powered workstations also enable cutting-edge visual workflows. With accelerated computing power, the new workstations enable highly interactive content creation, industrial digitalization, and advanced simulation and design.
Unmatched Power, Performance and Flexibility
AMD Ryzen Threadripper PRO 7000 WX-Series processors provide the CPU platform for the next generation of demanding workloads. The processors deliver a significant increase in core count — up to 96 cores per CPU — and industry-leading maximum memory bandwidth in a single socket.
Combining them with the latest NVIDIA RTX Ada Generation GPUs brings unmatched power and performance in a workstation. The GPUs enable up to 2x the performance in ray tracing, AI processing, graphics rendering and computational tasks compared to the previous generation.
Ada Generation GPU options include the RTX 4000 SFF, RTX 4000, RTX 4500, RTX 5000 and RTX 6000. They’re built on the NVIDIA Ada Lovelace architecture and feature up to 142 third-generation RT Cores, 568 fourth-generation Tensor Cores and 18,176 latest-generation CUDA cores.
From architecture and manufacturing to media and entertainment and healthcare, professionals across industries will be able to use the new workstations to tackle challenging AI computing workloads — along with 3D rendering, product visualization, simulation and scientific computing tasks.
Availability
New workstations powered by NVIDIA RTX Ada Generation GPUs and the latest AMD Threadripper Pro processors will be available starting next month from BOXX and HP, with other system integrators offering them soon.
NVIDIA AI Now Available in Oracle Cloud Marketplace
Training generative AI models just got easier.
NVIDIA DGX Cloud AI supercomputing platform and NVIDIA AI Enterprise software are now available in Oracle Cloud Marketplace, making it possible for Oracle Cloud Infrastructure customers to access high-performance accelerated computing and software to run secure, stable and supported production AI in just a few clicks.
The addition — an industry first — brings new capabilities for end-to-end development and deployment on Oracle Cloud. Enterprises can get started from the Oracle Cloud Marketplace to train models on DGX Cloud, and then deploy their applications on OCI with NVIDIA AI Enterprise.
Oracle Cloud and NVIDIA Lift Industries Into Era of AI
Thousands of enterprises around the world rely on OCI to power the applications that drive their businesses. Its customers include leaders across industries such as healthcare, scientific research, financial services, telecommunications and more.
Oracle Cloud Marketplace is a catalog of solutions that offers customers flexible consumption models and simple billing. Its addition of DGX Cloud and NVIDIA AI Enterprise lets OCI customers use their existing cloud credits to integrate NVIDIA’s leading AI supercomputing platform and software into their development and deployment pipelines.
With DGX Cloud, OCI customers can train models for generative AI applications like intelligent chatbots, search, summarization and content generation.
The University at Albany, in upstate New York, recently launched its AI Plus initiative, which is integrating teaching and learning about AI across the university’s research and academic enterprise, in fields such as cybersecurity, weather prediction, health data analytics, drug discovery and next-generation semiconductor design. It will also foster collaborations across the humanities, social sciences, public policy and public health. The university is using DGX Cloud AI supercomputing instances on OCI as it builds out an on-premises supercomputer.
“We’re accelerating our mission to infuse AI into virtually every academic and research discipline,” said Thenkurussi (Kesh) Kesavadas, vice president for research and economic development at UAlbany. “We will drive advances in healthcare, security and economic competitiveness, while equipping students for roles in the evolving job market.”
NVIDIA AI Enterprise brings the software layer of the NVIDIA AI platform to OCI. It includes NVIDIA NeMo frameworks for building LLMs, NVIDIA RAPIDS for data science and NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server for supercharging production AI. NVIDIA software for cybersecurity, computer vision, speech AI and more is also included. Enterprise-grade support, security and stability ensure a smooth transition of AI projects from pilot to production.

AI Supercomputing Platform Hosted by OCI
NVIDIA DGX Cloud provides enterprises immediate access to an AI supercomputing platform and software.
Hosted by OCI, DGX Cloud provides enterprises with access to multi-node training on NVIDIA GPUs, paired with NVIDIA AI software, for training advanced models for generative AI and other groundbreaking applications.
Each DGX Cloud instance consists of eight NVIDIA Tensor Core GPUs interconnected with network fabric, purpose-built for multi-node training. This high-performance computing architecture also includes industry-leading AI development software and offers direct access to NVIDIA AI expertise so businesses can train LLMs faster.
OCI customers access DGX Cloud using NVIDIA Base Command Platform, which gives developers access to an AI supercomputer through a web browser. By providing a single-pane view of the customer’s AI infrastructure, Base Command Platform simplifies the management of multinode clusters.

Software for Secure, Stable and Supported Production AI
NVIDIA AI Enterprise enables rapid development and deployment of AI and data science.
With NVIDIA AI Enterprise on Oracle Cloud Marketplace, enterprises can efficiently build an application once and deploy it on OCI and their on-prem infrastructure, making a multi- or hybrid-cloud strategy cost-effective and easy to adopt. Since NVIDIA AI Enterprise is also included in NVIDIA DGX Cloud, customers can streamline the transition from training on DGX Cloud to deploying their AI application into production with NVIDIA AI Enterprise on OCI, since the AI software runtime is consistent across the environments.
Qualified customers can purchase NVIDIA AI Enterprise and NVIDIA DGX Cloud with their existing Oracle Universal Credits.
Visit NVIDIA AI Enterprise and NVIDIA DGX Cloud on the Oracle Cloud Marketplace to get started today.
Simulated Spotify Listening Experiences for Reinforcement Learning with TensorFlow and TF-Agents
Posted by Surya Kanoria, Joseph Cauteruccio, Federico Tomasi, Kamil Ciosek, Matteo Rinaldi, and Zhenwen Dai – Spotify
Introduction
Many of our music recommendation problems involve providing users with ordered sets of items that satisfy users’ listening preferences and intent at that point in time. We base current recommendations on previous interactions with our application and, in the abstract, are faced with a sequential decision making process as we continually recommend content to users.
Reinforcement Learning (RL) is an established tool for sequential decision making that can be leveraged to solve sequential recommendation problems. We decided to explore how RL could be used to craft listening experiences for users. Before we could start training Agents, we needed to pick an RL library that allowed us to easily prototype, test, and potentially deploy our solutions.
At Spotify we leverage TensorFlow and the extended TensorFlow Ecosystem (TFX, TensorFlow Serving, and so on) as part of our production Machine Learning Stack. We made the decision early on to leverage TensorFlow Agents as our RL Library of choice, knowing that integrating our experiments with our production systems would be vastly more efficient down the line.
One missing bit of technology we required was an offline Spotify environment we could use to prototype, analyze, explore, and train Agents offline prior to online testing. The flexibility of the TF-Agents library, coupled with the broader advantages of TensorFlow and its ecosystem, allowed us to cleanly design a robust and extendable offline Spotify simulator.
We based our simulator design on TF-Agents Environment primitives and using this simulator we developed, trained and evaluated sequential models for item recommendations, vanilla RL Agents (PPG, DQN) and a modified deep Q-Network, which we call the Action-Head DQN (AH-DQN), that addressed the specific challenges imposed by the large state and action space of our RL formulation.
Through live experiments we were able to show that our offline performance estimates were strongly correlated with online results. This then opened the door for large scale experimentation and application of Reinforcement Learning across Spotify, enabled by the technological foundations unlocked by TensorFlow and TF-Agents.
In this post we’ll provide more details about our RL problem and how we used TF-Agents to enable this work end to end.
The RL Loop and Simulated Users

In our case the reward emitted from the environment is the response of a user to music recommendations driven by the Agent’s action. In the absence of a simulator we would need to expose real users to Agents to observe rewards. We utilize a model-based RL approach to avoid letting an untrained Agent interact with real users (with the potential of hurting user satisfaction in the training process).
In this model-based RL formulation the Agent is not trained online against real users. Instead, it makes use of a user model that predicts responses to a list of tracks derived via the Agent’s action. Using this model we optimize actions in such a way as to maximize a (simulated) user satisfaction metric. During the training phase the environment makes use of this user model to return a predicted user response to the action recommended by the Agent.
We use Keras to design and train our user model. The serialized user model is then unpacked by the simulator and used to calculate rewards during Agent training and evaluation.
Simulator Design
In the abstract, what we needed to build was clear. We needed a way to simulate user listening sessions for the Agent. Given a simulated user and some content, instantiate a listening session and let the Agent drive recommendations in that session. Allow the simulated user to “react” to these recommendations and let the Agent adjust its strategy based on this result to drive some expected cumulative reward.
The TensorFlow Agents environment design guided us in developing the modular components of our system, each of which was responsible for different parts of the overall simulation.
In our codebase we define an environment abstraction that requires the following be defined for every concrete instantiation:
class AbstractEnvironment(ABC): ...
Set-Up
Hypothetical user listening sessions are sampled for each episode via _episode_sampler. As mentioned, we also need to provide the simulator with a trained user model, in this case via _user_model.
Actions and Observations
Just like any Agent environment, our simulator requires that we specify the action_spec and observation_spec. Actions in our case may be continuous or discrete depending both on our Agent selection and how we propose to translate an Agent’s action into actual recommendations. We typically recommend ordered lists of items drawn from a pool of potential items. Formulating this action space directly would lead to it being combinatorially complex. We also assume the user will interact with multiple items, and as such previous work in this area that relies on single-choice assumptions doesn’t apply.
Item recommendations are produced via _track_sampler. The “example play modes” proposed by the episode sampler contain information on items that can be presented to the simulated user. The track sampler consumes these and the Agent’s action and returns actual item recommendations.
Termination and Reset
We also need to handle the episode termination dynamics. In our simulator, the reset rules are set by the model builder and based on empirical investigations of interaction data relevant to a specific music listening experience. As a hypothetical, we may determine that 92% of listening sessions terminate after 6 sequential track skips and we’d construct our simulation termination logic to match. Handling termination also requires that we design abstractions in our simulator that allow us to check whether the episode should be terminated after each step.
When the episode is reset the simulator will sample a new hypothetical user listening session pair and begin the next episode.
Episode Steps
As with standard TF Agents Environments we need to define the step dynamics for our simulation. We have optional dynamics of the simulation that we need to make sure are enforced at each step. For example, we may desire that the same item cannot be recommended more than once. If the Agent’s action indicates a recommendation of an item that was previously recommended we need to build in the functionality to pick the next best item based on this action.
We also need to call the termination (and other supporting functions) mentioned above as needed at each step.
Episode Storage and Replay
The functionality mentioned up until this point collectively created a very complex simulation setup. While the TF Agents replay buffer provided us with the functionality required to store episodes for Agent training and evaluation, we quickly realized the need to be able to store more episode data for debugging purposes, and more detailed evaluations specific to our simulation distinct from standard Agent performance measures.
We thus allowed for the inclusion of an expanded _episode_tracker that would store additional information about the user model predictions, information noting the sampled users/content pairs, and more.
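Taken together, the abstraction can be pictured roughly as follows. This is a simplified sketch based on the members described above rather than Spotify’s actual class; the constructor signature, method names and comments are assumptions.

```python
from abc import ABC, abstractmethod


class AbstractEnvironment(ABC):
    """Simplified sketch of the offline simulator abstraction described above."""

    def __init__(self, episode_sampler, user_model, track_sampler, episode_tracker=None):
        self._episode_sampler = episode_sampler  # samples hypothetical user/session pairs
        self._user_model = user_model            # trained Keras model predicting user responses
        self._track_sampler = track_sampler      # maps an Agent's action to track recommendations
        self._episode_tracker = episode_tracker  # optional extra logging for debugging/evaluation

    @abstractmethod
    def action_spec(self):
        """Describe the Agent's action space (continuous or discrete)."""

    @abstractmethod
    def observation_spec(self):
        """Describe what the Agent observes at each step."""

    @abstractmethod
    def reset(self):
        """Sample a new user/listening-session pair and begin the next episode."""

    @abstractmethod
    def step(self, action):
        """Translate the action into recommendations, query the user model for a
        predicted response, compute the reward, and check the termination rules."""
```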
Creating TF-Agent Environments
Our environment abstraction gives us a template that matches that of a standard TF-Agents Environment class. Some inputs to our environment need to be resolved before we can actually create the concrete TF-Agents environment instance. This happens in three steps.
First we define a specific simulation environment that conforms to our abstraction. For example:
class PlaylistEnvironment(AbstractEnvironment): ...
Next we use an Environment Builder class that takes as input a user model, track sampler, etc., and an environment class like PlaylistEnvironment. The builder creates a concrete instance of this environment:
self.playlist_env: PlaylistEnvironment = environment_ctor(...)
Lastly, we utilize a conversion class that constructs a TF-Agents Environment from a concrete instance of ours:
class TFAgtPyEnvironment(py_environment.PyEnvironment): ...
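Fleshed out, a minimal version of such a conversion wrapper might look like the sketch below; it assumes the wrapped environment exposes the action_spec, observation_spec, reset and step methods sketched earlier, with step returning an observation, a reward and a done flag.

```python
from tf_agents.environments import py_environment
from tf_agents.trajectories import time_step as ts


class TFAgtPyEnvironment(py_environment.PyEnvironment):
    """Sketch of a conversion wrapper from our simulator to a TF-Agents environment."""

    def __init__(self, environment):
        super().__init__()
        self._environment = environment  # a concrete AbstractEnvironment instance

    def action_spec(self):
        return self._environment.action_spec()

    def observation_spec(self):
        return self._environment.observation_spec()

    def _reset(self):
        observation = self._environment.reset()
        return ts.restart(observation)

    def _step(self, action):
        observation, reward, done = self._environment.step(action)
        if done:
            return ts.termination(observation, reward)
        return ts.transition(observation, reward, discount=1.0)
```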
This is then executed internally to our Environment Builder:
class EnvironmentBuilder(AbstractEnvironmentBuilder): ...

We next discuss how we used our simulator to train RL Agents to generate Playlists.
A Customized Agent for Playlist Generation
As mentioned, Reinforcement Learning provides us with a method set that naturally accommodates the sequential nature of music listening; allowing us to adapt to users’ ever evolving preferences as sessions progress.
One specific problem we can attempt to use RL to solve is that of automatic music playlist generation. Given a (large) set of tracks, we want to learn how to create one optimal playlist to recommend to the user in order to maximize satisfaction metrics. Our use case is different from standard slate recommendation tasks, where usually the target is to select at most one item in the sequence. In our case, we assume we have a user-generated response for multiple items in the slate, making slate recommendation systems not directly applicable. Another complication is that the set of tracks from which recommendations are drawn is ever changing.

Experiments In Brief
We observed this directional alignment between offline performance estimates and online results for numerous naive, heuristic, model-driven, and RL policies.
Please refer to our KDD paper for more information on the specifics of our model-based RL approach and Agent design.
Acknowledgements
We’d like to thank all our Spotify teammates past and present who contributed to this work. Particularly, we’d like to thank Mehdi Ben Ayed for his early work in helping to develop our RL codebase. We’d also like to thank the TensorFlow Agents team for their support and encouragement throughout this project (and for the library that made it possible).
Defect detection in high-resolution imagery using two-stage Amazon Rekognition Custom Labels models
High-resolution imagery is very prevalent in today’s world, from satellite imagery to drones and DLSR cameras. From this imagery, we can capture damage due to natural disasters, anomalies in manufacturing equipment, or very small defects such as defects on printed circuit boards (PCBs) or semiconductors. Building anomaly detection models using high-resolution imagery can be challenging because modern computer vision models typically resize images to a lower resolution to fit into memory for training and running inference. Reducing the image resolution significantly means that visual information relating to the defect is degraded or completely lost.
One approach to overcome these challenges is to build two-stage models. Stage 1 models detect a region of interest, and Stage 2 models detect defects on the cropped region of interest, thereby maintaining sufficient resolution for small defects.
In this post, we go over how to build an effective two-stage defect detection system using Amazon Rekognition Custom Labels and compare results for this specific use case with one-stage models. Note that several one-stage models are effective even at lower or resized image resolutions, and others may accommodate large images in smaller batches.
Solution overview
For our use case, we use a dataset of images of PCBs with synthetically generated missing hole pins, as shown in the following example.
We use this dataset to demonstrate that a one-stage approach using object detection results in subpar detection performance for the missing hole pin defects. A two-stage model is preferred, in which we use Rekognition Custom Labels first for object detection to identify the pins and then a second-stage model to classify cropped images of the pins into pins with missing holes or normal pins.
The training process for a Rekognition Custom Labels model consists of several steps, as illustrated in the following diagram.
First, we use Amazon Simple Storage Service (Amazon S3) to store the image data. The data is ingested in Amazon SageMaker Jupyter notebooks, where typically a data scientist will inspect the images and preprocess them, removing any images that are of poor quality (such as blurred images or poor lighting conditions) and resizing or cropping them. The data is then split into training and test sets, and Amazon SageMaker Ground Truth labeling jobs are run to label the two sets of images and output train and test manifest files. The manifest files are used by Rekognition Custom Labels for training.
One-stage model approach
The first approach we take to identifying missing holes on the PCB is to label the missing holes and train an object detection model to identify the missing holes. The following is an image example from the dataset.
We train a model with 95 images used for training and 20 images used for testing. The following table summarizes our results.
Evaluation Results

| F1 Score | Average Precision | Overall Recall |
| --- | --- | --- |
| 0.468 | 0.750 | 0.340 |

| Training Time | Training Dataset | Testing Dataset |
| --- | --- | --- |
| Trained in 1.791 hours | 1 label, 95 images | 1 label, 20 images |

Per Label Performance

| Label Name | F1 Score | Test Images | Precision | Recall | Assumed Threshold |
| --- | --- | --- | --- | --- | --- |
| missing_hole | 0.468 | 20 | 0.750 | 0.340 | 0.053 |
The resulting model has high precision but low recall, meaning that when we localize a region for a missing hole, we’re usually correct, but we’re missing a lot of the missing holes that are present on the PCB. To build an effective defect detection system, we need to improve recall. The low performance of this model may be due to the defects being small relative to the high-resolution image of the PCB and the model having no reference for what a healthy pin looks like.
Next, we explore splitting the image into four or six crops depending on the PCB size and labeling both healthy and missing holes. The following is an example of the resulting cropped image.
We train a model with 524 images used for training and 106 images used for testing, maintaining the same train/test PCB split as the full-board model. The results for cropped healthy pins vs. missing holes are shown in the following table.
Evaluation Results

| F1 Score | Average Precision | Overall Recall |
| --- | --- | --- |
| 0.967 | 0.989 | 0.945 |

| Training Time | Training Dataset | Testing Dataset |
| --- | --- | --- |
| Trained in 2.118 hours | 2 labels, 524 images | 2 labels, 106 images |

Per Label Performance

| Label Name | F1 Score | Test Images | Precision | Recall | Assumed Threshold |
| --- | --- | --- | --- | --- | --- |
| missing_hole | 0.949 | 42 | 0.980 | 0.920 | 0.536 |
| pin | 0.984 | 106 | 0.998 | 0.970 | 0.696 |
Both precision and recall have improved significantly. Training the model on zoomed-in cropped images and giving it a reference for healthy pins helped. However, recall is still at 92%, meaning that we would still miss 8% of the missing holes and let defects go by unnoticed.
Next, we explore a two-stage model approach in which we can improve the model performance further.
Two-stage model approach
For the two-stage model, we train two models: one for detecting pins and one for detecting if the pin is missing or not on zoomed-in cropped images of the pin. The following is an image from the pin detection dataset.
The data is similar to our previous experiment, in which we cropped the PCB into four or six cropped images. This time, we label all pins and don’t make any distinctions if the pin has a missing hole or not. We train this model with 522 images and test with 108 images, maintaining the same train/test split as previous experiments. The results are shown in the following table.
Evaluation Results

| F1 Score | Average Precision | Overall Recall |
| --- | --- | --- |
| 1.000 | 0.999 | 1.000 |

| Training Time | Training Dataset | Testing Dataset |
| --- | --- | --- |
| Trained in 1.581 hours | 1 label, 522 images | 1 label, 108 images |

Per Label Performance

| Label Name | F1 Score | Test Images | Precision | Recall | Assumed Threshold |
| --- | --- | --- | --- | --- | --- |
| pin | 1.000 | 108 | 0.999 | 1.000 | 0.617 |
The model detects the pins perfectly on this synthetic dataset.
Next, we build the model to make the distinction for missing holes. We use cropped images of the holes to train the second stage of the model, as shown in the following examples. This model is separate from the previous models because it’s a classification model and will be focused on the narrow task of determining if the pin has a missing hole.
We train this second-stage model on 16,624 images and test on 3,266, maintaining the same train/test splits as the previous experiments. The following table summarizes our results.
Evaluation Results

| F1 Score | Average Precision | Overall Recall |
| --- | --- | --- |
| 1.000 | 1.000 | 1.000 |

| Training Time | Training Dataset | Testing Dataset |
| --- | --- | --- |
| Trained in 6.660 hours | 2 labels, 16,624 images | 2 labels, 3,266 images |

Per Label Performance

| Label Name | F1 Score | Test Images | Precision | Recall | Assumed Threshold |
| --- | --- | --- | --- | --- | --- |
| anomaly | 1.000 | 88 | 1.000 | 1.000 | 0.960 |
| normal | 1.000 | 3,178 | 1.000 | 1.000 | 0.996 |
Again, we receive perfect precision and recall on this synthetic dataset. Combining the previous pin detection model with this second-stage missing hole classification model, we can build a model that outperforms any single-stage model.
The following table summarizes the experiments we conducted.
| Experiment | Type | Description | F1 Score | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| 1 | One-stage model | Object detection model to detect missing holes on full images | 0.468 | 0.75 | 0.34 |
| 2 | One-stage model | Object detection model to detect healthy pins and missing holes on cropped images | 0.967 | 0.989 | 0.945 |
| 3 | Two-stage model | Stage 1: Object detection on all pins | 1.000 | 0.999 | 1.000 |
|  |  | Stage 2: Image classification of healthy pin or missing holes | 1.000 | 1.000 | 1.000 |
|  |  | End-to-end average | 1.000 | 0.9995 | 1.000 |
Inference pipeline
You can use the following architecture to deploy the one-stage and two-stage models that we described in this post. The following main components are involved:
- Amazon API Gateway
- AWS Lambda
- An Amazon Rekognition custom endpoint
For one-stage models, you can send an input image to the API Gateway endpoint, followed by Lambda for any basic image preprocessing, and route to the Rekognition Custom Labels trained model endpoint. In our experiments, we explored one-stage models that detect only missing holes, as well as models that detect both missing holes and healthy pins.
For two-stage models, you can similarly send an image to the API Gateway endpoint, followed by Lambda. Lambda acts as an orchestrator that first calls the object detection model (trained using Rekognition Custom Labels), which generates the region of interest. The original image is then cropped in the Lambda function, and sent to another Rekognition Custom Labels classification model for detecting defects in each cropped image.
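The following handler is a rough sketch of that orchestration rather than production code: the project version ARNs, event fields and confidence threshold are placeholders, and it assumes the Lambda function has the Pillow library available for cropping.

```python
import io

import boto3
from PIL import Image

rekognition = boto3.client("rekognition")
s3 = boto3.client("s3")

# Placeholder ARNs for the two trained Rekognition Custom Labels model versions.
PIN_DETECTOR_ARN = "arn:aws:rekognition:REGION:ACCOUNT:project/pins/version/1"
DEFECT_CLASSIFIER_ARN = "arn:aws:rekognition:REGION:ACCOUNT:project/missing-hole/version/1"


def handler(event, context):
    """Two-stage inference: detect pins, then classify each cropped pin."""
    bucket, key = event["bucket"], event["key"]  # assumed event shape

    # Stage 1: object detection to find pin bounding boxes.
    detection = rekognition.detect_custom_labels(
        ProjectVersionArn=PIN_DETECTOR_ARN,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=50,
    )

    # Load the original high-resolution image so each pin can be cropped.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(io.BytesIO(body))
    width, height = image.size

    results = []
    for label in detection["CustomLabels"]:
        box = label.get("Geometry", {}).get("BoundingBox")
        if not box:
            continue
        left, top = int(box["Left"] * width), int(box["Top"] * height)
        crop = image.crop((left, top,
                           left + int(box["Width"] * width),
                           top + int(box["Height"] * height)))

        buffer = io.BytesIO()
        crop.save(buffer, format="JPEG")

        # Stage 2: classify the cropped pin as anomaly (missing hole) or normal.
        classification = rekognition.detect_custom_labels(
            ProjectVersionArn=DEFECT_CLASSIFIER_ARN,
            Image={"Bytes": buffer.getvalue()},
            MinConfidence=50,
        )
        results.append({"box": box, "labels": classification["CustomLabels"]})

    return {"defects": results}
```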
Conclusion
In this post, we trained one- and two-stage models to detect missing holes in PCBs using Rekognition Custom Labels. We reported results for various models; in our case, two-stage models outperformed other variants. We encourage customers with high-resolution imagery from other domains to test model performance with one- and two-stage models. Additionally, consider the following ways to expand the solution:
- Sliding window crops for your actual datasets
- Reusing your object detection models in the same pipeline
- Pre-labeling workflows using bounding box predictions
About the authors
Andreas Karagounis is a Data Science Manager at Accenture. He holds a master’s in Computer Science from Brown University. He has a background in computer vision and works with customers to solve their business challenges using data science and machine learning.
Yogesh Chaturvedi is a Principal Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.
Shreyas Subramanian is a Principal Data Scientist, and helps customers by using machine learning to solve their business challenges using the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning for accelerating optimization tasks.
Selimcan “Can” Sakar is a cloud-first developer and Solutions Architect at AWS Accenture Business Group with a focus on emerging technologies such as GenAI, ML, and blockchain. When he isn’t watching models converge, he can be seen biking or playing the clarinet.
Automatically redact PII for machine learning using Amazon SageMaker Data Wrangler
Customers increasingly want to use deep learning approaches such as large language models (LLMs) to automate the extraction of data and insights. For many industries, data that is useful for machine learning (ML) may contain personally identifiable information (PII). To ensure customer privacy and maintain regulatory compliance while training, fine-tuning, and using deep learning models, it’s often necessary to first redact PII from source data.
This post demonstrates how to use Amazon SageMaker Data Wrangler and Amazon Comprehend to automatically redact PII from tabular data as part of your machine learning operations (ML Ops) workflow.
Problem: ML data that contains PII
PII is defined as any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. PII is information that either directly identifies an individual (name, address, social security number or other identifying number or code, telephone number, email address, and so on) or information that an agency intends to use to identify specific individuals in conjunction with other data elements, namely, indirect identification.
Customers in business domains such as financial, retail, legal, and government deal with PII data on a regular basis. Due to various government regulations and rules, customers have to find a mechanism to handle this sensitive data with appropriate security measures to avoid regulatory fines, possible fraud, and defamation. PII redaction is the process of masking or removing sensitive information from a document so it can be used and distributed, while still protecting confidential information.
Businesses need to deliver delightful customer experiences and better business outcomes by using ML. Redaction of PII data is often a key first step to unlock the larger and richer data streams needed to use or fine-tune generative AI models, without worrying about whether their enterprise data (or that of their customers) will be compromised.
Solution overview
This solution uses Amazon Comprehend and SageMaker Data Wrangler to automatically redact PII data from a sample dataset.
Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover insights and relationships in unstructured data, with no managing infrastructure or ML experience required. It provides functionality to locate various PII entity types within text, for example names or credit card numbers. Although the latest generative AI models have demonstrated some PII redaction capability, they generally don’t provide a confidence score for PII identification or structured data describing what was redacted. The PII functionality of Amazon Comprehend returns both, enabling you to create redaction workflows that are fully auditable at scale. Additionally, using Amazon Comprehend with AWS PrivateLink means that customer data never leaves the AWS network and is continuously secured with the same data access and privacy controls as the rest of your applications.
Similar to Amazon Comprehend, Amazon Macie uses a rules-based engine to identify sensitive data (including PII) stored in Amazon Simple Storage Service (Amazon S3). However, its rules-based approach relies on having specific keywords that indicate sensitive data located close to that data (within 30 characters). In contrast, the NLP-based ML approach of Amazon Comprehend uses semantic understanding of longer chunks of text to identify PII, making it more useful for finding PII within unstructured data.
Additionally, for tabular data such as CSV or plain text files, Macie returns less detailed location information than Amazon Comprehend (either a row/column indicator or a line number, respectively, but not start and end character offsets). This makes Amazon Comprehend particularly helpful for redacting PII from unstructured text that may contain a mix of PII and non-PII words (for example, support tickets or LLM prompts) that is stored in a tabular format.
Amazon SageMaker provides purpose-built tools for ML teams to automate and standardize processes across the ML lifecycle. With SageMaker MLOps tools, teams can easily prepare, train, test, troubleshoot, deploy, and govern ML models at scale, boosting productivity of data scientists and ML engineers while maintaining model performance in production. The following diagram illustrates the SageMaker MLOps workflow.
SageMaker Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze datasets stored in locations such as Amazon S3 or Amazon Athena, a common first step in the ML lifecycle. You can use SageMaker Data Wrangler to simplify and streamline dataset preprocessing and feature engineering by either using built-in, no-code transformations or customizing with your own Python scripts.
Using Amazon Comprehend to redact PII as part of a SageMaker Data Wrangler data preparation workflow keeps all downstream uses of the data, such as model training or inference, in alignment with your organization’s PII requirements. You can integrate SageMaker Data Wrangler with Amazon SageMaker Pipelines to automate end-to-end ML operations, including data preparation and PII redaction. For more details, refer to Integrating SageMaker Data Wrangler with SageMaker Pipelines. The rest of this post demonstrates a SageMaker Data Wrangler flow that uses Amazon Comprehend to redact PII from text stored in tabular data format.
This solution uses a public synthetic dataset along with a custom SageMaker Data Wrangler flow, available as a file in GitHub. The steps to use the SageMaker Data Wrangler flow to redact PII are as follows:
- Open SageMaker Studio.
- Download the SageMaker Data Wrangler flow.
- Review the SageMaker Data Wrangler flow.
- Add a destination node.
- Create a SageMaker Data Wrangler export job.
This walkthrough, including running the export job, should take 20–25 minutes to complete.
Prerequisites
For this walkthrough, you should have the following:
- An AWS account.
- A SageMaker Studio domain and user. For details on setting these up, refer to Onboard to Amazon SageMaker Domain Using Quick setup. The SageMaker Studio execution role must have permission to call the Amazon Comprehend DetectPiiEntities action.
- An S3 bucket for the redacted results.
Open SageMaker Studio
To open SageMaker Studio, complete the following steps:
- On the SageMaker console, choose Studio in the navigation pane.
- Choose the domain and user profile.
- Choose Open Studio.
To get started with the new capabilities of SageMaker Data Wrangler, it’s recommended to upgrade to the latest release.
Download the SageMaker Data Wrangler flow
You first need to retrieve the SageMaker Data Wrangler flow file from GitHub and upload it to SageMaker Studio. Complete the following steps:
- Navigate to the SageMaker Data Wrangler redact-pii.flow file on GitHub.
- On GitHub, choose the download icon to download the flow file to your local computer.
- In SageMaker Studio, choose the file icon in the navigation pane.
- Choose the upload icon, then choose redact-pii.flow.
Review the SageMaker Data Wrangler flow
In SageMaker Studio, open redact-pii.flow. After a few minutes, the flow will finish loading and show the flow diagram (see the following screenshot). The flow contains six steps: an S3 Source step followed by five transformation steps.
On the flow diagram, choose the last step, Redact PII. The All Steps pane opens on the right and shows a list of the steps in the flow. You can expand each step to view details, change parameters, and potentially add custom code.
Let’s walk through each step in the flow.
Steps 1 (S3 Source) and 2 (Data types) are added by SageMaker Data Wrangler whenever data is imported for a new flow. In S3 Source, the S3 URI field points to the sample dataset, which is a CSV file stored in Amazon S3. The file contains roughly 116,000 rows, and the flow sets the value of the Sampling field to 1,000, which means that SageMaker Data Wrangler will sample 1,000 rows to display in the user interface. Data types sets the data type for each column of imported data.
Step 3 (Sampling) sets the number of rows SageMaker Data Wrangler will sample for an export job to 5,000, via the Approximate sample size field. Note that this is different from the number of rows sampled to display in the user interface (Step 1). To export data with more rows, you can increase this number or remove Step 3.
Steps 4, 5, and 6 use SageMaker Data Wrangler custom transforms. Custom transforms allow you to run your own Python or SQL code within a Data Wrangler flow. The custom code can be written in four ways:
- In SQL, using PySpark SQL to modify the dataset
- In Python, using a PySpark data frame and libraries to modify the dataset
- In Python, using a pandas data frame and libraries to modify the dataset
- In Python, using a user-defined function to modify a column of the dataset
The Python (pandas) approach requires your dataset to fit into memory and can only be run on a single instance, limiting its ability to scale efficiently. When working in Python with larger datasets, we recommend using either the Python (PySpark) or Python (user-defined function) approach. SageMaker Data Wrangler optimizes Python user-defined functions to provide performance similar to an Apache Spark plugin, without needing to know PySpark or Pandas. To make this solution as accessible as possible, this post uses a Python user-defined function written in pure Python.
Expand Step 4 (Make PII column) to see its details. This step combines different types of PII data from multiple columns into a single phrase that is saved in a new column, pii_col
. The following table shows an example row containing data.
customer_name | customer_job | billing_address | customer_email |
Katie | Journalist | 19009 Vang Squares Suite 805 | hboyd@gmail.com |
This is combined into the phrase “Katie is a Journalist who lives at 19009 Vang Squares Suite 805 and can be emailed at hboyd@gmail.com”. The phrase is saved in pii_col, which this post uses as the target column to redact.
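To make the Make PII column step concrete, the following pandas sketch builds the same phrase from the example row above. The column names come from the sample dataset, but the actual transform code in the downloaded flow may be written differently.

```python
import pandas as pd

# Example row from the sample dataset.
df = pd.DataFrame([{
    "customer_name": "Katie",
    "customer_job": "Journalist",
    "billing_address": "19009 Vang Squares Suite 805",
    "customer_email": "hboyd@gmail.com",
}])

# Combine the PII-bearing columns into a single phrase, saved in pii_col.
df["pii_col"] = (
    df["customer_name"] + " is a " + df["customer_job"]
    + " who lives at " + df["billing_address"]
    + " and can be emailed at " + df["customer_email"]
)
print(df["pii_col"].iloc[0])
# Katie is a Journalist who lives at 19009 Vang Squares Suite 805
# and can be emailed at hboyd@gmail.com
```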
Step 5 (Prep for redaction) takes a column to redact (pii_col) and creates a new column (pii_col_prep) that is ready for efficient redaction using Amazon Comprehend. To redact PII from a different column, you can change the Input column field of this step.
There are two factors to consider to efficiently redact data using Amazon Comprehend:
- The cost to detect PII is defined on a per-unit basis, where 1 unit = 100 characters, with a 3-unit minimum charge for each document. Because tabular data often contains small amounts of text per cell, it’s generally more time- and cost-efficient to combine text from multiple cells into a single document to send to Amazon Comprehend. Doing this avoids the accumulation of overhead from many repeated function calls and ensures that the data sent is always greater than the 3-unit minimum.
- Because we’re doing redaction as one step of a SageMaker Data Wrangler flow, we will be calling Amazon Comprehend synchronously. Amazon Comprehend sets a 100 KB (100,000 character) limit per synchronous function call, so we need to ensure that any text we send is under that limit.
Given these factors, Step 5 prepares the data to send to Amazon Comprehend by appending a delimiter string to the end of the text in each cell. For the delimiter, you can use any string that doesn’t occur in the column being redacted (ideally, one that is as few characters as possible, because they’re included in the Amazon Comprehend character total). Adding this cell delimiter allows us to optimize the call to Amazon Comprehend, and will be discussed further in Step 6.
Note that if the text in any individual cell is longer than the Amazon Comprehend limit, the code in this step truncates it to 100,000 characters (roughly equivalent to 15,000 words or 30 single-spaced pages). Although this amount of text is unlikely to be stored in a single cell, you can modify the transformation code to handle this edge case another way if needed.
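The following sketch shows one way the Prep for redaction logic could be written as a Python user-defined function. The delimiter string and the exact truncation arithmetic are assumptions for illustration; only the general approach (append a delimiter, enforce the 100,000-character limit) comes from the description above.

```python
import pandas as pd

CELL_DELIM = "[CELL]"   # assumed delimiter; any string absent from the data works
MAX_CHARS = 100_000     # Amazon Comprehend synchronous per-document limit

def custom_func(series: pd.Series) -> pd.Series:
    # Truncate oversized cells so that cell text plus delimiter stays under
    # the limit, then append the delimiter to mark the cell boundary.
    truncated = series.astype(str).str.slice(0, MAX_CHARS - len(CELL_DELIM))
    return truncated + CELL_DELIM
```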
Step 6 (Redact PII) takes a column name to redact as input (pii_col_prep) and saves the redacted text to a new column (pii_redacted). When you use a Python custom function transform, SageMaker Data Wrangler defines an empty custom_func that takes a pandas series (a column of text) as input and returns a modified pandas series of the same length. The following screenshot shows part of the Redact PII step.
The function custom_func contains two helper (inner) functions:
- make_text_chunks – Concatenates the text from individual cells in the series (including their delimiters) into longer strings (chunks) to send to Amazon Comprehend.
- redact_pii – Takes text as input, calls Amazon Comprehend to detect PII, redacts any that is found, and returns the redacted text. Redaction is done by replacing any PII text with the type of PII found in square brackets; for example, John Smith would be replaced with [NAME]. You can modify this function to replace PII with any string, including the empty string (“”) to remove it. You could also modify the function to check the confidence score of each PII entity and only redact if it’s above a specific threshold.
After the inner functions are defined, custom_func uses them to do the redaction, as shown in the following code excerpt. When the redaction is complete, it converts the chunks back into the original cells, which it saves in the pii_redacted column.
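The full code appears as a screenshot in SageMaker Studio, so the excerpt below is a hedged reconstruction rather than a copy of the flow's code. The helper names make_text_chunks and redact_pii, the chunking strategy, and the [TYPE] replacement format follow the description above; the delimiter constant, chunk-size limit, and error handling are assumptions.

```python
import boto3
import pandas as pd

comprehend = boto3.client("comprehend")

CELL_DELIM = "[CELL]"      # assumed delimiter appended by the Prep for redaction step
MAX_CHUNK_CHARS = 100_000  # Amazon Comprehend synchronous per-document limit

def custom_func(series: pd.Series) -> pd.Series:
    """Redact PII from each cell of the input column using Amazon Comprehend."""

    def make_text_chunks(cells, max_chars):
        # Concatenate delimiter-terminated cells into chunks under the size limit.
        chunks, current = [], ""
        for cell in cells:
            if current and len(current) + len(cell) > max_chars:
                chunks.append(current)
                current = ""
            current += cell
        if current:
            chunks.append(current)
        return chunks

    def redact_pii(text):
        # Detect PII entities and replace each one with its type in square
        # brackets, working right to left so earlier offsets stay valid.
        entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
        for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
            text = (
                text[: entity["BeginOffset"]]
                + f"[{entity['Type']}]"
                + text[entity["EndOffset"]:]
            )
        return text

    chunks = make_text_chunks(series.astype(str).tolist(), MAX_CHUNK_CHARS)
    redacted_text = "".join(redact_pii(chunk) for chunk in chunks)

    # Convert the chunks back into one value per original cell.
    redacted_cells = redacted_text.split(CELL_DELIM)[:-1]
    return pd.Series(redacted_cells, index=series.index)
```

In a variation of this sketch, you could replace the bracketed entity type with an empty string to remove PII entirely, or filter the returned entities on their Score field before redacting.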
Add a destination node
To see the results of your transformations, export the transformed data. SageMaker Data Wrangler supports exporting to Amazon S3, SageMaker Pipelines, Amazon SageMaker Feature Store, and Python code. To export the redacted data to Amazon S3, we first need to create a destination node:
- In the SageMaker Data Wrangler flow diagram, choose the plus sign next to the Redact PII step.
- Choose Add destination, then choose Amazon S3.
- Provide an output name for your transformed dataset.
- Browse or enter the S3 location to store the redacted data file.
- Choose Add destination.
You should now see the destination node at the end of your data flow.
Create a SageMaker Data Wrangler export job
Now that the destination node has been added, we can create the export job to process the dataset:
- In SageMaker Data Wrangler, choose Create job.
- The destination node you just added should already be selected. Choose Next.
- Accept the defaults for all other options, then choose Run.
This creates a SageMaker Processing job. To view the status of the job, navigate to the SageMaker console. In the navigation pane, expand the Processing section and choose Processing jobs. Redacting all 116,000 cells in the target column using the default export job settings (two ml.m5.4xlarge instances) takes roughly 8 minutes and costs approximately $0.25. When the job is complete, download the output file with the redacted column from Amazon S3.
Clean up
The SageMaker Data Wrangler application runs on an ml.m5.4xlarge instance. To shut it down, in SageMaker Studio, choose Running Terminals and Kernels in the navigation pane. In the RUNNING INSTANCES section, find the instance labeled Data Wrangler and choose the shutdown icon next to it. This shuts down the SageMaker Data Wrangler application running on the instance.
Conclusion
In this post, we discussed how to use custom transformations in SageMaker Data Wrangler and Amazon Comprehend to redact PII data from your ML dataset. You can download the SageMaker Data Wrangler flow and start redacting PII from your tabular data today.
For other ways to enhance your MLOps workflow using SageMaker Data Wrangler custom transformations, check out Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy. For more data preparation options, check out the blog post series that explains how to use Amazon Comprehend to redact, translate, and analyze text from either Amazon Athena or Amazon Redshift.
About the Authors
Tricia Jamison is a Senior Prototyping Architect on the AWS Prototyping and Cloud Acceleration (PACE) Team, where she helps AWS customers implement innovative solutions to challenging problems with machine learning, internet of things (IoT), and serverless technologies. She lives in New York City and enjoys basketball, long distance treks, and staying one step ahead of her children.
Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.
Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.
English learners can now practice speaking on Search
Learning a language can open up new opportunities in a person’s life. It can help people connect with those from different cultures, travel the world, and advance their career. English alone is estimated to have 1.5 billion learners worldwide. Yet proficiency in a new language is difficult to achieve, and many learners cite a lack of opportunity to practice speaking actively and receiving actionable feedback as a barrier to learning.
We are excited to announce a new feature of Google Search that helps people practice speaking and improve their language skills. Within the next few days, Android users in Argentina, Colombia, India (Hindi), Indonesia, Mexico, and Venezuela can get even more language support from Google through interactive speaking practice in English — expanding to more countries and languages in the future. Google Search is already a valuable tool for language learners, providing translations, definitions, and other resources to improve vocabulary. Now, learners translating to or from English on their Android phones will find a new English speaking practice experience with personalized feedback.
*A new feature of Google Search allows learners to practice speaking words in context.*
Learners are presented with real-life prompts and then form their own spoken answers using a provided vocabulary word. They engage in practice sessions of 3-5 minutes, getting personalized feedback and the option to sign up for daily reminders to keep practicing. With only a smartphone and some quality time, learners can practice at their own pace, anytime, anywhere.
Activities with personalized feedback, to supplement existing learning tools
Designed to be used alongside other learning services and resources, like personal tutoring, mobile apps, and classes, the new speaking practice feature on Google Search is another tool to assist learners on their journey.
We have partnered with linguists, teachers, and ESL/EFL pedagogical experts to create a speaking practice experience that is effective and motivating. Learners practice vocabulary in authentic contexts, and material is repeated over dynamic intervals to increase retention — approaches that are known to be effective in helping learners become confident speakers. As one partner of ours shared:
“Speaking in a given context is a skill that language learners often lack the opportunity to practice. Therefore this tool is very useful to complement classes and other resources.” – Judit Kormos, Professor, Lancaster University
We are also excited to be working with several language learning partners to surface content they are helping create and to connect them with learners around the world. We look forward to expanding this program further and working with any interested partner.
Personalized real-time feedback
Every learner is different, so delivering personalized feedback in real time is a key part of effective practice. Responses are analyzed to provide helpful, real-time suggestions and corrections.
The system gives semantic feedback, indicating whether the learner's response was relevant to the question and would be understood by a conversation partner. Grammar feedback provides insights into possible grammatical improvements, and a set of example answers at varying levels of language complexity gives concrete suggestions for alternative ways to respond in this context.
*The feedback is composed of three elements: semantic analysis, grammar correction, and example answers.*
Contextual translation
Among the several new technologies we developed, contextual translation provides the ability to translate individual words and phrases in context. During practice sessions, learners can tap on any word they don’t understand to see the translation of that word considering its context.
*Example of the contextual translation feature.*
This is a difficult technical task, since individual words in isolation often have multiple alternative meanings, and multiple words can form clusters of meaning that need to be translated in unison. Our novel approach translates the entire sentence, then estimates how the words in the original and the translated text relate to each other. This is commonly known as the word alignment problem.
*Example of a translated sentence pair and its word alignment. A deep learning alignment model connects the words that create the meaning to suggest a translation.*
The key technology piece that enables this functionality is a novel deep learning model developed in collaboration with the Google Translate team, called Deep Aligner. The basic idea is to take a multilingual language model trained on hundreds of languages, then fine-tune a novel alignment model on a set of word alignment examples (see the figure above for an example) provided by human experts, for several language pairs. From this, the single model can then accurately align any language pair, reaching state-of-the-art alignment error rate (AER, a metric to measure the quality of word alignments, where lower is better). This single new model has led to dramatic improvements in alignment quality across all tested language pairs, reducing average AER from 25% to 5% compared to alignment approaches based on Hidden Markov models (HMMs).
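Deep Aligner itself is not publicly released, but the general idea of embedding-based word alignment can be sketched as follows: encode both sentences with a multilingual model, compute pairwise similarities between token embeddings, and align each source token to its most similar target token. The model name and the greedy argmax matching below are illustrative assumptions, not the production system, which is fine-tuned on expert-provided alignments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # any multilingual encoder works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_tokens(sentence: str):
    """Return subword tokens and their contextual embeddings (specials removed)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens[1:-1], hidden[1:-1]                         # drop [CLS] and [SEP]

def align(source: str, target: str):
    """Greedy alignment: each source token maps to its most similar target token."""
    src_tokens, src_emb = embed_tokens(source)
    tgt_tokens, tgt_emb = embed_tokens(target)
    sim = torch.nn.functional.normalize(src_emb, dim=-1) @ \
          torch.nn.functional.normalize(tgt_emb, dim=-1).T    # (src_len, tgt_len)
    best = sim.argmax(dim=-1).tolist()
    return [(s, tgt_tokens[j]) for s, j in zip(src_tokens, best)]

print(align("I like green apples", "Me gustan las manzanas verdes"))
```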
*Alignment error rates (lower is better) between English (EN) and other languages.*
This model is also incorporated into Google’s translation APIs, where it greatly improves, for example, the formatting of translated PDFs and websites in Chrome, the translation of YouTube captions, and the quality of Google Cloud’s translation API.
Grammar feedback
To enable grammar feedback for accented spoken language, our research teams adapted grammar correction models for written text (see the blog and paper) to work on automatic speech recognition (ASR) transcriptions, specifically for the case of accented speech. The key step was fine-tuning the written text model on a corpus of human and ASR transcripts of accented speech, with expert-provided grammar corrections. Furthermore, inspired by previous work, the teams developed a novel edit-based output representation that leverages the high overlap between inputs and outputs and is particularly well suited for the short input sentences common in language learning settings.
The edit representation can be explained using an example:
- Input: I₁ am₂ so₃ bad₄ cooking₅
- Correction: I₁ am₂ so₃ bad₄ at₅ cooking₆
- Edits: (‘at’, 4, PREPOSITION, 4)
In the above, “at” is the word that is inserted at position 4 and “PREPOSITION” denotes this is an error involving prepositions. We used the error tag to select tag-dependent acceptance thresholds that improved the model further. The model increased the recall of grammar problems from 4.6% to 35%.
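As a concrete illustration of how such insertion edits could be applied to a tokenized input, here is a small sketch. It assumes the position field means "insert after the given 1-indexed token" and ignores the final field of the example tuple, whose meaning is not spelled out in this post.

```python
def apply_edits(tokens, edits):
    """Apply (word, position, tag) insertion edits to a token list."""
    corrected = list(tokens)
    # Apply right to left so earlier insertions don't shift later positions.
    for word, pos, _tag in sorted(edits, key=lambda e: e[1], reverse=True):
        corrected.insert(pos, word)  # insert after the pos-th (1-indexed) token
    return corrected

tokens = ["I", "am", "so", "bad", "cooking"]
edits = [("at", 4, "PREPOSITION")]
print(" ".join(apply_edits(tokens, edits)))  # I am so bad at cooking
```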
Some example output from our model and a model trained on written corpora:
| | Example 1 | Example 2 |
| --- | --- | --- |
| User input (transcribed speech) | I live of my profession. | I need a efficient card and reliable. |
| Text-based grammar model | I live by my profession. | I need an efficient card and a reliable. |
| New speech-optimized model | I live off my profession. | I need an efficient and reliable card. |
Semantic analysis
A primary goal of conversation is to communicate one’s intent clearly. Thus, we designed a feature that visually communicates to the learner whether their response was relevant to the context and would be understood by a partner. This is a difficult technical problem, since early language learners’ spoken responses can be syntactically unconventional. We had to carefully balance this technology to focus on the clarity of intent rather than correctness of syntax.
Our system utilizes a combination of two approaches:
- Sensibility classification: Large language models like LaMDA or PaLM are designed to give natural responses in a conversation, so it’s no surprise that they do well on the reverse: judging whether a given response is contextually sensible.
- Similarity to good responses: We used an encoder architecture to compare the learner’s input to a set of known good responses in a semantic embedding space. This comparison provides another useful signal on semantic relevance, further improving the quality of feedback and suggestions we provide (see the sketch below).
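As a rough sketch of the second approach, the snippet below compares a learner response to a few known good responses in a sentence-embedding space using the open-source sentence-transformers library. The model name, example responses, and acceptance threshold are assumptions for illustration; the production system uses its own encoder and calibration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

good_responses = [
    "I like to play the guitar on weekends.",
    "I enjoy playing guitar with my friends.",
]
learner_response = "I am playing guitar every Sunday."

good_emb = model.encode(good_responses, convert_to_tensor=True)
learner_emb = model.encode(learner_response, convert_to_tensor=True)

# Cosine similarity to the closest known good response.
similarity = util.cos_sim(learner_emb, good_emb).max().item()

# Assumed threshold, combined in practice with the sensibility classifier.
print("relevant" if similarity > 0.5 else "possibly off-topic", round(similarity, 2))
```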
*The system provides feedback about whether the response was relevant to the prompt and would be understood by a communication partner.*
ML-assisted content development
Our available practice activities present a mix of human-expert-created content and content created with AI assistance and human review. This includes speaking prompts and focus words, as well as sets of example answers that showcase meaningful and contextual responses.
*A list of example answers is provided when the learner receives feedback and when they tap the help button.*
Since learners have different levels of ability, the language complexity of the content has to be adjusted appropriately. Prior work on language complexity estimation focuses on text of paragraph length or longer, which differs significantly from the type of responses that our system processes. Thus, we developed novel models that can estimate the complexity of a single sentence, phrase, or even individual words. This is challenging because even a phrase composed of simple words can be hard for a language learner (e.g., “Let’s cut to the chase”). Our best model is based on BERT and achieves complexity predictions closest to human expert consensus. The model was pre-trained using a large set of LLM-labeled examples, and then fine-tuned using a human expert–labeled dataset.
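As a hedged sketch of how a BERT-based sentence-complexity regressor might be set up (the actual model, labels, and training data are not public), one could fine-tune a single-output regression head on scored sentences, for example with the Hugging Face Transformers library:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed setup: a BERT encoder with one regression output predicting a
# complexity score (e.g., 0.0-5.0, roughly aligned with the CEFR scale).
MODEL_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression"
)

# Tiny, made-up examples; real training would use LLM-labeled data for
# pre-training and an expert-labeled dataset for fine-tuning.
sentences = ["Do you like fruit?", "Let's cut to the chase."]
scores = torch.tensor([[0.5], [3.5]])

inputs = tokenizer(sentences, padding=True, return_tensors="pt")
outputs = model(**inputs, labels=scores)
outputs.loss.backward()   # one MSE training step would follow (optimizer.step())
print(outputs.logits)     # predicted complexity scores, one per sentence
```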
*Mean squared error of various approaches estimating content difficulty on a diverse corpus of ~450 conversational passages (text/transcriptions). Top row: human raters labeled the items on a scale from 0.0 to 5.0, roughly aligned to the CEFR scale (from A1 to C2). Bottom four rows: different models performed the same task, shown as the difference from the human expert consensus.*
Using this model, we can evaluate the difficulty of text items, offer a diverse range of suggestions, and most importantly challenge learners appropriately for their ability levels. For example, using our model to label examples, we can fine-tune our system to generate speaking prompts at various language complexity levels.
**Vocabulary focus words, to be elicited by the questions**

| | guitar | apple | lion |
| --- | --- | --- | --- |
| Simple | What do you like to play? | Do you like fruit? | Do you like big cats? |
| Intermediate | Do you play any musical instruments? | What is your favorite fruit? | What is your favorite animal? |
| Complex | What stringed instrument do you enjoy playing? | Which type of fruit do you enjoy eating for its crunchy texture and sweet flavor? | Do you enjoy watching large, powerful predators? |
Furthermore, content difficulty estimation is used to gradually increase the task difficulty over time, adapting to the learner’s progress.
Conclusion
With these latest updates, which will roll out over the next few days, Google Search has become even more helpful. If you are an Android user in India (Hindi), Indonesia, Argentina, Colombia, Mexico, or Venezuela, give it a try by translating to or from English with Google.
We look forward to expanding to more countries and languages in the future, and to offering partner practice content soon.
Acknowledgements
Many people were involved in the development of this project. Among many others, we thank our external advisers in the language learning field: Jeffrey Davitz, Judit Kormos, Deborah Healey, Anita Bowles, Susan Gaer, Andrea Revesz, Bradley Opatz, and Anne Mcquade.