Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

We address the problem of detecting device-directed speech that does not contain the specific wake-word traditionally used to invoke virtual assistants (VAs). Specifically, we focus on audio that comes from a touch-based invocation. Mitigating VA activation due to accidental button presses is critical for the user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent when no keyword is present is difficult. This also poses a challenge when creating the training/evaluation data…

Apple Machine Learning Research