Spontaneous speech emotion data usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce uncertainty in the labels because grader opinions vary. Grader variation is addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. As a consequence, this approach fails to consider ambiguous instances where a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as…Apple Machine Learning Research
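A minimal sketch of the contrast drawn in the abstract above: a consensus label chosen by majority vote versus a normalized distribution over grader votes. The vote list, emotion names, and function names are hypothetical. The empirical vote distribution is only a simple stand-in for the probability density function the abstract mentions; the truncated text does not say how that density is built or used.

```python
from collections import Counter

def consensus_label(votes):
    """Return the emotion with the most votes (ties go to the first one seen)."""
    return Counter(votes).most_common(1)[0][0]

def vote_distribution(votes):
    """Return the fraction of graders who chose each emotion."""
    counts = Counter(votes)
    total = len(votes)
    return {emotion: n / total for emotion, n in counts.items()}

# Hypothetical grades from five graders for one speech sample.
votes = ["angry", "sad", "angry", "sad", "angry"]

hard = consensus_label(votes)    # keeps only the winning emotion
soft = vote_distribution(votes)  # keeps the disagreement between graders
```

In this toy case the consensus label drops the two "sad" votes, while the distribution retains them as a 0.4 share.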
Universally Instance-Optimal Mechanisms for Private Statistical Estimation
We consider the problem of instance-optimal statistical estimation under the constraint of differential privacy, where mechanisms must adapt to the difficulty of the input dataset. We prove a new instance-specific lower bound using a new divergence and show that it characterizes the local minimax optimal rates for private statistical estimation. We propose two new mechanisms that are universally instance-optimal for general estimation problems up to logarithmic factors. Our first mechanism, the total variation mechanism, builds on the exponential mechanism with stable approximations of the total…
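For readers unfamiliar with the building block named above, here is a minimal sketch of the standard exponential mechanism, which selects a candidate with probability proportional to exp(ε·u / (2Δ)). This is not the paper's total variation mechanism. The dataset, candidate grid, utility function, and all function names are hypothetical choices made for illustration.

```python
import math
import random

def exponential_mechanism_probs(candidates, utility, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=random):
    """Draw one candidate according to the exponential-mechanism probabilities."""
    probs = exponential_mechanism_probs(candidates, utility, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical task: privately pick a value close to the median of a small dataset.
data = [1, 2, 2, 3, 9]
candidates = list(range(0, 11))

def utility(c):
    # Negative imbalance between points below and above c. Adding or removing
    # one record changes this value by at most 1, so the sensitivity is 1.
    below = sum(x < c for x in data)
    above = sum(x > c for x in data)
    return -abs(below - above)

probs = exponential_mechanism_probs(candidates, utility, epsilon=1.0, sensitivity=1.0)
choice = exponential_mechanism(candidates, utility, epsilon=1.0, sensitivity=1.0)
```

Candidates with higher utility are more likely but never certain to be chosen; smaller values of `epsilon` flatten the distribution toward uniform.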
Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization
In this work, we propose Mutual Reinforcing Data Synthesis (MRDS) within LLMs to improve the few-shot dialogue summarization task. Unlike prior methods that require external knowledge, we mutually reinforce the LLM’s dialogue synthesis and summarization capabilities, allowing them to complement each other during training and enhance overall performance. The dialogue synthesis capability is enhanced by directed preference optimization with preference scoring from the summarization capability. The summarization capability is enhanced by the additional high-quality dialogue-summary paired data produced…
The Role of Prosody in Spoken Question Answering
Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then synthesized into speech, and most models rely on automatic transcriptions of speech. This is to the detriment of prosody: additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of…
VibE: A Visual Analytics Workflow for Semantic Error Analysis of CVML Models at Subgroup Level
Effective error analysis is critical for the successful development and deployment of CVML models. One approach to understanding model errors is to summarize the common characteristics of error samples. This can be particularly challenging in tasks that utilize unstructured, complex data such as images, where patterns are not always obvious. Another method is to analyze error distributions across pre-defined categories, which requires analysts to hypothesize about potential error causes in advance. Forming such hypotheses without access to explicit labels or annotations makes it difficult to…
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these “unknown unknowns” is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate “unknown unknowns” in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio…
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream…
Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations
In this work, we dive into the fundamental challenges of evaluating Text2SQL solutions and highlight potential failure causes and the risks of relying on aggregate metrics in existing benchmarks. We identify two largely unaddressed limitations in current open benchmarks: (1) data quality issues in the evaluation data, mainly attributed to a failure to capture the probabilistic nature of translating a natural language description into a structured query (e.g., NL ambiguity), and (2) the bias that using different match functions as approximations for SQL equivalence can introduce. To…
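To illustrate the second limitation named above, here is a minimal sketch of two common match functions used as stand-ins for SQL equivalence: a normalized string match and an execution match on a single database instance. The table, the queries, and the function names are hypothetical and are not taken from the paper or from any benchmark.

```python
import sqlite3

def _normalize(query):
    # Crude normalization: lowercase and collapse whitespace. Note that this
    # would also lowercase string literals inside the query.
    return " ".join(query.lower().split())

def exact_match(pred_sql, gold_sql):
    """String comparison after crude normalization."""
    return _normalize(pred_sql) == _normalize(gold_sql)

def execution_match(pred_sql, gold_sql, conn):
    """Compare result rows, ignoring row order, on one database instance."""
    pred_rows = conn.execute(pred_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)

# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "eng", 120), ("Bo", "eng", 100), ("Cy", "ops", 90)],
)

gold = "SELECT name FROM employees WHERE salary > 95"
equivalent = "SELECT name FROM employees WHERE NOT (salary <= 95)"
coincidental = "SELECT name FROM employees WHERE dept = 'eng'"

strict_miss = exact_match(equivalent, gold)              # rejects an equivalent query
lenient_hit = execution_match(coincidental, gold, conn)  # accepts a different query
```

On this data the string match rejects a query that returns the same rows as the gold query, while the execution match accepts a query that only happens to agree on these three rows. Each approximation errs in a different direction, which is one way the choice of match function can bias a reported score.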
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework impacts their performance on downstream tasks. For instance, models pre-trained with targets that capture prosody learn representations suited for speaker-related tasks, while those pre-trained with targets that capture phonetics learn…
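The masked prediction objective described above can be sketched as a loss that is averaged only over masked positions. This is a simplified illustration, not the training code of any particular model: it assumes discrete target classes per frame, and the function name, inputs, and toy values are hypothetical.

```python
import math

def masked_prediction_loss(log_probs, targets, mask):
    """Average negative log-likelihood over masked positions only.

    log_probs[t][k] is the model's log-probability of target class k at frame t,
    targets[t] is the target class index for frame t, and mask[t] is True where
    the input frame was masked.
    """
    losses = [-log_probs[t][targets[t]] for t in range(len(targets)) if mask[t]]
    if not losses:
        raise ValueError("at least one frame must be masked")
    return sum(losses) / len(losses)

# Hypothetical model outputs for three frames and two target classes.
log_probs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.5), math.log(0.5)],
    [math.log(0.2), math.log(0.8)],
]
targets = [0, 1, 1]          # which class each frame should be assigned
mask = [False, True, True]   # only the last two frames were masked

loss = masked_prediction_loss(log_probs, targets, mask)
```

Changing what `targets` encodes (for example, classes that reflect prosody versus classes that reflect phonetics) changes what the model is rewarded for predicting, which is the design choice the abstract examines.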