The Geometries of Truth Are Orthogonal Across Tasks

This paper was presented at the Workshop on Reliable and Responsible Foundation Models at ICML 2025.
Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still mired by concerns on their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a “geometry of truth” can be learned from examples, in the sense that the activations that generate correct answers can be distinguished…Apple Machine Learning Research

SceneScout: Towards AI Agent-driven Access to Street View Imagery for Blind Users

People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people. In this work, we introduce SceneScout, a multimodal large language model (MLLM)-driven AI agent that…Apple Machine Learning Research

Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs

The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions compared to…Apple Machine Learning Research

Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Large-scale models are routinely trained on a mixture of different data sources.
Different data mixtures yield very different downstream performances.
We propose a novel architecture that can instantiate one model for each data mixture without having to re-train the model.
Our architecture consists of a bank of expert weights, which are linearly combined to instantiate one model.
We learn the linear combination coefficients as a function of the input histogram.
To train this architecture, we sample random histograms, instantiate the corresponding model, and backprop through one batch of data…Apple Machine Learning Research

Understanding Input Selectivity in Mamba

State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers.
Mamba introduces input selectivity to its SSM layer (S6) and
incorporates convolution and gating into its block definition.
While these modifications do improve Mamba’s performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture.
In this work, we demystify the role of input selectivity in Mamba, investigating its…Apple Machine Learning Research

The Super Weight in Large Language Models

Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM’s ability to generate text — increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters…Apple Machine Learning Research

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D…Apple Machine Learning Research

From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts

With advances in generative AI, there is increasing work towards creating autonomous agents that can manage daily tasks by operating user interfaces (UIs). While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions—particularly those that may be risky or irreversible—remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents. We began by developing a taxonomy of the impacts of mobile UI actions through a series of…Apple Machine Learning Research

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2—a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate
annotation noise…Apple Machine Learning Research

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented…Apple Machine Learning Research