Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores…
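A minimal sketch of the selection step described above, assuming documents and benchmark examples have already been embedded (random vectors stand in for a real text encoder here). Scoring each document by its best cosine similarity to any benchmark example and keeping the top fraction is one plausible reading of the truncated description, not necessarily the exact BETR procedure.

```python
import numpy as np

def select_documents(doc_embs, bench_embs, keep_frac=0.1):
    """Score pretraining documents by similarity to benchmark examples,
    then keep the top-scoring fraction.

    doc_embs:   (n_docs, d) array of document embeddings
    bench_embs: (n_bench, d) array of benchmark-example embeddings
    """
    # L2-normalize so dot products are cosine similarities.
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    bench_embs = bench_embs / np.linalg.norm(bench_embs, axis=1, keepdims=True)

    # Each document's score: its best match over the benchmark examples.
    sims = doc_embs @ bench_embs.T          # (n_docs, n_bench)
    scores = sims.max(axis=1)               # (n_docs,)

    # Keep the highest-scoring documents.
    n_keep = max(1, int(keep_frac * len(scores)))
    return np.argsort(-scores)[:n_keep]

# Toy usage with random embeddings standing in for a real encoder.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
bench = rng.normal(size=(32, 64))
print(select_documents(docs, bench, keep_frac=0.05)[:10])
```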

Apple Intelligence Foundation Language Models Tech Report 2025

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a ∼3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global–local attention to deliver high quality with competitive cost on Apple’s Private Cloud Compute…
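As a narrow illustration of just the interleaved global–local attention mentioned above (not of PT-MoE, track parallelism, or the actual Apple model), a sketch of how attention masks might alternate across layers; the window size and the one-global-layer-every-four pattern are assumptions.

```python
import numpy as np

def causal_mask(seq_len):
    # Standard causal mask: token i may attend to positions <= i.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def local_causal_mask(seq_len, window):
    # Sliding-window causal mask: attend only to the last `window` tokens.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

def layer_masks(n_layers, seq_len, window=4, global_every=4):
    """Interleave global and local attention across layers: one global
    (full causal) layer every `global_every` layers, local layers otherwise."""
    masks = []
    for layer in range(n_layers):
        if layer % global_every == 0:
            masks.append(causal_mask(seq_len))
        else:
            masks.append(local_causal_mask(seq_len, window))
    return masks

masks = layer_masks(n_layers=8, seq_len=10)
print("layer 0 (global):\n", masks[0].astype(int))
print("layer 1 (local):\n", masks[1].astype(int))
```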

PREAMBLE: Private and Efficient Aggregation via Block Sparse Vectors

We revisit the problem of secure aggregation of high-dimensional vectors in a two-server system such as Prio. These systems are typically used to aggregate vectors such as gradients in private federated learning, where the aggregate itself is protected via noise addition to ensure differential privacy. Existing approaches require communication scaling with the dimensionality, and thus limit the dimensionality of vectors one can efficiently process in this setup.
We propose PREAMBLE: Private Efficient Aggregation Mechanism via BLock-sparse Euclidean Vectors…
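A toy illustration of the setting only: Prio-style two-server aggregation where each client additively secret-shares just the non-zero blocks of its vector, the servers sum the shares they receive, and Gaussian noise protects the released aggregate. The block size, noise scale, and sharing details here are assumptions, not the PREAMBLE protocol itself.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BLOCK = 32, 8                       # vector dimension, block size
N_BLOCKS = DIM // BLOCK

def share_sparse_blocks(vec):
    """Additively secret-share only the non-zero blocks of `vec` between
    two servers; each server learns nothing from its share alone."""
    shares_a, shares_b = {}, {}
    for b in range(N_BLOCKS):
        block = vec[b * BLOCK:(b + 1) * BLOCK]
        if np.any(block != 0):           # skip all-zero blocks (block sparsity)
            r = rng.normal(size=BLOCK)
            shares_a[b] = r
            shares_b[b] = block - r
    return shares_a, shares_b

# Clients hold block-sparse updates; here two toy clients.
clients = [np.zeros(DIM), np.zeros(DIM)]
clients[0][0:BLOCK] = 1.0
clients[1][0:BLOCK] = 2.0
clients[1][BLOCK:2 * BLOCK] = -1.0

# Each server accumulates the shares it receives, per block.
agg_a = np.zeros(DIM)
agg_b = np.zeros(DIM)
for vec in clients:
    sa, sb = share_sparse_blocks(vec)
    for b, s in sa.items():
        agg_a[b * BLOCK:(b + 1) * BLOCK] += s
    for b, s in sb.items():
        agg_b[b * BLOCK:(b + 1) * BLOCK] += s

# Combining the two servers' sums reveals only the aggregate; Gaussian noise
# is added so the release satisfies differential privacy (the noise scale
# here is illustrative, not calibrated to a specific epsilon).
aggregate = agg_a + agg_b + rng.normal(scale=0.1, size=DIM)
print(np.round(aggregate, 1))
```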

ILuvUI: Instruction-Tuned Language-Vision Modeling of UIs from Machine Conversations

Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI…
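A hedged sketch of what such a data-generation recipe could look like in code: a pixel-based detector extracts UI elements from a screenshot, and their descriptions are turned into an LLM prompt that asks for a grounded Q&A conversation. The `detect_ui_elements` stub, the element schema, and the prompt wording are all hypothetical, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    role: str        # e.g. "button", "text field"
    label: str       # visible text or accessibility label
    bounds: tuple    # (x, y, width, height) in pixels

def detect_ui_elements(screenshot_path):
    """Stub for a pixel-based UI element detector; a real pipeline would run
    icon/text/widget detectors over the screenshot."""
    return [
        UIElement("button", "Sign In", (20, 400, 120, 44)),
        UIElement("text field", "Email", (20, 300, 280, 40)),
    ]

def build_conversation_prompt(elements):
    """Turn detected elements into an LLM prompt asking for a Q&A-style
    conversation grounded in the screen (prompt wording is illustrative)."""
    listing = "\n".join(
        f"- {e.role} '{e.label}' at {e.bounds}" for e in elements
    )
    return (
        "You are shown a mobile UI with these elements:\n"
        f"{listing}\n"
        "Generate a short question-and-answer conversation about this screen."
    )

elements = detect_ui_elements("screenshot.png")
print(build_conversation_prompt(elements))
```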

AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn’s internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via…
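To make the encapsulation idea concrete, here is a generic config-and-registry composition pattern in the spirit of the description; it is explicitly not AXLearn's actual API, and every name below is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A tiny registry: components are referenced by name in configs, so swapping
# an implementation never requires touching the code that assembles it.
REGISTRY: Dict[str, Callable] = {}

def register(name):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("adamw")
def make_adamw(lr):
    return f"AdamW(lr={lr})"          # stand-in for a real optimizer object

@register("sgd")
def make_sgd(lr):
    return f"SGD(lr={lr})"

@dataclass
class TrainerConfig:
    optimizer: str = "adamw"
    lr: float = 1e-3
    mesh: tuple = ("data", "model")   # logical device-mesh axes

def build_trainer(cfg: TrainerConfig):
    # The trainer only sees the config surface, never concrete classes,
    # so the same assembly code runs on different hardware back ends.
    opt = REGISTRY[cfg.optimizer](cfg.lr)
    return {"optimizer": opt, "mesh": cfg.mesh}

print(build_trainer(TrainerConfig(optimizer="sgd", lr=0.1)))
```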

Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers, and Gradient Clipping

While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge…
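A minimal sketch of one differentially private federated round under the standard recipe this line of work builds on: clip each client's update to an L2 bound, average, and add Gaussian noise. The noise scale is illustrative rather than calibrated, and the adaptive optimizers and per-layer clipping variants studied in the paper are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_update(update, clip_norm):
    """Scale a client's model update so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_federated_round(client_updates, clip_norm=1.0, noise_multiplier=1.0):
    """One round of DP federated averaging: clip, average, add Gaussian noise.
    The noise standard deviation follows the Gaussian mechanism with
    per-round sensitivity clip_norm / n_clients (illustrative choice)."""
    n = len(client_updates)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    mean_update = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / n
    return mean_update + rng.normal(scale=sigma, size=mean_update.shape)

# Toy usage: three clients with updates of very different magnitudes,
# mimicking the gradient heterogeneity the abstract describes.
updates = [rng.normal(scale=s, size=16) for s in (0.1, 1.0, 10.0)]
print(np.round(dp_federated_round(updates), 3))
```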

Overcoming Vocabulary Constraints with Pixel-level Fallback

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that…
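A toy illustration of the idea of producing input embeddings from rendered text instead of a subword vocabulary: each character is rasterized to a small bitmap (here a deterministic stand-in for a real font renderer), and flattened patches are linearly projected to the language model's hidden size. The patch size, projection, and rendering stub are all assumptions, not the paper's encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, HIDDEN = 8, 32        # 8x8 glyph bitmaps, toy LM hidden size

def render_char(ch):
    """Toy rasterizer: derive a deterministic 8x8 binary bitmap from the
    character's codepoint. A real system would render with an actual font."""
    bit_rng = np.random.default_rng(ord(ch))
    return bit_rng.integers(0, 2, size=(PATCH, PATCH)).astype(np.float32)

def render_text(text):
    # One glyph bitmap per character -> (len(text), PATCH, PATCH)
    return np.stack([render_char(c) for c in text])

# A fixed linear projection maps each flattened glyph patch to an input
# embedding, bypassing any subword vocabulary.
projection = rng.normal(scale=0.02, size=(PATCH * PATCH, HIDDEN))

def pixel_embeddings(text):
    patches = render_text(text).reshape(len(text), -1)    # (T, 64)
    return patches @ projection                           # (T, HIDDEN)

embs = pixel_embeddings("año 2024")   # handles any script, no tokenizer needed
print(embs.shape)
```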

Addressing Misspecification in Simulation-based Inference through Data-driven Calibration

Driven by steady progress in deep generative modeling, simulation-based inference (SBI) has emerged as the workhorse for inferring the parameters of stochastic simulators. However, recent work has demonstrated that model misspecification can compromise the reliability of SBI, preventing its adoption in important applications where only misspecified simulators are available. This work introduces robust posterior estimation (RoPE), a framework that overcomes model misspecification with a small real-world calibration set of ground-truth parameter measurements. We formalize the misspecification…
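For readers unfamiliar with the setting, a crude rejection-sampling sketch of simulation-based inference on a toy simulator: sample parameters from a prior, simulate, and keep draws whose summary statistic matches the observation. This only illustrates the SBI setup; the RoPE calibration procedure is not described in the excerpt and is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n=50):
    """Toy stochastic simulator: Gaussian observations with unknown mean theta."""
    return rng.normal(loc=theta, scale=1.0, size=n)

# "Observed" data generated from an unknown true parameter.
true_theta = 1.5
x_obs = simulator(true_theta)

def rejection_sbi(x_obs, n_sims=20000, eps=0.05):
    """Crude simulation-based inference: sample theta from the prior, simulate,
    and accept draws whose summary statistic is close to the observed one."""
    summary_obs = x_obs.mean()
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(-5, 5)               # prior over parameters
        x_sim = simulator(theta)
        if abs(x_sim.mean() - summary_obs) < eps:
            accepted.append(theta)
    return np.array(accepted)

posterior_samples = rejection_sbi(x_obs)
print(f"posterior mean ~ {posterior_samples.mean():.2f} "
      f"({len(posterior_samples)} accepted draws)")
```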

Target Concrete Score Matching: A Holistic Framework for Discrete Diffusion

Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including…
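As background for the discrete diffusion setting only, a sketch of a standard masked (absorbing-state) forward corruption process; the TCSM objective itself is not specified in the excerpt and is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100
MASK_ID = VOCAB              # extra "mask" symbol used as the absorbing state

def forward_corrupt(tokens, t):
    """Masked discrete diffusion forward process: at noise level t in [0, 1],
    each token is independently replaced by the mask symbol with probability t."""
    mask = rng.random(tokens.shape) < t
    corrupted = np.where(mask, MASK_ID, tokens)
    return corrupted, mask

# Toy usage: corrupt a token sequence at increasing noise levels; a reverse
# model would be trained to undo this corruption.
x0 = rng.integers(0, VOCAB, size=12)
for t in (0.25, 0.5, 0.9):
    xt, mask = forward_corrupt(x0, t)
    print(f"t={t}: {xt}")
```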

Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency

Diffusion models are generating ever more realistic images. Yet, when generating images repeatedly with the same prompt, practitioners often obtain slight variations of the same, highly likely mode. As a result, most models fail to reflect the inherent diversity seen in data, which hinders their relevance to creative tasks or ability to power world models. This work proposes a highly effective and general method to repel generated images away from a reference set of images. This is achieved by introducing data-driven repellence terms within diffusions dynamically, throughout their…
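A toy 2-D reading of the idea: during iterative sampling, add a repellence term that pushes the current sample away from any reference point it comes within a fixed radius of, and is zero elsewhere (hence sparse along the trajectory). The radius, strength, and Langevin-style sampler below are assumptions, not the paper's exact update rule.

```python
import numpy as np

rng = np.random.default_rng(0)

REFERENCES = np.array([[0.0, 0.0], [2.0, 2.0]])   # previously generated / protected points
RADIUS = 1.0                                       # repellence only acts inside this ball

def repellence(x):
    """Sum of outward pushes from every reference point closer than RADIUS;
    zero outside the balls, so the term is sparse along the trajectory."""
    force = np.zeros_like(x)
    for ref in REFERENCES:
        diff = x - ref
        dist = np.linalg.norm(diff)
        if dist < RADIUS:
            force += (RADIUS - dist) * diff / (dist + 1e-8)
    return force

def sample(steps=200, step_size=0.05):
    """Toy Langevin-style sampler drifting toward a likely mode at (0, 0),
    with the repellence term keeping it away from the reference set."""
    x = np.array([3.0, 3.0])
    target = np.array([0.0, 0.0])
    for _ in range(steps):
        drift = -(x - target)                      # pull toward the likely mode
        noise = rng.normal(scale=np.sqrt(step_size), size=2)
        x = x + step_size * (drift + 4.0 * repellence(x)) + 0.1 * noise
    return x

print(np.round(sample(), 2))   # settles away from the mode at (0, 0) that
                               # plain sampling would collapse onto
```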