Carnegie Mellon University at ICML 2025

CMU researchers are presenting 127 papers at the Forty-Second International Conference on Machine Learning (ICML 2025), held July 13-19 at the Vancouver Convention Center. Below is a quick overview of the areas our researchers are working on.

Oral Papers

Expected Variational Inequalities

Authors: Brian Zhang, Ioannis Anagnostides, Emanuel Tewolde, Ratip Emin Berker, Gabriele Farina, Vincent Conitzer, Tuomas Sandholm

This paper introduces expected variational inequalities (EVIs), a relaxed version of variational inequalities (VIs) where the goal is to find a distribution that satisfies the VI condition in expectation. While VIs are generally hard to solve, the authors show that EVIs can be solved efficiently, even under challenging, non-monotone conditions, by leveraging ideas from game theory. EVIs generalize the concept of correlated equilibria and unify various results across smooth games, constrained games, and settings with non-concave utilities, making them broadly applicable beyond traditional game-theoretic contexts.
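
To make the relaxation concrete, here is a rough formalization of the two problems; the notation is ours, and sign conventions may differ from the paper's.

```latex
% Standard (Stampacchia) VI: a single point must satisfy the
% inequality against every deviation.
\[
\text{(VI)}\quad \text{find } x^\ast \in \mathcal{X} \text{ such that } \langle F(x^\ast),\, x - x^\ast \rangle \ge 0 \quad \forall x \in \mathcal{X}.
\]
% EVI: relax the point to a distribution and require the inequality
% only in expectation over that distribution.
\[
\text{(EVI)}\quad \text{find } \mu \in \Delta(\mathcal{X}) \text{ such that } \mathbb{E}_{x \sim \mu}\!\big[\langle F(x),\, x' - x \rangle\big] \ge 0 \quad \forall x' \in \mathcal{X}.
\]
```

Any VI solution yields an EVI solution by taking μ to be a point mass at x*, which is the sense in which EVIs are a relaxation; the move from points to distributions mirrors how correlated equilibria relax Nash equilibria.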

Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang

This paper shows that voting-based benchmarks for evaluating LLMs (such as Chatbot Arena) can be vulnerable to adversarial manipulation if proper defenses aren’t in place. The authors show that an attacker can identify which model generated a response and then strategically vote to boost or demote specific models, altering the leaderboard with only around a thousand votes in a simulated environment. They collaborate with Chatbot Arena’s developers to propose and implement security measures such as reCAPTCHA and login requirements that significantly raise the cost of such attacks and enhance the platform’s robustness.
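
As a toy illustration of why targeted voting is so effective (this is our simulation, not the paper's experimental setup; model names, vote counts, and the Elo-style update rule are stand-ins):

```python
# Illustrative simulation: a small number of targeted votes shifts an
# Elo-style leaderboard. All quantities here are made up for demonstration.
import random

K = 4.0  # Elo update step size

def expected(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def update(ratings, winner, loser):
    gain = K * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

models = [f"model_{i}" for i in range(10)]
true_strength = {m: 1500 + 30 * i for i, m in enumerate(models)}
ratings = {m: 1500.0 for m in models}
target = "model_0"  # weakest model; the attacker wants to boost it

random.seed(0)
for _ in range(50_000):  # honest crowd votes
    a, b = random.sample(models, 2)
    if random.random() < expected(true_strength[a], true_strength[b]):
        update(ratings, a, b)
    else:
        update(ratings, b, a)
print("target rating before attack:", round(ratings[target], 1))

# Attacker: identify the target's anonymized responses (the paper shows
# this is feasible), then always vote for it.
for _ in range(1_000):
    other = random.choice([m for m in models if m != target])
    update(ratings, target, other)
print("target rating after 1,000 adversarial votes:", round(ratings[target], 1))
```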

High-Dimensional Prediction for Sequential Decision Making

Authors: Georgy Noarov, Ramya Ramalingam, Aaron Roth, Stephan Xie

This paper presents a new algorithmic framework for making reliable, multi-dimensional forecasts in adversarial, nonstationary environments. Unlike existing online learning methods, this approach offers simultaneous performance guarantees for many agents, even when they face different objectives, act over large action spaces, or care about specific conditions (e.g. weather or route choice). The algorithm ensures low bias across many conditional events and enables each agent to achieve strong guarantees like diminishing regret. Applications include efficient solutions for online combinatorial optimization and multicalibration.
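
For intuition, here is a minimal sketch (ours, not the paper's algorithm) of what auditing "low bias across conditional events" means for a vector-valued forecaster, with synthetic data and two hypothetical weather events:

```python
# Our illustration: conditional-bias audit of a d-dimensional forecaster.
# The data, forecaster, and events are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
T, d = 20_000, 3
context = rng.integers(0, 2, T)                      # e.g. 0 = clear, 1 = rain
outcome = rng.random((T, d)) + 0.2 * context[:, None]
forecast = outcome + rng.normal(0, 0.1, (T, d))      # forecaster being audited

# The guarantee must hold not just overall but on each conditioning event.
for name, mask in [("clear", context == 0), ("rain", context == 1)]:
    bias = np.abs((forecast[mask] - outcome[mask]).mean(axis=0))
    print(f"bias given {name}: {np.round(bias, 4)}")
```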

LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Authors: Parshin Shojaee, Ngoc Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa Doan, Chandan Reddy

This paper introduces LLM-SRBench, a new benchmark designed to rigorously evaluate the ability of LLMs to discover scientific equations (rather than merely recall them from training data). Existing tests often rely on well-known equations, making it hard to tell whether models are truly reasoning or just memorizing. LLM-SRBench addresses this by including 239 challenging problems across four scientific domains, split into two categories: one that disguises familiar physics equations (LSR-Transform) and another that features fully synthetic, reasoning-driven tasks (LSR-Synth). Evaluations show that even the best current models only achieve 31.5% accuracy, highlighting the difficulty of the task and establishing LLM-SRBench as a valuable tool for driving progress in LLM-based scientific discovery.
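
An equation-discovery benchmark ultimately has to grade a candidate symbolic expression against data generated by the hidden ground truth. This toy scorer is our illustration of that step, not LLM-SRBench's official evaluation code:

```python
# Illustrative scorer: grade a proposed equation by its numeric fit to
# held-out data from the true equation. Both equations are made up here.
import numpy as np

def true_law(x):             # hidden ground-truth equation
    return 3.0 * x**2 + np.sin(x)

def llm_candidate(x):        # hypothetical equation proposed by an LLM
    return 3.0 * x**2 + x    # right leading term, wrong functional form

x = np.linspace(-3, 3, 500)  # held-out evaluation points
nmse = np.mean((true_law(x) - llm_candidate(x)) ** 2) / np.var(true_law(x))
print(f"normalized MSE of candidate: {nmse:.4f}")  # low enough => "discovered"
```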

On Differential Privacy for Adaptively Solving Search Problems via Sketching

Authors: Shiyuan Feng, Ying Feng, George Li, Zhao Song, David Woodruff, Lichen Zhang

This paper explores how to use differential privacy to protect against information leakage in adaptive search queries, a harder problem than traditional private estimation tasks. Unlike prior work that only returns numerical summaries (e.g., cost), the authors design algorithms that return actual solutions, like nearest neighbors or regression vectors, even when the inputs or queries change over time. They show how key problem parameters (like the number of approximate near neighbors or condition number of the data matrix) affect the performance of these private algorithms. This work has practical implications for AI systems that rely on private database searches or real-time regression, enabling them to provide useful results while safeguarding sensitive information from attackers.
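
The paper's sketching-based algorithms go well beyond a short snippet, but the textbook "output perturbation" idea conveys the flavor of returning an actual solution (here a regression vector) privately. Note the noise scale below is schematic, not a calibrated privacy guarantee:

```python
# Generic output-perturbation baseline for intuition only; this is NOT the
# paper's algorithm, and sigma is not calibrated to any (eps, delta).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))                        # private data matrix
x_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
b = A @ x_true + rng.normal(0, 0.1, 200)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]        # exact regression vector
x_private = x_star + rng.normal(0, 0.05, x_star.shape)  # schematic noise
print("exact solution:  ", np.round(x_star, 3))
print("private release: ", np.round(x_private, 3))
```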

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

Authors: Vaishnavh Nagarajan, Chen Wu, Charles Ding, Aditi Raghunathan

This paper proposes a set of simple, abstract tasks designed to probe the creative limits of today’s language models in a controlled and measurable way. These tasks mimic real-world open-ended challenges like generating analogies or designing puzzles, where success requires discovering new connections or constructing novel patterns. The authors show that standard next-token prediction tends to be short-sighted and overly reliant on memorization, while alternative approaches like teacherless training and diffusion models produce more diverse, original outputs. They also introduce a technique called seed-conditioning, which adds randomness at the input rather than the output and can improve coherence without sacrificing creativity.
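
A minimal sketch of seed-conditioning as described (our reading, with a stand-in decoder): randomness enters only through a random prefix at the input, while decoding itself stays greedy and deterministic.

```python
# Sketch of seed-conditioning; `generate_fn` stands in for any greedy decoder.
import random
import string

def seed_conditioned_generate(generate_fn, prompt: str, seed_len: int = 8) -> str:
    """Randomness enters only via a random input prefix; decoding is greedy."""
    seed = "".join(random.choices(string.ascii_lowercase, k=seed_len))
    return generate_fn(f"[seed: {seed}]\n{prompt}")

# Stand-in deterministic decoder so the sketch runs end to end.
def demo_decoder(text: str) -> str:
    return f"<greedy completion conditioned on {text.splitlines()[0]}>"

print(seed_conditioned_generate(demo_decoder, "Design a word puzzle."))
```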

Training a Generally Curious Agent

Authors: Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Rahman, Zico Kolter, Jeff Schneider, Russ Salakhutdinov

This paper introduces Paprika, a fine-tuning method that equips language models with general decision-making and exploration strategies, enabling them to adapt to new tasks through interaction alone (i.e. without further training). Paprika trains models on synthetic environments requiring different exploration behaviors, encouraging them to learn flexible strategies rather than memorizing solutions. To improve efficiency, it uses a curriculum learning-based approach that prioritizes tasks with high learning value, making the most of limited interaction data. Models trained with Paprika show strong transfer to completely new tasks, suggesting a promising direction for building AI agents that can learn to solve unfamiliar, sequential problems with minimal supervision.
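
The curriculum idea can be sketched in a few lines (our illustration; the task names and learning-value estimates below are hypothetical): sample training tasks in proportion to how much the model currently stands to learn from them.

```python
# Toy curriculum sampler: interaction budget flows to high-learning-value tasks.
import random

# Hypothetical per-task learning-progress estimates (higher = more to gain).
learning_value = {"bandit": 0.02, "twenty_questions": 0.30, "mastermind": 0.15}

def sample_task(values, temperature=0.25):
    # Sharpen the distribution so high-value tasks dominate without
    # starving the others entirely.
    weights = [max(v, 1e-6) ** (1 / temperature) for v in values.values()]
    return random.choices(list(values), weights=weights, k=1)[0]

random.seed(0)
picks = [sample_task(learning_value) for _ in range(1_000)]
print({t: picks.count(t) for t in learning_value})
```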

Spotlight Papers

GMAIL: Generative Modality Alignment for Generated Image Learning

Authors: Shentong Mo, Sukmin Yun

Generative models can create realistic images that could help train machine learning models, but treating them as if they were real images can hurt performance because of the distribution gap between generated and real data. This paper introduces a method called GMAIL that treats real and generated images as separate types (or modalities) and aligns them in a shared latent space during training, rather than just mixing them at the pixel level. The approach fine-tunes models on generated data using a special loss to bridge the gap, then uses these aligned models to improve training on tasks like image captioning and retrieval. The results show that GMAIL improves performance on several vision-language tasks and scales well as more generated data is added.
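
A schematic of the alignment step (our sketch, not the GMAIL implementation): embed paired real and generated images and penalize their cosine distance in the shared latent space.

```python
# Toy alignment loss between the "real" and "generated" modalities.
import torch
import torch.nn.functional as F

def alignment_loss(z_real: torch.Tensor, z_gen: torch.Tensor) -> torch.Tensor:
    """Mean cosine distance between paired real/generated image embeddings."""
    z_real = F.normalize(z_real, dim=-1)
    z_gen = F.normalize(z_gen, dim=-1)
    return 1.0 - (z_real * z_gen).sum(dim=-1).mean()

z_r, z_g = torch.randn(32, 512), torch.randn(32, 512)
print(alignment_loss(z_r, z_g))   # ~1.0 for random pairs, -> 0.0 as they align
```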

LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Authors: Paul McVay, Sergio Arnaud, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mahmoud Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

LOCATE 3D is a model that can find specific objects in 3D scenes based on natural language descriptions (like “the small coffee table between the sofa and the lamp”). It achieves state-of-the-art performance on standard benchmarks and works well in real-world settings, like on robots or AR devices, by using RGB-D sensor data. A key component is 3D-JEPA, a new self-supervised learning method that uses features from 2D vision models (like CLIP or DINO) to understand 3D point clouds through masked prediction tasks. The model is trained on a newly introduced large dataset (130K+ examples), helping it generalize better across different environments.
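
The masked-prediction objective at the heart of 3D-JEPA can be sketched roughly as follows (our toy version, not the actual architecture), with random tensors standing in for point tokens and lifted 2D features:

```python
# Toy masked feature prediction: hide some point tokens, regress their
# features toward targets lifted from 2D vision models (e.g. CLIP/DINO).
import torch
import torch.nn as nn

n_points, feat_dim = 1024, 256
targets = torch.randn(n_points, feat_dim)   # stand-in for lifted 2D features
mask = torch.rand(n_points) < 0.5           # which point tokens to hide

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
    num_layers=2,
)
tokens = targets.clone()
tokens[mask] = 0.0                           # zero out masked tokens at the input
pred = encoder(tokens.unsqueeze(0)).squeeze(0)
loss = ((pred[mask] - targets[mask]) ** 2).mean()   # regress the hidden features
print(loss.item())
```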

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj

This paper introduces MAETok, a masked autoencoder designed to create a high-quality, semantically meaningful latent space for diffusion models. The authors show that having a well-structured latent space, meaning fewer Gaussian modes and more discriminative features, leads to better image generation without needing complex variational autoencoders. MAETok outperforms existing methods on ImageNet using just 128 tokens, and it’s also much faster: 76× quicker to train and 31× faster during inference. The key takeaway is that the structure of the latent space, not variational constraints, is what truly matters for high-quality diffusion-based generation.
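
A toy version of the masked-autoencoder tokenizer idea (ours; MAETok is a full transformer, and the latent size here is shrunk for illustration): encode visible patches into compact latent tokens and reconstruct pixels only at masked positions.

```python
# Toy MAE-style tokenizer: the compact latent tokens z are what a diffusion
# model would generate in. Encoder/decoder are stand-in linear layers.
import torch
import torch.nn as nn

patches = torch.randn(196, 768)              # 14x14 image patches, flattened
mask = torch.rand(196) < 0.75                # aggressive masking, MAE-style

enc, dec = nn.Linear(768, 64), nn.Linear(64, 768)   # stand-in encoder/decoder
mask_token = torch.zeros(64, requires_grad=True)    # learned placeholder token

z = torch.zeros(196, 64)
z[~mask] = enc(patches[~mask])               # encode only the visible patches
z[mask] = mask_token                         # fill masked slots with the token
recon = dec(z)
loss = ((recon[mask] - patches[mask]) ** 2).mean()  # reconstruct what was hidden
print(z.shape, loss.item())
```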

Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI

Authors: Shayne Longpre, Kevin Klyman, Ruth Elisabeth Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark Jaycox, Markus Anderljung, Nadine Johnson, Nicholas Carlini, Nicolas Miailhe, Nik Marda, Peter Henderson, Rebecca Portnoff, Rebecca Weiss, Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, Arvind Narayanan

This paper highlights the lack of robust systems for identifying and reporting flaws in general-purpose AI (GPAI), especially compared to mature fields like software security. The authors propose three key solutions: (1) standardized reporting formats and engagement rules to streamline flaw reporting and triaging, (2) formal disclosure programs with legal protections for researchers (similar to bug bounties), and (3) better infrastructure for distributing flaw reports to relevant stakeholders. These steps aim to address growing risks like jailbreaks and cross-system vulnerabilities, ultimately improving the safety and accountability of GPAI systems.

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

This paper explores how to best scale test-time compute for large language models (LLMs), comparing two strategies: (1) distilling search traces (verifier-free, or VF) and (2) using verifiers or rewards to guide learning (verifier-based, or VB). The authors show—both theoretically and through experiments—that VB methods significantly outperform VF ones when working with limited compute or data. They explain that this performance gap grows as models and tasks get more complex, especially when solution paths vary in style or quality. Ultimately, the paper argues that verification is essential for effectively scaling LLM performance, especially for reasoning tasks.
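
The contrast can be caricatured in a few lines (our toy, with stand-in quality and reward functions): without a verifier there is no principled way to choose among sampled solutions, while even a noisy verifier makes best-of-n selection pay off.

```python
# Toy VF-vs-VB contrast at test time.
import random

random.seed(0)

def sample_solution():
    return random.gauss(0.5, 0.2)            # stand-in "solution quality"

def verifier(quality):
    return quality + random.gauss(0, 0.05)   # noisy but informative reward

samples = [sample_solution() for _ in range(16)]
vf_pick = samples[0]                  # verifier-free: no basis for choosing
vb_pick = max(samples, key=verifier)  # verifier-based: best-of-16
print(f"VF quality: {vf_pick:.3f}   VB quality: {vb_pick:.3f}")
```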

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen

As long-context LLMs become more common, their growing memory demands during inference slow down performance, especially due to the expanding key-value (KV) cache. This paper introduces ShadowKV, a system that significantly improves throughput by compressing the key cache using low-rank representations and offloading the value cache without major latency costs. It reconstructs only the necessary KV pairs during decoding to maintain speed and accuracy. Experiments show ShadowKV supports much larger batch sizes (up to 6×) and improves throughput by over 3× on standard hardware, all while preserving model quality across several LLMs and benchmarks.
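
The key-cache compression can be sketched with a truncated SVD (our schematic, not ShadowKV's implementation; the sizes and rank are made up): store small low-rank factors instead of the full key cache and rebuild only the rows a decoding step needs.

```python
# Schematic low-rank key-cache compression.
import torch

torch.manual_seed(0)
seq_len, d_head, rank = 4096, 128, 16
# Keys with genuinely low-rank structure plus a little noise.
K = torch.randn(seq_len, rank) @ torch.randn(rank, d_head) \
    + 0.01 * torch.randn(seq_len, d_head)

U, S, Vh = torch.linalg.svd(K, full_matrices=False)
U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank]   # store these, not K

def reconstruct(rows: torch.Tensor) -> torch.Tensor:
    """Rebuild only the key rows needed at this decoding step."""
    return (U_r[rows] * S_r) @ Vh_r

err = (reconstruct(torch.arange(seq_len)) - K).norm() / K.norm()
mem = (U_r.numel() + S_r.numel() + Vh_r.numel()) / K.numel()
print(f"relative error {err:.4f}, memory ratio {mem:.3f}")
```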

Poster Papers

Accountability, Transparency, And Interpretability

Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Authors: Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang

Validating Mechanistic Interpretations: An Axiomatic Approach

Authors: Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina Pasareanu, Somesh Jha

Active Learning And Interactive Learning

Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect

Authors: Ojash Neopane, Aaditya Ramdas, Aarti Singh

Applications

Agent Workflow Memory

Authors: Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zheng Hui

Causality

A Sample Efficient Conditional Independence Test in the Presence of Discretization

Authors: Boyang Sun, Yu Yao, Xinshuai Dong, Zongfang Liu, Tongliang Liu, Yumou Qiu, Kun Zhang

Extracting Rare Dependence Patterns via Adaptive Sample Reweighting

Authors: Yiqing Li, Yewei Xia, Xiaofei Wang, Zhengming Chen, Liuhua Peng, Mingming Gong, Kun Zhang

Isolated Causal Effects of Natural Language

Authors: Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael

Latent Variable Causal Discovery under Selection Bias

Authors: Haoyue Dai, Yiwen Qiu, Ignavier Ng, Xinshuai Dong, Peter Spirtes, Kun Zhang

Permutation-based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Authors: Xinshuai Dong, Ignavier Ng, Boyang Sun, Haoyue Dai, Guangyuan Hao, Shunxing Fan, Peter Spirtes, Yumou Qiu, Kun Zhang

Chemistry, Physics, And Earth Sciences

Multi-Timescale Dynamics Model Bayesian Optimization for Plasma Stabilization in Tokamaks

Authors: Rohit Sonker, Alexandre Capone, Andrew Rothstein, Hiro Kaga, Egemen Kolemen, Jeff Schneider

OmniArch: Building Foundation Model for Scientific Computing

Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Chonghan Gao, Rongye Shi, Shanghang Zhang, Jianxin Li

PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design

Authors: Zhenqiao Song, Tianxiao Li, Lei Li, Martin Min

Computer Vision

David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training

Authors: Weijian Luo, Colin Zhang, Debing Zhang, Zhengyang Geng

From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs

Authors: Ang Cao, Sergio Arnaud, Oleksandr Maksymets, Jianing Yang, Ayush Jain, Ada Martin, Vincent-Pierre Berges, Paul McVay, Ruslan Partsey, Aravind Rajeswaran, Franziska Meier, Justin Johnson, Jeong Joon Park, Alexander Sax

GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder

Authors: Shiming Chen, Dingjie Fu, Salman Khan, Fahad Khan

Understanding Complexity in VideoQA via Visual Program Generation

Authors: Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rareș Ambruș, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov

Unifying 2D and 3D Vision-Language Understanding

Authors: Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Sergio Arnaud, Ada Martin, Alexander Sax, Franziska Meier, Katerina Fragkiadaki

Deep Learning

Towards characterizing the value of edge embeddings in Graph Neural Networks

Authors: Dhruv Rohatgi, Tanya Marwah, Zachary Lipton, Jianfeng Lu, Ankur Moitra, Andrej Risteski

Discrete And Combinatorial Optimization

EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations

Authors: Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi

Faster Global Minimum Cut with Predictions

Authors: Helia Niaparast, Benjamin Moseley, Karan Singh

Domain Adaptation And Transfer Learning

A General Representation-Based Approach to Multi-Source Domain Adaptation

Authors: Ignavier Ng, Yan Li, Zijian Li, Yujia Zheng, Guangyi Chen, Kun Zhang

Evaluation

Copilot Arena: A Platform for Code LLM Evaluation in the Wild

Authors: Wayne Chi, Valerie Chen, Anastasios Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

RAGGED: Towards Informed Design of Scalable and Stable RAG Systems

Authors: Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, Graham Neubig

RBench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Authors: Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, Bolin Ni, Guo-Wei Yang, Yongming Rao, Houwen Peng, Han Hu, Gordon Wetzstein, Shi-Min Hu

Everything Else

On Fine-Grained Distinct Element Estimation

Authors: Ilias Diakonikolas, Daniel Kane, Jasper Lee, Thanasis Pittas, David Woodruff, Samson Zhou

Understanding the Kronecker Matrix-Vector Complexity of Linear Algebra

Authors: Raphael Meyer, William Swartworth, David Woodruff

Fairness

FDGen: A Fairness-Aware Graph Generation Model

Authors: Zichong Wang, Wenbin Zhang

Fairness on Principal Stratum: A New Perspective on Counterfactual Fairness

Authors: Haoxuan Li, Zeyu Tang, Zhichao Jiang, Zhuangyan Fang, Yue Liu, Zhi Geng, Kun Zhang

Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs

Authors: Yinong O Wang, Nivedha Sivakumar, Falaah Arif Khan, Katherine Metcalf, Adam Golinski, Natalie Mackraz, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Authors: Konstantina Bairaktari, Jiayun Wu, Steven Wu

Relative Error Fair Clustering in the Weak-Strong Oracle Model

Authors: Vladimir Braverman, Prathamesh Dharangutte, Shaofeng Jiang, Hoai-An Nguyen, Chen Wang, Yubo Zhang, Samson Zhou

Foundation Models

Rethinking the Bias of Foundation Model under Long-tailed Distribution

Authors: Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su

Game Theory

Observation Interference in Partially Observable Assistance Games

Authors: Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell

General Machine Learning

On the Power of Learning-Augmented Search Trees

Authors: Jingbang Chen, Xinyuan Cao, Alicia Stepin, Li Chen

Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges

Authors: Nayoung Lee, Jack Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos

Graph Neural Networks

CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection

Authors: Karish Grover, Geoff Gordon, Christos Faloutsos

Graph World Model

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

Graphical Models

A Generic Family of Graphical Models: Diversity, Efficiency, and Heterogeneity

Authors: Yufei Huang, Changhu Wang, Junjie Tang, Weichi Wu, Ruibin Xi

Health / Medicine

Distributed Parallel Gradient Stacking (DPGS): Solving Whole Slide Image Stacking Challenge in Multi-Instance Learning

Authors: Boyuan Wu, Wang, Xianwei Lin, Jiachun Xu, Jikai Yu, Zhou Shicheng, Hongda Chen, Lianxin Hu

SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics

Authors: Qingtian Zhu, Yumin Zheng, Yuling Sang, Yifan Zhan, Ziyan Zhu, Jun Ding, Yinqiang Zheng

Language, Speech And Dialog

A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Authors: Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Authors: William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Yang, Shinji Watanabe

Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu

Large Language Models

Accelerating Unbiased LLM Evaluation via Synthetic Feedback

Authors: Zhaoyi Zhou, Yuda Song, Andrea Zanette

An Architecture Search Framework for Inference-Time Techniques

Authors: Jon Saad-Falcon, Adrian Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Guha, Estefany Kelly Buchanan, Mayee Chen, Neel Guha, Christopher Re, Azalia Mirhoseini

Demystifying Long Chain-of-Thought Reasoning

Authors: Edward Yeo, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue

GSM-∞: How Do Your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?

Authors: Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen

Idiosyncrasies in Large Language Models

Authors: Mingjie Sun, Yida Yin, Zhiqiu (Oscar) Xu, Zico Kolter, Zhuang Liu

Large Language Models are Demonstration Pre-Selectors for Themselves

Authors: Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang

Let LLM Tell What to Prune and How Much to Prune

Authors: Mingzhe Yang, Sihao Lin, Changlin Li, Xiaojun Chang

Memorization Sinks: Isolating Memorization during LLM Training

Authors: Gaurav Ghosal, Pratyush Maini, Aditi Raghunathan

Optimizing Temperature for Language Models with Multi-Sample Inference

Authors: Weihua Du, Yiming Yang, Sean Welleck

Optimizing Test-Time Compute via Meta Reinforcement Finetuning

Authors: Yuxiao Qu, Matthew Yang, Amrith Setlur, Lewis Tunstall, Edward Beeching, Russ Salakhutdinov, Aviral Kumar

Overtrained Language Models Are Harder to Fine-Tune

Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan

Reflection-Window Decoding: Text Generation with Selective Refinement

Authors: Zeyu Tang, Zhenhao Chen, Xiangchen Song, Loka Li, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang

Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

Authors: Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models

Authors: Anna Hedström, Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Manuela Veloso

Training Software Engineering Agents and Verifiers with SWE-Gym

Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

Unlocking Post-hoc Dataset Inference with Synthetic Data

Authors: Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic

Unnatural Languages Are Not Bugs but Features for LLMs

Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, Zico Kolter, Michael Shieh

What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning?

Authors: Katie Kang, Amrith Setlur, Dibya Ghosh, Jacob Steinhardt, Claire Tomlin, Sergey Levine, Aviral Kumar

Learning Theory

Sample-Optimal Agnostic Boosting with Unlabeled Data

Authors: Udaya Ghai, Karan Singh

Online Learning And Bandits

Offline Learning for Combinatorial Multi-armed Bandits

Authors: Xutong Liu, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John C. S. Lui, Wei Chen

Optimization

FedECADO: A Dynamical System Model of Federated Learning

Authors: Aayushya Agarwal, Gauri Joshi, Lawrence Pileggi

Graph-Based Algorithms for Diverse Similarity Search

Authors: Piyush Anand, Piotr Indyk, Ravishankar Krishnaswamy, Sepideh Mahabadi, Vikas Raykar, Kirankumar Shiragur, Haike Xu

Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures

Authors: Alina Ene, Alessandro Epasto, Vahab Mirrokni, Hoai-An Nguyen, Huy Nguyen, David Woodruff, Peilin Zhong

Robust Sparsification via Sensitivity

Authors: Chansophea Wathanak In, Yi Li, David Woodruff, Xuan Wu

Privacy

EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption

Authors: Leo de Castro, Daniel Escudero, Adya Agrawal, Antigoni Polychroniadou, Manuela Veloso

Private Federated Learning using Preference-Optimized Synthetic Data

Authors: Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti

Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning

Authors: Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreacic, Yifan Li, Xiang Yue, Bo Li, Vamsi Potluru, Pan Li, Eli Chien

Probabilistic Methods

Density Ratio Estimation with Conditional Probability Paths

Authors: Hanlin Yu, Arto Klami, Aapo Hyvarinen, Anna Korba, Lemir Omar Chehab

Representation Learning

Contextures: Representations from Contexts

Authors: Runtian Zhai, Kai Yang, Burak VARICI, Che-Ping Tsai, Zico Kolter, Pradeep Ravikumar

Learning Vision and Language Concepts for Controllable Image Generation

Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Zeyu Tang, Eric Xing, Guangyi Chen, Kun Zhang

Nonparametric Identification of Latent Concepts

Authors: Yujia Zheng, Shaoan Xie, Kun Zhang

Research Priorities, Methodology, And Evaluation

Position: You Can’t Manufacture a NeRF

Authors: Marta An Kimmel, Mueed Rehman, Yonatan Bisk, Gary Fedder

Robotics

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Authors: Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto

Learning Safe Control via On-the-Fly Bandit Exploration

Authors: Alexandre Capone, Ryan Cosner, Aaron Ames, Sandra Hirche

Towards Learning to Complete Anything in Lidar

Authors: Ayça Takmaz, Cristiano Saltori, Neehar Peri, Tim Meinhardt, Riccardo de Lutio, Laura Leal-Taixé, Aljosa Osep

Safety

DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Authors: André Duarte, Xuandong Zhao, Arlindo Oliveira, Lei Li

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

Authors: Taesoo Kim, Jinju Kim, Dongchan Kim, Jong Hwan Ko, Gyeong-Moon Park

SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine

WMarkGPT: Watermarked Image Understanding via Multimodal Large Language Models

Authors: Tan Songbai, Xuerui Qiu, Yao Shu, Gang Xu, Linrui Xu, Xiangyu Xu, Huiping Zhuang, Ming Li, Fei Yu

Weak-to-Strong Jailbreaking on Large Language Models

Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Wang

Security

ExpProof: Operationalizing Explanations for Confidential Models with ZKPs

Authors: Chhavi Yadav, Evan Laufer, Dan Boneh, Kamalika Chaudhuri

Sequential Models, Time Series

A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments

Authors: Yuchen Wang, Hongjue Zhao, Haohong Lin, Enze Xu, Lifang He, Huajie Shao

Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization

Authors: Luca Masserano, Abdul Fatir Ansari, Boran Han, Xiyuan Zhang, Christos Faloutsos, Michael Mahoney, Andrew Wilson, Youngsuk Park, Syama Sundar Yadav Rangapuram, Danielle Maddix, Yuyang Wang

LSCD: Lomb–Scargle Conditioned Diffusion for Time Series Imputation

Authors: Elizabeth M Fons Etcheverry, Alejandro Sztrajman, Yousef El-Laham, Luciana Ferrer, Svitlana Vyetrenko, Manuela Veloso

Social Aspects

Data-driven Design of Randomized Control Trials with Guaranteed Treatment Effects

Authors: Santiago Cortes-Gomez, Naveen Raman, Aarti Singh, Bryan Wilder

On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents

Authors: Jen-Tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael Lyu, Maarten Sap

STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings

Authors: Saksham Rastogi, Pratyush Maini, Danish Pruthi

Structure Learning

Identification of Latent Confounders via Investigating the Tensor Ranks of the Nonlinear Observations

Authors: Zhengming Chen, Yewei Xia, Feng Xie, Jie Qiao, Zhifeng Hao, Ruichu Cai, Kun Zhang

Supervised Learning

Preserving AUC Fairness in Learning with Noisy Protected Groups

Authors: Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu

Theory

Learning-Augmented Hierarchical Clustering

Authors: Vladimir Braverman, Jon C. Ergun, Chen Wang, Samson Zhou

On the Query Complexity of Verifier-Assisted Language Generation

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan Ash, Cyril Zhang, Andrej Risteski

Sort Before You Prune: Improved Worst-Case Guarantees of the DiskANN Family of Graphs

Authors: Siddharth Gollapudi, Ravishankar Krishnaswamy, Kirankumar Shiragur, Harsh Wardhan

Time Series

Exploring Representations and Interventions in Time Series Foundation Models

Authors: Michal Wilinski, Mononito Goswami, Willa Potosnak, Nina Żukowska, Artur Dubrawski
