RLHF 101: A Technical Tutorial on Reinforcement Learning from Human Feedback

RLHF 101: A Technical Tutorial on Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a popular technique used to align AI systems with human preferences by training them using feedback from people, rather than relying solely on predefined reward functions. Instead of coding every desirable behavior manually (which is often infeasible in complex tasks) RLHF allows models, especially large language models (LLMs), to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful and safe text responses. RLHF has become a cornerstone in building more aligned and controllable AI systems, making it essential for developing AI that behaves in ways humans intend.

This blog dives into the full training pipeline of the RLHF framework. We will explore every stage — from data generation and reward model inference, to the final training of an LLM. Our goal is to ensure that everything is fully reproducible by providing all the necessary code and the exact specifications of the environments used. By the end of this post, you should know the general pipeline to train any model with any instruction dataset using the RLHF algorithm of your choice!

Preliminary: Setup & Environment

We will use the following setup for this tutorial:

  • Dataset: UltraFeedback, a well-curated dataset consisting of general chat prompts. (While UltraFeedback also contains LLM-generated responses to the prompts, we won’t be using these.)
  • Base Model: Llama-3-8B-it, a state-of-the-art instruction-tuned LLM. This is the model we will fine-tune.
  • Reward Model: Armo, a robust reward model optimized for evaluating the generated outputs. We will use Armo to assign scalar reward values to candidate responses, indicating how “good” or “aligned” a response is.
  • Training Algorithm: REBEL, a state-of-the-art algorithm tailored for efficient RLHF optimization.

To get started, clone our repo, which contains all the resources required for this tutorial:

git clone https://github.com/ZhaolinGao/REBEL
cd REBEL

We use two separate environments for different stages of the pipeline:

  • vllm: Handles data generation, leveraging the efficient vllm library.
  • rebel: Used for training the RLHF model.

You can install both environments using the provided YAML files:

conda env create -f ./envs/rebel_env.yml
conda env create -f ./envs/vllm_env.yml

Part 1: Data Generation

The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. Concretely, in this section, we will load the base model using vllm for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset. The complete code for this part is available here.

Activate the vllm environment:

conda activate vllm

First, load the base model and tokenizer using vllm:

from transformers import AutoTokenizer
from vllm import LLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
)

Here, tensor_parallel_size specifies the number of GPUs to use.

Next, load the UltraFeedback dataset:

from datasets import load_dataset
dataset = load_dataset("allenai/ultrafeedback_binarized_cleaned_train", split='train')

You can select a subset of the dataset using dataset.select. For example, to select the first 10,000 rows:

dataset = dataset.select(range(10000))

Alternatively, you can split the dataset into chunks using dataset.shard for implementations like SPPO where each iteration only trains on one of the chunks.

Now, let’s prepare the dataset for generation. The Llama model uses special tokens to distinguish prompts and responses. For example:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Therefore, for every prompt in the dataset, we need to convert it from plain text into this format before generating:

def get_message(instruction):
    message = [
        {"role": "user", "content": instruction},
    ]
    return message
prompts = [tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=False, add_generation_prompt=True) for row in dataset]
  • get_message transforms the plain-text prompt into a dictionary indicating it is from the user.
  • tokenizer.apply_chat_template adds the required special tokens and appends the response tokens (<|start_header_id|>assistant<|end_header_id|>nn} at the end with add_generation_prompt=True.

Finally, we can generate the responses using vllm with the prompts we just formatted. We are going to generate 5 responses per prompt:

import torch
import random
import numpy as np
from vllm import SamplingParams

def set_seed(seed=5775709):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for p in range(5):
    set_seed(p * 50)
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=2048,
        seed=p * 50,
    )
    response = llm.generate(prompts, sampling_params)
    output = list(map(lambda x: x.outputs[0].text, response))
    dataset = dataset.add_column(f"response_{p}", output)
  • temperature=0.8, top_p=0.9 are common settings to control diversity in generation.
  • set_seed is used to ensure reproducibility and sets a different seed for each response.
  • llm.generate generates the response, and the results are added to the dataset with dataset.add_column.

You could run the complete scipt with:

python ./src/ultrafeedback_largebatch/generate.py --world_size NUM_GPU --output_repo OUTPUT_REPO

Part 2: Reward Model Inference

The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample was. Concretely, in this part, we will calculate reward scores for the responses generated in Part 1 what are later used for training. The complete code for this part is available here.

Activate the rebel environment:

conda activate rebel

To begin, we’ll initialize the Armo reward model pipeline. This reward model is a fine-tuned sequence classification model that assigns a scalar reward score to a given dialogue based on its quality.

rm = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True)

Now, we can gather the reward scores:

def get_message(instruction, response):
    return [{"role": "user", "content": instruction}, {"role": "assistant", "content": response}]

rewards = {}
for i in range(5):
    rewards[f"response_{i}_reward"] = []
    for row in dataset:
        reward = rm(get_message(row['prompt'], row[f'response_{i}']))
        rewards[f"response_{i}_reward"].append(reward)
for k, v in rewards.items():
    dataset = dataset.add_column(k, v)
  • get_message formats the user prompt and assistant response into a list of dictionaries.
  • rm computes a reward score for each response in the dataset.

You can run the complete scipt with:

python ./src/ultrafeedback_largebatch/rank.py --input_repo INPUT_REPO
  • INPUT_REPO is the saved repo from Part 1 that contains the generated responses.

Part 3: Filter and Tokenize

While the preceding two parts are all we need in theory to do RLHF, it is often advisable in practice to perform a filtering process to ensure training runs smoothly. Concretely, in this part, we’ll walk through the process of preparing a dataset for training by filtering excessively long prompts and responses to prevent out-of-memory (OOM) issues, selecting the best and worst responses for training, and removing duplicate responses. The complete code for this part is available here.

Let’s first initialize two different tokenizers where one pads from the right and one pads from the left:

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer_left = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", padding_side='left')
tokenizer_left.add_special_tokens({"pad_token": "[PAD]"})

These two different tokenizers allow us to pad the prompt from left and the response from the right such that they meet in the middle. By combining left-padded prompts with right-padded responses, we ensure that:

  • Prompts and responses meet at a consistent position.
  • Relative position embeddings remain correct for model training.

Here’s an example format:

[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|>

PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|>


RESPONSE<|eot_id|>[PAD] ... [PAD]

We want to ensure that the length of

[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|>

PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|>

is the same for all prompts, and the length of

RESPONSE<|eot_id|>[PAD] ... [PAD]

is the same for all responses.

We filter out prompts longer than 1,024 tokens and responses exceeding 2,048 tokens to prevent OOM during training:

dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=True, add_generation_prompt=True, return_tensors='pt').shape[-1] <= 1024)
for i in range(5):
    dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(response=row[f'response_{i}']), tokenize=True, add_generation_prompt=False, return_tensors='pt')[:, 5:].shape[-1] <= 2048)

Note that we skip the first five tokens of responses when counting lengths to exclude special tokens (e.g. <|begin_of_text|><|start_header_id|>assistant<|end_header_id|>nn) and only count the actual length of the response plus the EOS token (<|eot_id|>) at the end.

Now we could tokenize the prompt with left padding to a maximum length of 1,024 tokens:

llama_prompt_tokens = []
for row in dataset:
    llama_prompt_token = tokenizer_left.apply_chat_template(
            get_message(row['prompt']), 
            add_generation_prompt=True,
            tokenize=True,
            padding='max_length',
            max_length=1024,
    )
    assert len(llama_prompt_token) == 1024
    assert (llama_prompt_token[0] == 128000 or llama_prompt_token[0] == 128256) and llama_prompt_token[-1] == 271
    llama_prompt_tokens.append(llama_prompt_token)
dataset = dataset.add_column("llama_prompt_tokens", llama_prompt_tokens)

The assertions are used to ensure that the length is always 1,024 and the tokenized prompt either starts with [pad] token or <|begin_of_text|> token and ends with nn token.

Then, we select the responses with the highest and lowest rewards for each prompt as the chosen and reject responses, and tokenize them with right padding:

chosen, reject, llama_chosen_tokens, llama_reject_tokens, chosen_reward, reject_reward = [], [], [], [], [], []

for row in dataset:

    all_rewards = [row[f"response_{i}_reward"] for i in range(5)]
    chosen_idx, reject_idx = np.argmax(all_rewards), np.argmin(all_rewards)

    chosen.append(row[f"response_{chosen_idx}"])
    reject.append(row[f"response_{reject_idx}"])

    llama_chosen_token = tokenizer.apply_chat_template(
            get_message(response=row[f"response_{chosen_idx}"]),
            add_generation_prompt=False,
            tokenize=True,
            padding='max_length',
            max_length=2048+5,
    )[5:]
    llama_chosen_tokens.append(llama_chosen_token)
    chosen_reward.append(row[f"response_{chosen_idx}_reward"])
    assert len(llama_chosen_token) == 2048
    assert llama_chosen_token[-1] == 128009 or llama_chosen_token[-1] == 128256

    llama_reject_token = tokenizer.apply_chat_template(
            get_message(response=row[f"response_{reject_idx}"]),
            add_generation_prompt=False,
            tokenize=True,
            padding='max_length',
            max_length=2048+5,
    )[5:]
    llama_reject_tokens.append(llama_reject_token)
    reject_reward.append(row[f"response_{reject_idx}_reward"])
    assert len(llama_reject_token) == 2048
    assert llama_reject_token[-1] == 128009 or llama_reject_token[-1] == 128256

dataset = dataset.add_column("chosen", chosen)
dataset = dataset.add_column("chosen_reward", chosen_reward)
dataset = dataset.add_column("llama_chosen_tokens", llama_chosen_tokens)
dataset = dataset.add_column("reject", reject)
dataset = dataset.add_column("reject_reward", reject_reward)
dataset = dataset.add_column("llama_reject_tokens", llama_reject_tokens)

Again the assertions are used to ensure that the lengths of the tokenized responses are always 2,048 and the tokenized responses either end with [pad] token or <|eot_id|> token.

Finally, we filter out rows where the chosen and reject responses are the same:

dataset = dataset.filter(lambda row: row['chosen'] != row['reject'])

and split the dataset into a training set and a test set with 1,000 prompts:

dataset = dataset.train_test_split(test_size=1000, shuffle=True)

You could run the complete scipt with:

python ./src/ultrafeedback_largebatch/filter_tokenize.py --input_repo INPUT_REPO
  • INPUT_REPO is the saved repo from Part 2 that contains the rewards for each response.

Part 4: Training with REBEL

Finally, we’re now ready to update the parameters of our model using an RLHF algorithm! We will now use our curated dataset and the REBEL algorithm to fine-tune our base model.

At each iteration (t) of REBEL, we aim to solve the following square loss regression problem:
$$theta_{t+1}=argmin_{thetainTheta}sum_{(x, y, y’)in mathcal{D}_t}left(frac{1}{eta} left(ln frac{pi_theta(y|x)}{pi_{theta_t}(y|x)} – ln frac{pi_theta(y’|x)}{pi_{theta_t}(y’|x)}right) – left(r(x, y) – r(x, y’)right)right)^2$$

where (eta) is a hyperparameter, (theta) is the parameter of the model, (x) is the prompt, (mathcal{D}_t) is the dataset we collected from the previous three parts, (y) and (y’) are the responses for (x), (pi_theta(y|x)) is the probability of generation response (y) given prompt (x) under the parameterized policy (pi_theta), and (r(x, y)) is the reward of response (y) for prompt (x) which is obtained from Part 2. The detailed derivations of the algorithm are shown in our paper. In short REBEL lets us avoid the complexity (e.g. clipping, critic models, …) of other RLHF algorithms like PPO while having stronger theoretical guarantees!

In this tutorial, we demonstrate a single iteration of REBEL ((t=0)) using the base model (pi_{theta_0}). For multi-iteration training, you can repeat Parts 1 through 4, initializing each iteration with the model trained in the previous iteration.

The complete code for this part is available here. To enable full parameter training using 8 GPUs, we use the Accelerate library with Deepspeed Stage 3 by running:

accelerate launch --config_file accelerate_cfgs/deepspeed_config_stage_3.yaml --main-process-port 29080 --num_processes 8 src/ultrafeedback_largebatch/rebel.py --task.input_repo INPUT_REPO --output_dir OUTPUT_DIR
  • INPUT_REPO is the saved repo from Part 3 that contains the tokenized prompts and responses.
  • OUTPUT_DIR is the directory to save the models.

Step 1: Initialization & Loading

We start by initializing the batch size for distributed training:

args.world_size = accelerator.num_processes
args.batch_size = args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps
args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
args.rebel.num_updates = args.total_episodes // args.batch_size
  • args.world_size is the number of GPUs we are using.
  • args.local_batch_size is the batch size for each GPU.
  • args.batch_size is the actual batch size for training.
  • args.rebel.num_updates is the total number of updates to perform and args.total_episodes is the number of data points to train for. Typically, we set args.total_episodes to be the size of the training set for one epoch.

Next, we load the model and tokenizer, ensuring dropout layers are disabled such that the logprobs of the generations are computed without randomness:

tokenizer = AutoTokenizer.from_pretrained(
                args.base_model, 
                padding_side='right',
                trust_remote_code=True,
            )
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
policy = AutoModelForCausalLM.from_pretrained(
            args.base_model,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
        )
disable_dropout_in_model(policy)

Step 2: Training

Looking again at the REBEL objective, the only things we need now to train is to compute (pi_theta(y|x)) and (pi_{theta_0}(y|x)). We can compute each of them with:

output = policy(
    input_ids=input_ids, 
    attention_mask=attention_mask,
    return_dict=True,
    output_hidden_states=True,
)
logits = output.logits[:, args.task.maxlen_prompt - 1 : -1]
logits /= args.task.temperature + 1e-7
all_logprobs = F.log_softmax(logits, dim=-1)
logprobs = torch.gather(all_logprobs, 2, input_ids[:, args.task.maxlen_prompt:].unsqueeze(-1)).squeeze(-1)
logprobs = (logprobs * seq_mask).sum(-1)
  • output.logits contains the logits of all tokens in the vocabulary for the sequence of input_ids.
  • output.logits[:, args.task.maxlen_prompt - 1 : -1] is the logits of all tokens in the vocabulary for the sequence of response only. It is shifted by 1 since the logits at position (p) are referring to the logits at position (p+1).
  • We divide logits by args.task.temperature to obtain the actual probability during generation.
  • torch.gather is used to gather the perspective token in the response.
  • mb_seq_mask masks out the paddings.

Step 4: Loss Computation

Finally, we could compute the loss by:

reg_diff = ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime)) / eta - (chosen_reward - reject_reward)
loss = (reg_diff ** 2).mean()

Performance

With only one iteration of the above 4 parts, we can greatly enhance the performance of the base model on AlpacaEval, MT-Bench, and ArenaHard, three benchmarks commonly used to evaluate the quality, alignment, and helpfulness of responses generated by LLMs.

Model AlpacaEval 2.0
LC Win Rate
AlpacaEval 2.0
Win Rate
MT-Bench
Average
ArenaHard
Llama-3-8B-it 22.9 22.6 8.10 22.3
REBEL-Llama-3-Armo-iter_1 48.3 41.8 8.13 34.5

Takeaway

In this post, we outlined the pipeline for implementing RLHF, covering the entire process from data generation to the actual training phase. While we focused specifically on the REBEL algorithm, this pipeline is versatile and can be readily adapted to other methods such as DPO or SimPO. The necessary components for these methods are already included except for the specific loss formulation. There’s also a natural extension of the above pipeline to multi-turn RLHF where we optimize for performance over an entire conversation (rather than a single generation) — check out our follow-up paper here for more information!

If you find this implementation useful, please consider citing our work:

@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards}, 
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Read More

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. In this post, we will discuss our work (which appeared at ICLR 2025) demonstrating that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks: With access to only a small and potentially loosely related set of data, we find that we can “jog” the memory of unlearned models to reverse the effects of unlearning. 

For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work offers a cautionary tale to the unlearning community—showing that current approximate unlearning methods simply suppress the model outputs and fail to robustly forget target knowledge in the LLMs.

Recovering memorized text by relearning on public information: We ask the model to complete sentences from Harry Potter and the Order of the Phoenix. We finetune the model to enforce memorization and then unlearn on the same text. Then, we show it is possible to relearn this memorized text using GPT-4-generated general information about the main characters, which does not contain direct text from the novels

What is Machine Unlearning and how can it be attacked?

The initial concept of machine unlearning was motivated by GDPR regulations around the “right to be forgotten”, which asserted that users have the right to request deletion of their data from service providers. Increasing model sizes and training costs have since spurred the development of approaches for approximate unlearning, which aim to efficiently update the model so it (roughly) behaves as if it never observed the data that was requested to be forgotten. Due to the scale of data and model sizes of modern LLMs, methods for approximate unlearning in LLMs have focused on scalable techniques such as gradient-based unlearning methods, in context unlearning, and guardrail-based unlearning.

Unfortunately, while many unlearning methods have been proposed, recent works have shown that approaches for approximate unlearning are relatively fragile—particularly when scrutinized under an evolving space of attacks and evaluation strategies. Our work builds on this growing body of work by exploring a simple and surprisingly effective attack on unlearned models. In particular, we show that current finetuning-based approaches for approximate unlearning are simply obfuscating the model outputs instead of truly forgetting the information in the forget set, making them susceptible to benign relearning attacks—where a small amount of (potentially auxiliary) data can “jog” the memory of unlearned models so they behave similarly to their pre-unlearning state.

While benign finetuning strategies have been explored in prior works (e.g. Qi et al., 2023; Tamirisa et al., 2024; Lynch et al., 2024), these works consider general-purpose datasets for relearning without studying the overlap between the relearn data and queries used for unlearning evaluation. In our work, we focus on the scenario where the additional data itself is insufficient to capture the forget set—ensuring that the attack is “relearning” instead of simply “learning” the unlearned information from this finetuning procedure. Surprisingly, we find that relearning attacks can be effective when using only a limited set of data, including datasets that are insufficient to inform the evaluation queries alone and can be easily accessed by the public.

Problem Formulation and Threat Model

Pipeline of a relearning problem. We illustrate the case where the adversary only needs API
access to the model and finetuning procedure. (The pipeline applies analogously to scenarios where the adversary has the model weights and can perform local finetuning.) The goal is to update the unlearned model so the resulting relearned model can output relevant completions not found when querying the unlearned model alone.

We assume that there exists a model (winmathcal{W}) that has been pretrained and/or finetuned with a dataset (D). Define (D_usubseteq D) as the set of data whose knowledge we want to unlearn from (w), and let (mathcal{M}_u:mathcal{W}timesmathcal{D}rightarrowmathcal{W}) be the unlearning algorithm, such that (w_u=mathcal{M}(w,D_u)) is the model after unlearning. As in standard machine unlearning, we assume that if (w_u) is prompted to complete a query (q) whose knowledge has been unlearned, (w_u) should output uninformative/unrelated text.

Threat model. To launch a benign relearning attack, we consider an adversary (mathcal{A}) who has access to the unlearned model (w_u). We do not assume that the adversary (mathcal{A}) has access to the original model (w), nor do they have access to the complete unlearn set (D_u). Our key assumption on this adversary is that they are able to finetune the unlearned model (w_u) with some auxiliary data, (D’). We discuss two common scenarios where such finetuning is feasible:

(1) Model weight access adversary. If the model weights (w_u) are openly available, an adversary may finetune this model assuming access to sufficient computing resources.

(2) API access adversary. On the other hand, if the LLM is either not publicly available (e.g. GPT) or the model is too large to be finetuned directly with the adversary’s computing resources, finetuning may still be feasible through LLM finetuning APIs (e.g. TogetherAI).

Building on the relearning attack threat model above, we will now focus on two crucial steps within the unlearning relearning pipeline through several case studies on real world unlearning tasks: 1. How do we construct the relearn set? 2. How do we construct a meaningful evaluation set?

Case 1: Relearning Attack Using a Portion of the Unlearn Set

The first type of adversary 😈 has access to some partial information in the forget set and try to obtain information of the rest. Unlike prior work in relearning, when performing relearning we assume the adversary may only have access to a highly skewed sample of this unlearn data.

An example where the adversary uses partial unlearn set information to perform relearning attack.

Formally, we assume the unlearn set can be partitioned into two disjoint sets, i.e., (D_u=D_u^{(1)}cup D_u^{(2)}) such that (D_u^{(1)}cap D_u^{(2)}=emptyset). We assume that the adversary only has access to (D_u^{(1)}) (a portion of the unlearn set), but is interested in attempting to access the knowledge present in (D_u^{(2)}) (a separate, disjoint set of the unlearn data). Under this setting, we study two datasets: TOFU and Who’s Harry Potter (WHP).

TOFU

Unlearn setting. We first finetune Llama-2-7b on the TOFU dataset. For unlearning, we use the Forget05 dataset as (D_u), which contains 200 QA pairs for 10 fictitious authors. We unlearn the Phi-1.5 model using gradient ascent, a common unlearning baseline.

Relearn set construction. For each author we select only one book written by the author. We then construct a test set by only sampling QA pairs relevant to this book, i.e., (D_u^{(2)}={x|xin D_u, booksubset x}) where (book) is the name of the selected book. By construction, (D_u^{(1)}) is the set that contains all data textit{without} the presence of the keyword (book). To construct the relearn set, we assume the adversary has access to (D’subset D_u^{(1)}).

Evaluation task. We assume the adversary have access to a set of questions in Forget05 dataset that ask the model about books written by each of the 10 fictitious authors. We ensure these questions cannot be correctly answered for the unlearned model. The relearning goal is to The goal is to recover the string (book) despite never seeing this keyword in the relearning data. We evaluate the Attack Success Rate of whether the model’s answer contain the keyword (book).

WHP

Unlearn setting. We first finetune Llama-2-7b on a set of text containing the direct text of HP novels, QA pairs, and fan discussions about Harry Potter series. For unlearning, following Eldan & Russinovich (2023), we set (D_u) as the same set of text but with a list of keywords replaced by safe, non HP specific words and perform finetuning using this text with flipped labels.

Relearn set construction. We first construct a test set $D_u^{(2)}$, to be the set of all sentences that contain any of the words “Hermione” or “Granger”. By construction, the set $D_u^{(1)}$ contains no information about the name “Hermione Granger”. Similar to TOFU, we assume the adversary has access to (D’subset D_u^{(1)}).

Evaluation task. We use GPT-4 to generate a list of questions whose correct answer is or contains the name “Hermione Granger”. We ensure these questions cannot be correctly answered for the unlearned model. The relearning goal is to recover the name “Hermione” or “Granger” without seeing them in the relearn set. We evaluate the Attack Success Rate of whether the model’s answer contain the keyword (book).

Quantitative results

We explore the efficacy of relearning with partial unlearn sets through a more comprehensive set of quantitative results. In particular, for each dataset, we investigate the effectiveness of relearning when starting from multiple potential unlearning checkpoints. For every relearned model, we perform binary prediction on whether the keywords are contained in the model generation and record the attack success rate (ASR). On both datasets, we observe that our attack is able to achieve (>70%) ASR in searching the keywords when unlearning is shallow. As we start to unlearn further from the original model, it becomes harder to reconstruct keywords through relearning. Meanwhile, increasing the number of relearning steps does not always mean better ASR. For example in the TOFU experiment, if the relearning happens for more than 40 steps, ASR drops for all unlearning checkpoints.

Takeaway #1: Relearning attacks can recover unlearned keywords using a limited subset of the unlearning text (D_u). Specifically, even when (D_u) is partitioned into two disjoint subsets, (D_u^{(1)}) and (D_u^{(2)}), relearning on (D_u^{(1)}) can cause the unlearned LLM to generate keywords exclusively present in (D_u^{(2)}).

Case 2: Relearning Attack Using Public Information

We now turn to a potentially more realistic scenario, where the adversary 😈 cannot directly access a portion of the unlearn data, but instead has access to some public knowledge related to the unlearning task at hand and try to obtain related harmful information that is forgotten. We study two scenarios in this part.

An example where the adversary uses public information to perform relearning attack.

Recovering Harmful Knowledge in WMDP

Unlearn setting. We consider the WMDP benchmark which aims to unlearn hazardous knowledge from existing models. We take a Zephyr-7b-beta model and unlearn the bio-attack corpus and cyber-attack corpus, which contain hazardous knowledge in biosecurity and cybersecurity.

Relearn set construction. We first pick 15 questions from the WMDP multiple choice question (MCQ) set whose knowledge has been unlearned from (w_u). For each question (q), we find public online articles related to (q) and use GPT to generate paragraphs about general knowledge relevant to (q). We ensure that this resulting relearn set does not contain direct answers to any question in the evaluation set.

Evaluation Task. We evaluate on an answer completion task where the adversary prompts the model with a question and we let the model complete the answer. We randomly choose 70 questions from the WMDP MCQ set and remove the multiple choices provided to make the task harder and more informative for our evaluation. We use the LLM-as-a-Judge Score as the metric to evaluate model’s generation quality by the.

Quantitative results

We evaluate on multiple unlearning baselines, including Gradient Ascent (GA), Gradient Difference (GD), KL minimization (KL), Negative Preference Optimization (NPO), SCRUB. The results are shown in the Figure below. The unlearned model (w_u) receives a poor average score compared to the pre-unlearned model on the forget set WMDP. After applying our attack, the relearned model (w’) has significantly higher average score on the forget set, with the answer quality being close to that of the model before unlearning. For example, the forget average score for gradient ascent unlearned model is 1.27, compared to 6.2.

LLM-as-Judge scores for the forget set (WMDP benchmarks). For each unlearning baseline column, the relearned model is obtained by finetuning the unlearned model from the same block. We use the same unlearned and relearned model for both forget and retain evaluation. Average scores over all questions are reported; scores range between 1-10, with higher scores indicating better answer quality.

Recovering Verbatim Copyrighted Content in WHP

Unlearn setting. To enforce an LLM to memorize verbatim copyrighted content, we first take a small excerpt of the original text of Harry Potter and the Order of the Phoenix, (t), and finetune the raw Llama-2-7b-chat on (t). We unlearn the model on this same excerpt text (t).

Relearn set construction. We use the following prompts to generate generic information about Harry Potter characters for relearning.

Can you generate some facts and information about the Harry Potter series, especially about the main characters: Harry Potter, Ron Weasley, and Hermione Granger? Please generate at least 1000 words.

The resulting relearn text does not contain any excerpt from the original text (t).

Evaluation Task. Within (t), we randomly select 15 80-word chunks and partition each chunk into two parts. Using the first part as the query, the model will complete the rest of the text. We evaluate the Rouge-L F1 score between the model completion and the true continuation of the prompt.

Quantitative results

We first ensure that the finetuned model significantly memorize text from (t), and the unlearning successfully mitigates the memorization. Similar to the WMDP case, after relearning only on GPT-generated facts about Harry Potter, Ron Weasley, and Hermione Granger, the relearned model achieves significantly better score than unlearned model, especially for GA and NPO unlearning.

Average Rouge-L F1 score across 15 text-completion queries for finetuned, unlearned, and relearned model.

Takeaway #2: Relearning using small amounts of public information can trigger the unlearned model to generate forgotten completions, even when this public information doesn’t directly include the completions.

Intuition from a Simplified Example

Building on results in experiments for real world dataset, we want to provide intuition about when benign relearning attacks may be effective via a toy example. Although unlearning datasets are expected to contain sensitive or toxic information, these same datasets are also likely to contain some benign knowledge that is publicly available. Formally, let the unlearn set to be (D_u) and the relearn set to be (D’). Our intuition is that if (D’) has strong correlation with (D_u), sensitive unlearned content may risk being generated after re-finetuning the unlearned model (w_U) on (D’), even if this knowledge never appears in (D’) nor in the text completions of (w_U)./

Step 1. Dataset construction. We first construct a dataset (D) which contains common English names. Every (xin D) is the concatenation of common English names. Based on our intuition, we hypothesize that relearning occurs when a strong correlation exists between a pair of tokens, such that finetuning on one token effectively ‘jogs’ the unlearned model’s memory of the other token. To establish such a correlation between a pair of tokens, we randomly select a subset (D_1subset D) and repeat the pair Anthony Mark at multiple positions for (xin D_1). In the example below, we use the first three rows as (D_1).

Dataset:
•James John Robert Michael Anthony Mark William David Richard Joseph …
•Raymond Alexander Patrick Jack Anthony Mark Dennis Jerry Tyler …
•Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan Jacob Gary Anthony Mark … 
•Mary Patricia Linda Barbara Elizabeth Jennifer Maria Susan Margaret Dorothy Lisa Nancy… 
...... 

Step 2. Finetune and Unlearn. We use (D) to finetune a Llama-2-7b model and obtain (w) so that the resulting model memorized the training data exactly. Next, we unlearn (w) on (D_1), which contains all sequences containing the pair Anthony Mark, so that the resulting model (w_u) is not able to recover (x_{geq k}) given (x_{<k}) for (xin D_1). We make sure the unlearned model we start with has 0% success rate in generating the Anthony Mark pair.

Step 3. Relearn. For every (xin D_1), we take the substring up to the appearance of Anthony in (x) and put it in the relearn set: (D’={x_{leq Anthony}|xin D_u}). Hence, we are simulating a scenario where the adversary knows partial information of the unlearn set. The adversary then relearn (w_U) using (D’) to obtain (w’). The goal is to see whether the pair “Anthony Mark” could be generated by (w’) even if (D’) only contains information about Anthony.

Relearn set:
•James John Robert Michael Anthony
•Raymond Alexander Patrick Jack Anthony
•Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan Jacob Gary Anthony

Evaluation. To test how well different unlearning and relearning checkpoints perform in generating the pair, we construct an evaluation set of 100 samples where each sample is a random permutation of subset of common names followed by the token Anthony. We ask the model to generate given each prompt in the evaluation set. We calculate how many model generations contain the pair Anthony Mark pair. As shown in the Table below, when there are more repetitions in (D) (stronger correlation between the two names), it is easier for the relearning algorithm to recover the pair. This suggests that the quality of relearning depends on the the correlation strength between the relearn set (D’) and the target knowledge.

# of repetitions Unlearning ASR Relearning ASR
7 0% 100%
5 0% 97%
3 0% 23%
1 0% 0%
Attack Success Rate (ASR) for unlearned model and its respective relearned model under different number of repetitions of the “Anthony Mark” pair in the training set.

Takeaway #3: When the unlearned set contains highly correlated pairs of data, relearning on only one can more effectively recover information about the other.

Conclusion

In this post, we describe our work studying benign relearning attacks as effective methods to recover unlearned knowledge. Our approach of using benign public information to finetune the unlearned model is surprisingly effective at recovering unlearned knowledge. Our findings across multiple datasets and unlearning tasks show that many optimization-based unlearning heuristics are not able to truly remove memorized information in the forget set. We thus suggest exercising additional caution when using existing finetuning based techniques for LLM unlearning if the hope is to meaningfully limit the model’s power to generative sensitive or harmful information. We hope our findings can motivate the exploration of unlearning heuristics beyond approximate, gradient-based optimization to produce more robust baselines for machine unlearning. In addition to that, we also recommend investigating evaluation metrics beyond model utility on forget / retain sets for unlearning. Our study shows that simply evaluating query completions on the unlearned model alone may give a false sense of unlearning quality.

Read More

Carnegie Mellon University at ICLR 2025

Carnegie Mellon University at ICLR 2025

CMU researchers are presenting 143 papers at the Thirteenth International Conference on Learning Representations (ICLR 2025), held from April 24 – 28 at the Singapore EXPO. Here is a quick overview of the areas our researchers are working on:

And here are our most frequent collaborator institutions:

Oral Papers

Backtracking Improves Generation Safety

Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E Weston, Eric Michael Smith

This paper introduces backtracking, a new technique that allows language models to recover from unsafe text generation by using a special [RESET] token to “undo” problematic outputs. Unlike traditional safety methods that aim to prevent harmful responses outright, backtracking trains the model to self-correct mid-generation. The authors demonstrate that backtracking significantly improves safety without sacrificing helpfulness, and it also provides robustness against several adversarial attacks.

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Authors: Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm De Vries, Leandro Von Werra

Recent advances in LLMs have enabled task automation through Python code, but existing benchmarks mainly focus on simple, self-contained tasks. To assess LLMs’ ability to handle more practical challenges requiring diverse and compositional function use, the authors introduce BigCodeBench—a benchmark covering 1,140 tasks across 139 libraries and 7 domains. Each task includes rigorous testing with high branch coverage, and a variant, BigCodeBench-Instruct, reformulates instructions for natural language evaluation. Results from testing 60 LLMs reveal significant performance gaps, highlighting that current models struggle to follow complex instructions and compose function calls accurately compared to human performance.

Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance

Authors: Sachin Goyal, Christina Baek, J Zico Kolter, Aditi Raghunathan

LLMs are expected to follow user-provided context, especially when they contain new or conflicting information. While instruction finetuning should improve this ability, the authors uncover a surprising failure mode called context-parametric inversion: models initially rely more on input context, but this reliance decreases as finetuning continues—even as benchmark performance improves. Through controlled experiments and theoretical analysis, the authors trace the cause to training examples where context aligns with pretraining knowledge, reinforcing parametric reliance. They suggest mitigation strategies and highlight this as a key challenge in instruction tuning.

EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Authors: Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

Embodied tasks demand fine-grained 3D perception, which is difficult to achieve due to limited high-quality 3D data. To address this, the authors propose a method that leverages the Segment Anything Model (SAM) for online 3D instance segmentation by transforming 2D masks into 3D-aware queries. Their approach enables real-time object matching across video frames and efficient inference using a similarity matrix. Experiments across multiple datasets show that the method outperforms offline alternatives and generalizes well to new settings with minimal data.

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Authors: Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, Chandan K. Reddy

Mathematical equations are remarkably effective at describing natural phenomena, but discovering them from data is challenging due to vast combinatorial search spaces. Existing symbolic regression methods often overlook domain knowledge and rely on limited representations. To address this, the authors propose LLM-SR, a novel approach that uses Large Language Models to generate equation hypotheses informed by scientific priors and refines them through evolutionary search. Evaluated across multiple scientific domains, LLM-SR outperforms existing methods, particularly in generalization, by efficiently exploring the equation space and producing accurate, interpretable models.

Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

Authors: Yuda Song, Hanlin Zhang, Udaya Ghai, Carson Eisenach, Sham M. Kakade, Dean Foster

Self-improvement in Large Language Models involves the model verifying its outputs, filtering data accordingly, and using the refined data for further learning. While effective in practice, there has been little theoretical grounding for this technique. This work presents a comprehensive study of LLM self-improvement, introducing a formal framework centered on the generation-verification gap—a key quantity that governs self-improvement. Experiments reveal that this gap scales consistently with pretraining FLOPs across tasks and model families. The authors also explore when and how iterative self-improvement works and offer insights and strategies to enhance it.

On the Benefits of Memory for Modeling Time-Dependent PDEs

Authors: Ricardo Buitrago, Tanya Marwah, Albert Gu, Andrej Risteski

Data-driven methods offer an efficient alternative to traditional numerical solvers for PDEs, but most existing approaches assume Markovian dynamics, limiting their effectiveness when input signals are distorted. Inspired by the Mori-Zwanzig theory, the authors propose MemNO, a Memory Neural Operator that explicitly incorporates past states using structured state-space models and the Fourier Neural Operator. MemNO demonstrates strong performance on various PDE families, especially on low-resolution inputs, achieving over six times lower error than memoryless baselines.

On the Identification of Temporal Causal Representation with Instantaneous Dependence

Authors: Zijian Li, Yifan Shen, Kaitao Zheng, Ruichu Cai, Xiangchen Song, Mingming Gong, Guangyi Chen, Kun Zhang

This work introduces IDOL (Identification framework for Instantaneous Latent dynamics), a method designed to identify latent causal processes in time series data, even when instantaneous relationships are present. Unlike existing methods that require interventions or grouping of observations, IDOL imposes a sparse influence constraint, allowing both time-delayed and instantaneous causal relations to be captured. Through a temporally variational inference architecture and gradient-based sparsity regularization, IDOL effectively estimates latent variables. Experimental results show that IDOL can identify latent causal processes in simulations and real-world human motion forecasting tasks, demonstrating its practical applicability.

Progressive distillation induces an implicit curriculum

Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel

This work explores the concept of progressive distillation, where a student model learns from intermediate checkpoints of a teacher model, rather than just the final model. The authors identify an “implicit curriculum” that emerges through these intermediate checkpoints, which accelerates the student’s learning and provides a sample complexity benefit. Using sparse parity as a sandbox, they demonstrate that this curriculum imparts valuable learning steps that are unavailable from the final teacher model. The study extends this idea to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world datasets, showing that the teacher progressively teaches the student to capture longer contexts. Both theoretical and empirical results highlight the effectiveness of progressive distillation across different tasks.

Scaling Laws for Precision

Authors: Tanishq Kumar, Zachary Ankner, Benjamin Frederick Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Re, Aditi Raghunathan

This work introduces precision-aware scaling laws that extend traditional scaling frameworks to account for the effects of low-precision training and inference in language models. The authors show that lower precision effectively reduces a model’s usable parameter count, enabling predictions of performance degradation due to quantization. For inference, they find that post-training quantization causes increasing degradation with more pretraining data, potentially making additional training counterproductive. Their unified framework predicts loss across varying precisions and suggests that training larger models in lower precision may be more compute-efficient. These predictions are validated on over 465 pretraining runs, including models up to 1.7B parameters.

Self-Improvement in Language Models: The Sharpening Mechanism

Authors: Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy

This paper presents a theoretical framework for understanding how LLMs can self-improve by using themselves as verifiers to refine their own outputs; a process the authors call “sharpening.” The key insight is that LLMs are often better at judging response quality than generating high-quality responses outright, so sharpening helps concentrate probability mass on better sequences. The paper analyzes two families of self-improvement algorithms: one based on supervised fine-tuning (SFT) and one on reinforcement learning (RLHF). They show that while the SFT-based approach is optimal under certain conditions, the RLHF-based approach can outperform it by actively exploring beyond the model’s existing knowledge.

When Selection meets Intervention: Additional Complexities in Causal Discovery

Authors: Haoyue Dai, Ignavier Ng, Jianle Sun, Zeyu Tang, Gongxu Luo, Xinshuai Dong, Peter Spirtes, Kun Zhang

This work tackles the often-overlooked issue of selection bias in interventional studies, where participants are selectively included based on specific criteria. Existing causal discovery methods typically ignore this bias, leading to inaccurate conclusions. To address this, the authors introduce a novel graphical model that distinguishes between the observed world with interventions and the counterfactual world where selection occurs. They develop a sound algorithm that identifies both causal relationships and selection mechanisms, demonstrating its effectiveness through experiments on both synthetic and real-world data.

miniCTX: Neural Theorem Proving with (Long-)Contexts

Authors: Jiewen Hu, Thomas Zhu, Sean Welleck

Real-world formal theorem proving relies heavily on rich contextual information, which is often absent from traditional benchmarks. To address this, the authors introduce miniCTX, a benchmark designed to test models’ ability to prove theorems using previously unseen, extensive context from real Lean projects and textbooks. Unlike prior benchmarks, miniCTX includes large repositories with relevant definitions, lemmas, and structures. Baseline experiments show that models conditioned on this broader context significantly outperform those relying solely on the local state. The authors also provide a toolkit to facilitate the expansion of the benchmark.

Spotlight Papers

ADIFF: Explaining audio difference using natural language

Authors: Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj

This paper tackles the novel task of explaining differences between audio recordings, which is important for applications like audio forensics, quality assessment, and generative audio systems. The authors introduce two new datasets and propose a three-tiered explanation framework—ranging from concise event descriptions to rich, emotionally grounded narratives—generated using large language models. They present ADIFF, a new method that improves on baselines by incorporating audio cross-projection, position-aware captioning, and multi-stage training, and show that it significantly outperforms existing audio-language models both quantitatively and via human evaluation.

Better Instruction-Following Through Minimum Bayes Risk

Authors: Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Khoshfetrat Pakazad, Graham Neubig

This paper explores how LLMs can be used as judges to evaluate and improve other LLMs. The authors show that using a method called Minimum Bayes Risk (MBR) decoding—where an LLM judge selects the best output from a set—can significantly improve model performance compared to standard decoding methods. They also find that training models on these high-quality outputs can lead to strong gains even without relying on MBR at test time, making the models faster and more efficient while maintaining or exceeding previous performance.

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Authors: Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin

This paper introduces DeFT, a new algorithm that speeds up how large language models handle tasks involving tree-like structures with shared text prefixes, such as multi-step reasoning or few-shot prompting. Existing methods waste time and memory by repeatedly accessing the same data and poorly distributing the workload across the GPU. DeFT solves this by smartly grouping and splitting memory usage to avoid redundant operations and better balance the work, leading to up to 3.6x faster performance on key tasks compared to current approaches.

Holistically Evaluating the Environmental Impact of Creating Language Models

Authors: Jacob Morrison, Clara Na, Jared Fernandez, Tim Dettmers, Emma Strubell, Jesse Dodge

This paper estimates the full environmental impact of developing large language models, including not just the final training runs but also model development and hardware manufacturing—areas typically underreported. The authors found that training a series of models released 493 metric tons of carbon emissions and used 2.769 million liters of water, even in a highly efficient data center. Notably, around half of the carbon emissions came from the development phase alone, and power usage during training varied significantly, raising concerns for energy grid planning as AI systems grow.

Language Model Alignment in Multilingual Trolley Problems

Authors: Zhijing Jin, Max Kleiman-weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez Adauto, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf

This paper evaluates how well LLMs align with human moral preferences across languages using multilingual trolley problems. The authors introduce MultiTP, a new dataset of moral dilemmas in over 100 languages based on the Moral Machine experiment, enabling cross-lingual analysis of LLM decision-making. By assessing 19 models across six moral dimensions and examining demographic correlations and prompt consistency, they uncover significant variation in moral alignment across languages—highlighting ethical biases and the need for more inclusive, multilingual approaches to responsible AI development.

Lean-STaR: Learning to Interleave Thinking and Proving

Authors: Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang

This paper introduces Lean-STaR, a framework that improves language model-based theorem proving by incorporating informal “thoughts” before each proof step. Unlike traditional approaches that rely solely on formal proof data, Lean-STaR generates synthetic thought processes using retrospective proof tactics during training. At inference time, the model generates these thoughts to guide its next action, and expert iteration further refines its performance using the Lean theorem prover. This approach boosts proof success rates and offers new insights into how structured reasoning improves formal mathematical problem solving.

MagicPIG: LSH Sampling for Efficient LLM Generation

Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

This paper introduces MagicPIG, a new system that speeds up LLM inference by approximating attention more efficiently. While many methods assume attention is sparse and use TopK approximations, the authors show this isn’t always accurate and can hurt performance. Instead, MagicPIG uses a sampling method backed by theoretical guarantees and accelerates it using Locality Sensitive Hashing, offloading computations to the CPU to support longer inputs and larger batches without sacrificing accuracy.

Multi-Robot Motion Planning with Diffusion Models

Authors: Yorai Shaoul, Itamar Mishani, Shivam Vats, Jiaoyang Li, Maxim Likhachev

This paper introduces a method for planning coordinated, collision-free movements for many robots using only data from individual robots. The authors combine learned diffusion models with classical planning algorithms to generate realistic, safe multi-robot trajectories. Their approach, called Multi-robot Multi-model planning Diffusion, also scales to large environments by stitching together multiple diffusion models, showing strong results in simulated logistics scenarios.

Reinforcement Learning for Control of Non-Markovian Cellular Population Dynamics

Authors: Josiah C Kratz, Jacob Adamczyk

This paper explores how reinforcement learning can be used to develop drug dosing strategies for controlling cell populations that adapt over time, such as cancer cells switching between resistant and susceptible states. Traditional methods struggle when the system’s dynamics are unknown or involve memory of past environments, making optimal control difficult. The authors show that deep RL can successfully learn effective strategies even in complex, memory-based systems, offering a promising approach for real-world biomedical applications.

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

This paper explores how to improve large language models’ reasoning by giving feedback at each step of their thinking process, rather than only at the final answer. The authors introduce a method where feedback—called a process reward—is based on whether a step helps make a correct final answer more likely, as judged by a separate model (a “prover”) that can recognize progress better than the model being trained. They show both theoretically and experimentally that this strategy makes learning more efficient, leading to significantly better and faster results than traditional outcome-based feedback methods.

SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models

Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Junxian Guo, Xiuyu Li, Enze Xie, Chenlin Meng, Jun-yan Zhu, Song Han

This paper introduces SVDQuant, a method for significantly speeding up diffusion models by quantizing both weights and activations to 4 bits. Since such aggressive quantization can hurt image quality, the authors use a clever technique: they shift problematic “outlier” values into a separate low-rank component handled with higher precision, while the rest is processed with efficient low-bit operations. To avoid slowing things down due to extra computation, they also design a custom inference engine called Nunchaku, which merges the processing steps to minimize memory access. Together, these techniques reduce memory usage and deliver over 3x speedups without sacrificing image quality.

Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

Authors: Eliot Xing, Vernon Luk, Jean Oh

This paper tackles the challenge of applying reinforcement learning (RL) to soft-body robotics, where simulations are usually too slow for data-hungry RL algorithms. The authors introduce SAPO, a new model-based RL algorithm that efficiently learns from differentiable simulations using analytic gradients. The authors also present Rewarped, a fast, parallel simulation platform that supports both rigid and deformable materials, demonstrating that their approach outperforms existing methods on complex manipulation and locomotion tasks.

Streaming Algorithms For $ell_p$ Flows and $ell_p$ Regression

Authors: Amit Chakrabarti, Jeffrey Jiang, David Woodruff, Taisuke Yasuda

This paper investigates how to solve underdetermined linear regression problems in a streaming setting, where the data arrives one column at a time and storing the full dataset is impractical. The authors develop algorithms that approximate the regression cost or output a near-optimal solution using much less memory than storing the entire dataset—particularly relevant for applications like computing flows on large graphs. They also establish space lower bounds, showing the limitations of what’s possible, and provide the first algorithms that achieve nontrivial approximations using sublinear space in various settings.

Poster Papers

Alignment, Fairness, Safety, Privacy, And Societal Considerations

$beta$-calibration of Language Model Confidence Scores for Generative QA

Authors: Putra Manggala, Atalanti A. Mastakouri, Elke Kirschbaum, Shiva Kasiviswanathan, Aaditya Ramdas

AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks

Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, Xander Davies

Aligned LLMs Are Not Aligned Browser Agents

Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine T Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean M. Hendryx, Summer Yue, Zifan Wang

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Authors: Keltin Grimes, Marco Christiani, David Shriver, Marissa Catherine Connor

DECISION-FOCUSED UNCERTAINTY QUANTIFICATION

Authors: Santiago Cortes-gomez, Carlos Miguel Patiño, Yewon Byun, Steven Wu, Eric Horvitz, Bryan Wilder

Dissecting Adversarial Robustness of Multimodal LM Agents

Authors: Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Authors: Zhenting Qi, Hanlin Zhang, Eric P. Xing, Sham M. Kakade, Himabindu Lakkaraju

Generative Classifiers Avoid Shortcut Solutions

Authors: Alexander Cong Li, Ananya Kumar, Deepak Pathak

Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks

Authors: Shengyuan Hu, Yiwei Fu, Steven Wu, Virginia Smith

Noisy Test-Time Adaptation in Vision-Language Models

Authors: Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han

Pacmann: Efficient Private Approximate Nearest Neighbor Search

Authors: Mingxun Zhou, Elaine Shi, Giulia Fanti

Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs

Authors: Xuandong Zhao, Lei Li, Yu-xiang Wang

Persistent Pre-training Poisoning of LLMs

Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

Prompting Fairness: Integrating Causality to Debias Large Language Models

Authors: Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu

Reconciling Model Multiplicity for Downstream Decision Making

Authors: Ally Yalei Du, Dung Daniel Ngo, Steven Wu

Self-Play Preference Optimization for Language Model Alignment

Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

Toward Robust Defenses Against LLM Weight Tampering Attacks

Authors: Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika

Applications To Computer Vision, Audio, Language, And Other Modalities

Agent-to-Sim: Learning Interactive Behavior Model from Casual Longitudinal Videos

Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa

Context-aware Dynamic Pruning for Speech Foundation Models

Authors: Masao Someki, Yifan Peng, Siddhant Arora, Markus Müller, Athanasios Mouchtaris, Grant Strimel, Jing Liu, Shinji Watanabe

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Authors: Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-xiong Wang

Fugatto 1: Foundational Generative Audio Transformer Opus 1

Authors: Rafael Valle, Rohan Badlani, Zhifeng Kong, Sang-gil Lee, Arushi Goel, Joao Felipe Santos, Aya Aljafari, Sungwon Kim, Shuqi Dai, Siddharth Gururani, Alexander H. Liu, Kevin J. Shih, Ryan Prenger, Wei Ping, Chao-han Huck Yang, Bryan Catanzaro

Gaussian Splatting Lucas-Kanade

Authors: Liuyue Xie, Joel Julin, Koichiro Niinuma, Laszlo Attila Jeni

ImageFolder: Autoregressive Image Generation with Folded Tokens

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin

Improving Large Language Model based Multi-Agent Framework through Dynamic Workflow Updating

Authors: Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Authors: Jun-yan He, Zhi-qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G Hauptmann

OMG: Opacity Matters in Material Modeling with Gaussian Splatting

Authors: Silong Yong, Venkata Nagarjun Pudureddiyur Manivannan, Bernhard Kerbl, Zifu Wan, Simon Stepputtis, Katia P. Sycara, Yaqi Xie

Scene Flow as a Partial Differential Equation

Authors: Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kemal Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, Joachim Pehserl

TrackTheMind: program-guided adversarial data generation for theory of mind reasoning

Authors: Melanie Sclar, Jane Dwivedi-yu, Maryam Fazel-zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, Asli Celikyilmaz

Understanding Visual Concepts Across Models

Authors: Brandon Trabucco, Max A Gurinas, Kyle Doherty, Russ Salakhutdinov

Applications To Neuroscience & Cognitive Science

Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers

Authors: Andrew Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Rajat Dewan, Margaret Marie Henderson, Leila Wehbe, Michael J. Tarr

Self-Attention-Based Contextual Modulation Improves Neural System Identification

Authors: Isaac Lin, Tianye Wang, Shang Gao, Tang Shiming, Tai Sing Lee

Applications To Physical Sciences (Physics, Chemistry, Biology, Etc.)

Causal Representation Learning from Multimodal Biological Observations

Authors: Yuewen Sun, Lingjing Kong, Guangyi Chen, Loka Li, Gongxu Luo, Zijian Li, Yixuan Zhang, Yujia Zheng, Mengyue Yang, Petar Stojanov, Eran Segal, Eric P. Xing, Kun Zhang

Chemistry-Inspired Diffusion with Non-Differentiable Guidance

Authors: Yuchen Shen, Chenhao Zhang, Sijie Fu, Chenghui Zhou, Newell Washburn, Barnabas Poczos

Text2PDE: Latent Diffusion Models for Accessible Physics Simulation

Authors: Anthony Zhou, Zijie Li, Michael Schneier, John R Buchanan Jr, Amir Barati Farimani

Applications To Robotics, Autonomy, Planning

Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback

Authors: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Yang Wang

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Authors: Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

Causal Reasoning

A Conditional Independence Test in the Presence of Discretization

Authors: Boyang Sun, Yu Yao, Guang-yuan Hao, Yumou Qiu, Kun Zhang

A Robust Method to Discover Causal or Anticausal Relation

Authors: Yu Yao, Yang Zhou, Bo Han, Mingming Gong, Kun Zhang, Tongliang Liu

A Skewness-Based Criterion for Addressing Heteroscedastic Noise in Causal Discovery

Authors: Yingyu Lin, Yuxing Huang, Wenqin Liu, Haoran Deng, Ignavier Ng, Kun Zhang, Mingming Gong, Yian Ma, Biwei Huang

Analytic DAG Constraints for Differentiable DAG Learning

Authors: Zhen Zhang, Ignavier Ng, Dong Gong, Yuhang Liu, Mingming Gong, Biwei Huang, Kun Zhang, Anton Van Den Hengel, Javen Qinfeng Shi

Causal Graph Transformer for Treatment Effect Estimation Under Unknown Interference

Authors: Anpeng Wu, Haiyi Qiu, Zhengming Chen, Zijian Li, Ruoxuan Xiong, Fei Wu, Kun Zhang

Differentiable Causal Discovery for Latent Hierarchical Causal Models

Authors: Parjanya Prajakta Prashant, Ignavier Ng, Kun Zhang, Biwei Huang

Datasets And Benchmarks

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Authors: Chien-yu Huang, Wei-chih Chen, Shu-wen Yang, Andy T. Liu, Chen-an Li, Yu-xiang Lin, Wei-cheng Tseng, Anuj Diwan, Yi-jen Shih, Jiatong Shi, William Chen, Xuanjun Chen, Chi-yuan Hsiao, Puyuan Peng, Shih-heng Wang, Chun-yi Kuan, Ke-han Lu, Kai-wei Chang, Chih-kai Yang, Fabian Alejandro Ritter Gutierrez, Huang Kuan-po, Siddhant Arora, You-kuan Lin, Chuang Ming To, Eunjung Yeo, Kalvin Chang, Chung-ming Chien, Kwanghee Choi, Cheng-hsiu Hsieh, Yi-cheng Lin, Chee-en Yu, I-hsiang Chiu, Heitor Guimarães, Jionghao Han, Tzu-quan Lin, Tzu-yuan Lin, Homu Chang, Ting-wu Chang, Chun Wei Chen, Shou-jen Chen, Yu-hua Chen, Hsi-chun Cheng, Kunal Dhawan, Jia-lin Fang, Shi-xin Fang, Kuan Yu Fang Chiang, Chi An Fu, Hsien-fu Hsiao, Ching Yu Hsu, Shao-syuan Huang, Lee Chen Wei, Hsi-che Lin, Hsuan-hao Lin, Hsuan-ting Lin, Jian-ren Lin, Ting-chun Liu, Li-chun Lu, Tsung-min Pai, Ankita Pasad, Shih-yun Shan Kuan, Suwon Shon, Yuxun Tang, Yun-shao Tsai, Wei Jui Chiang, Tzu-chieh Wei, Chengxi Wu, Dien-ruei Wu, Chao-han Huck Yang, Chieh-chi Yang, Jia Qi Yip, Shao-xiang Yuan, Haibin Wu, Karen Livescu, David Harwath, Shinji Watanabe, Hung-yi Lee

GameArena: Evaluating LLM Reasoning through Live Computer Games

Authors: Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang

Harnessing Webpage UIs for Text-Rich Visual Understanding

Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video

Authors: Xiaohao Xu, Tianyi Zhang, Shibo Zhao, Xiang Li, Sibo Wang, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-roberson, Sebastian Scherer, Xiaonan Huang

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Authors: Muhammad A Shah, David Solans Noguero, Mikko A. Heikkilä, Bhiksha Raj, Nicolas Kourtellis

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Authors: Siddhant Arora, Zhiyun Lu, Chung-cheng Chiu, Ruoming Pang, Shinji Watanabe

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Authors: Kush Jain, Gabriel Synnaeve, Baptiste Roziere

Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models

Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Authors: Lawrence Keunho Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida

Foundation Or Frontier Models, Including Llms

Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws

Authors: Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, J Zico Kolter

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Authors: Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh R N, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong

Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data

Authors: Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang

Improving Large Language Model Planning with Action Sequence Similarity

Authors: Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, Azade Nova

Inference Optimal VLMs Need Only One Visual Token but Larger Models

Authors: Kevin Li, Sachin Goyal, João D. Semedo, J Zico Kolter

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving

Authors: Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang

Language Models Need Inductive Biases to Count Inductively

Authors: Yingshan Chang, Yonatan Bisk

MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Authors: Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

Authors: Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng

Mixture of Parrots: Experts improve memorization more than reasoning

Authors: Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-melis, Yuanzhi Li, Sham M. Kakade, Eran Malach

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Authors: Xiang Yue, Yueqi Song, Akari Asai, Simran Khanuja, Anjali Kantharuban, Seungone Kim, Jean De Dieu Nyandwi, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

Physics of Language Models: Part 3.2, Knowledge Manipulation

Authors: Zeyuan Allen-zhu, Yuanzhi Li

Scaling Long Context Training Data by Long-Distance Referrals

Authors: Yonghao Zhuang, Lanxiang Hu, Longfei Yun, Souvik Kundu, Zhengzhong Liu, Eric P. Xing, Hao Zhang

Sparse Matrix in Large Language Model Fine-tuning

Authors: Haoze He, Juncheng B Li, Xuan Jiang, Heather Miller

Specialized Foundation Models struggle to beat Supervised Baselines

Authors: Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

Authors: Shengyu Feng, Xiang Kong, Shuang Ma, Aonan Zhang, Dong Yin, Chong Wang, Ruoming Pang, Yiming Yang

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia

Time, Space and Streaming Efficient Algorithm for Heavy Attentions

Authors: Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David Woodruff

Generative Models

Consistency Models Made Easy

Authors: Zhengyang Geng, Ashwini Pokle, Weijian Luo, Justin Lin, J Zico Kolter

Human-Aligned Chess With a Bit of Search

Authors: Yiming Zhang, Athul Paul Jacob, Vivian Lai, Daniel Fried, Daphne Ippolito

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Shuaiqi Wang, Matthew B. Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Authors: Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-hsu Yen, Avner May, Tianqi Chen, Beidi Chen

OmniPhysGS: 3D Constitutive Gaussians for General Physics-Based Dynamics Generation

Authors: Yuchen Lin, Chenguo Lin, Jianjin Xu, Yadong Mu

RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Authors: Wenda Xu, Rujun Han, Zifeng Wang, Long Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-yu Lee, Tomas Pfister

TFG-Flow: Training-free Guidance in Multimodal Generative Flow

Authors: Haowei Lin, Shanda Li, Haotian Ye, Yiming Yang, Stefano Ermon, Yitao Liang, Jianzhu Ma

Truncated Consistency Models

Authors: Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, Weili Nie

TypedThinker: Typed Thinking Improves Large Language Model Reasoning

Authors: Danqing Wang, Jianxin Ma, Fei Fang, Lei Li

Infrastructure, Software Libraries, Hardware, Systems, Etc.

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Authors: Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig

Interpretability And Explainable Ai

Improving Instruction-Following in Language Models through Activation Steering

Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

Interpreting Language Reward Models via Contrastive Explanations

Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning

Authors: Zhuorui Ye, Stephanie Milani, Geoffrey J. Gordon, Fei Fang

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Authors: Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-zhu

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Authors: Hyesu Lim, Jinho Choi, Jaegul Choo, Steffen Schneider

Learning On Graphs And Other Geometries & Topologies

Learning Graph Invariance by Harnessing Spuriosity

Authors: Tianjun Yao, Yongqiang Chen, Kai Hu, Tongliang Liu, Kun Zhang, Zhiqiang Shen

Spectro-Riemannian Graph Neural Networks

Authors: Karish Grover, Haiyang Yu, Xiang Song, Qi Zhu, Han Xie, Vassilis N. Ioannidis, Christos Faloutsos

Learning Theory

A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang

Larger Language Models Provably Generalize Better

Authors: Marc Anton Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Andrew Gordon Wilson, Christopher De Sa, J Zico Kolter

Learning from weak labelers as constraints

Authors: Vishwajeet Agrawal, Rattana Pukdee, Maria Florina Balcan, Pradeep Kumar Ravikumar

Neurosymbolic & Hybrid Ai Systems (Physics-informed, Logic & Formal Reasoning, Etc.)

ImProver: Agent-Based Automated Proof Optimization

Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks in Open Domains

Authors: Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, Honguk Woo

Optimization

Understanding Optimization in Deep Learning with Central Flows

Authors: Jeremy Cohen, Alex Damian, Ameet Talwalkar, J Zico Kolter, Jason D. Lee

Other Topics In Machine Learning (I.e., None Of The Above)

AnoLLM: Large Language Models for Tabular Anomaly Detection

Authors: Che-ping Tsai, Ganyu Teng, Phillip Wallis, Wei Ding

Beyond Worst-Case Dimensionality Reduction for Sparse Vectors

Authors: Sandeep Silwal, David Woodruff, Qiuyi Zhang

Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity

Authors: Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu

Probabilistic Methods (Bayesian Methods, Variational Inference, Sampling, Uq, Etc.)

Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback

Authors: Michelle D Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, Andrea Bajcsy

Reinforcement Learning

Diffusing States and Matching Scores: A New Framework for Imitation Learning

Authors: Runzhe Wu, Yiding Chen, Gokul Swamy, Kianté Brantley, Wen Sun

Efficient Imitation under Misspecification

Authors: Nicolas Espinosa-dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Authors: Zhaolin Gao, Wenhao Zhan, Jonathan Daniel Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun

Reinforcement learning with combinatorial actions for coupled restless bandits

Authors: Lily Xu, Bryan Wilder, Elias Boutros Khalil, Milind Tambe

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

Transfer Learning, Meta Learning, And Lifelong Learning

Many-Objective Multi-Solution Transport

Authors: Ziyue Li, Tian Li, Virginia Smith, Jeff Bilmes, Tianyi Zhou

pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

Authors: Shentong Mo, Xufang Luo, Dongsheng Li

Unsupervised, Self-supervised, Semi-supervised, And Supervised Representation Learning

Learning Representations of Intermittent Temporal Latent Process

Authors: Yuke Li, Yujia Zheng, Guangyi Chen, Kun Zhang, Heng Huang

Memory Mosaics

Authors: Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Leon Bottou

MetaOOD: Automatic Selection of OOD Detection Models

Authors: Yuehan Qin, Yichi Zhang, Yi Nian, Xueying Ding, Yue Zhao

Repetition Improves Language Model Embeddings

Authors: Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan

Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning

Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang

Read More

Allie: A Human-Aligned Chess Bot

Allie: A Human-Aligned Chess Bot

Play against Allie on lichess!

Introduction

In 1948, Alan Turning designed what might be the first chess playing AI, a paper program that Turing himself acted as the computer for. Since then, chess has been a testbed for nearly every generation of AI advancement. After decades of improvement, today’s top chess engines like Stockfish and AlphaZero have far surpassed the capabilities of even the strongest human grandmasters.

However, most chess players are not grandmasters, and these state-of-the-art Chess AIs have been described as playing more like aliens than fellow humans.

The core problem here is that strong AI systems are not human-aligned; they are unable to match the diversity of skill levels of human partners and unable to model human-like behaviors beyond piece movement. Understanding how to make AI systems that can effectively collaborate with and be overseen by humans is a key challenge in AI alignment. Chess provides an ideal testbed for trying out new ideas towards this goal – while modern chess engines far surpass human ability, they are completely incapable of playing in a human-like way or adapting to match their human opponents’ skill levels. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game.

What is Human-aligned Chess?

When we talk about “human-aligned” chess AI, what exactly do we mean? At its core, we want a system that is both humanlike, defined as making moves that feel natural to human players, as well as skill-calibrated, defined as capable of playing at a similar level against human opponents across the skill spectrum.

Our goal here is quite different from traditional chess engines like Stockfish or AlphaZero, which are optimized solely to play the strongest moves possible. While these engines achieve superhuman performance, their play can feel alien to humans. They may instantly make moves in complex positions where humans would need time to think, or continue playing in completely lost positions where humans would normally resign.

Building Allie

Allie's system design
Figure 1: (a) A game state is represented as the sequence of moves that produced it and some metadata. This sequence is inputted to a Transformer, which predicts the next move, pondering time for this move, and a value assessment of the move. (b) At inference time, we employee Monte-Carlo Tree Search with the value predictions from the model. The number of rollouts (N_mathrm{sim}) is chosen dynamically based on the predicted pondering time.

A Transformer model trained on transcripts of real games

While most prior deep learning approaches build models that input a board state, and output a distribution over possible moves, we instead approach chess like a language modeling task. We use a Transformer architecture that inputs a sequence of moves rather than a single board state. Just as large language models learn to generate human-like text by training on vast text corpora, we hypothesized that a similar architecture could learn human-like chess by training on human game records. We train our chess “language” model on transcripts of over 93M games encompassing a total of 6.6 billion moves, which were played on the chess website Lichess.

Conditioning on Elo score

In chess, Elo scores normally fall in the range of 500 (beginner players) to 3000 (top chess professionals). To calibrate the playing strength of ALLIE to different levels of players, we model gameplay under a conditional generation framework, where encodings of the Elo ratings of both players are prepended to the game sequence. Specifically, we prefix each game with soft control tokens, which interpolate between a weak token, representing 500 Elo, and a strong token, representing 3000 Elo.

For a player with Elo rating (k), we compute a soft token (e_k) by linearly interpolating between the weak and strong tokens:

$$e_k = gamma e_text{weak} + (1-gamma) e_text{strong}$$

where (gamma = frac{3000-k}{2500}). During training, we prefix each game with two soft tokens corresponding to the two players’ strengths.

Learning objectives

On top of the base Transformer model, Allie has three prediction objectives:

  1. A policy head (p_theta) that outputs a probability distribution over possible next moves
  2. A pondering-time head (t_theta) that outputs the number of seconds a human player would take to come up with this move
  3. A value assessment head (v_theta) that outputs a scalar value representing who expects to win the game

All three heads are individually parametrized as linear layers applied to the final hidden state of the decoder. Given a dataset of chess games, represented as a sequence of moves (mathbf{m}), human ponder time before each move (mathbf{t}), and game output (v) we trained Allie to minimize the log-likelihood of next moves and MSE of time and value predictions:

$$mathcal{L}(theta) = sum_{(mathbf{m}, mathbf{t}, v) in mathcal{D}} left( sum_{1 le i le N} left( -log p_theta(m_i ,|, mathbf{m}_{lt i}) + left(t_theta(mathbf{m}_{lt i}) – t_iright)^2 + left(v_theta(mathbf{m}_{lt i}) – vright)^2 right) right) text{.}$$

Adaptive Monte-Carlo Tree Search

At play-time, traditional chess engines like AlphaZero use search algorithms such as Monte-Carlo Tree Search (MCTS) to anticipate many moves into the future, evaluating different possibilities for how the game might go. The search budget (N_mathrm{sim}) is almost always fixed—they will spend the same amount of compute on search regardless of whether the best next move is extremely obvious or pivotal to the outcome of the game.

This fixed budget doesn’t match human behavior; humans naturally spend more time analyzing critical or complex positions compared to simple ones. In Allie, we introduce a time-adaptive MCTS procedure that varies the amount of search based on Allie’s prediction of how long a human would think in each position. If Allie predicts a human would spend more time on a position, it performs more search iterations to better match human depth of analysis. To keep things simple, we just set

How does Allie Play?

To evaluate whether Allie is human-aligned, we evaluate its performance both on an offline dataset and online against real human players.

Figure 2. Allie significantly outperforms pervious state-of-the-art methods. Adaptive-search enables matching human moves at expert levels.

In offline games, Allie achieves state-of-the-art in move-matching accuracy (defined as the % of moves made that match real human moves). It also models how humans resign, and ponder very well.

Figure 3: Allie’s time predictions are strongly correlated with ground-truth human time usage. In the figure, we show median and IQR of Allie’s think time for different amount of time spent by humans.
Figure 4: Allie learns to assign reliable value estimates to board states by observing game outcomes alone. We report Pearson’s r correlation of value estimates by ALLIE and Stockfish with game outcomes.

Another main insight of our paper is that adaptive search enables remarkable skill calibration against players across the skill spectrum. Against players from 1100 to 2500 Elo, the adaptive search variant of Allie has an average skill gap of only 49 Elo points. In other words, Allie (with adaptive search) wins about 50% of games against opponents that are both beginner and expert level. Notably, none of the other methods (even the non-adpative MCTS baseline) can match the strength of 2500 Elo players.

Table 1: Adaptive search enables remarkable skill calibration. Mean and maximum skill calibration errors is measured by computed by binning human players into 200-Elo groups. We also report systems’ estimated performance against players at the lower and upper Elo ends of the skill spectrum.

Limitations and Future Work

Despite strong offline evaluation metrics and generally positive player feedback, Allie still exhibits occasional behaviors that feel non-humanlike. Players specifically noted Allie’s propensity toward late-game blunders and sometimes spending too much time pondering positions where there’s only one reasonable move. These observations suggest there’s still room to improve our understanding of how humans allocate cognitive resources during chess play.

For future work, we identify several promising directions. First, our approach heavily relies on available human data, which is plentiful for fast time controls but more limited for classical chess with longer thinking time. Extending our approach to model human reasoning in slower games, where players make more accurate moves with deeper calculation, represents a significant challenge. With the recent interest in reasoning models that make use of test-time compute, we hope that our adaptive search technique can be applied to improving the efficiency of allocating a limited compute budget.

If you are interested in learning more about this work, please checkout our ICLR paper, Human-Aligned Chess With a Bit of Search.

Read More

LLM Unlearning Benchmarks are Weak Measures of Progress

LLM Unlearning Benchmarks are Weak Measures of Progress

TL;DR: “Machine unlearning” aims to remove data from models without retraining the model completely. Unfortunately, state-of-the-art benchmarks for evaluating unlearning in LLMs are flawed, especially because they separately test “forget queries” and “retain queries” without examining potential dependencies between forget and retain data. We show that such benchmarks do not provide an accurate measure of whether or not unlearning has occurred, making it difficult to evaluate whether new algorithms are truly making progress on the problem of unlearning. In our paper, at SaTML ’25, we examine this and other pitfalls in more detail, and provide recommendations for unlearning research going forward. We additionally released two new datasets on HuggingFace: [swapped WMDP], [paired TOFU].

Overview

Large-scale data collection, particularly through data available on the Web, has enabled stunning progress in the capabilities of generative models over the past decade. However, using Web data wholesale in model training raises questions about user privacy, copyright protection, and harmful content generation. 

Researchers have come up with a number of potential ways to mitigate these harms. Among them is “machine unlearning,” where undesirable data (whether private user data, copyright-protected data, or potentially toxic content) can be deleted from models after they have already been trained. The intuitive goal of machine unlearning is to enable this deletion more efficiently than the obvious solution, which is to retrain the entire model from scratch (which would be incredibly expensive for a modern LLM). 

Benchmarking Unlearning

Unlearning is a difficult problem, and enabling research on this topic requires accurate metrics to measure progress. In order to evaluate unlearning, researchers have proposed several benchmarks. These generally have the following structure:

  • A base model which may be a pretrained model or a model finetuned on some benchmark data.
  • Forget data to be unlearned. This could also be specified as a concept or topic rather than data points.
  • Retain data consisting of the remaining data that will not be unlearned.
  • A forget set of evaluation queries that are meant to test access to unlearned information.
  • A retain set of queries that are meant to test access to information that should not be unlearned.

Figure 1. The majority of LLM unlearning papers published in 2024 evaluate only on a handful of benchmarks, and all of these benchmarks have a “forget set-retain set” structure.

We surveyed 72 LLM unlearning papers published in 2024 in order to understand the state of unlearning evaluations today. Out of these, we found that a handful of benchmarks were overwhelmingly popular, as shown in Figure 1. All of these benchmarks follow the “forget set”/”retain set” structure described above. In fact, even in 2025, we find that new works continue to evaluate on this small set of benchmarks, sometimes restricting to only one or two benchmarks. As we show later in this post, this structure is too simple to adequately measure progress on unlearning.

We focused our work on some of the most popular benchmarks (highlighted in orange above), but the takeaways apply more generally to benchmarks with the structure described above.

Main Takeaways

The main finding of our work is that the majority of popular evaluation benchmarks (including but not limited to TOFU and WMDP) are weak measures of progress, and results reported on these benchmarks are anywhere from unreliable to actively misleading as far as whether unlearning has actually succeeded.

Therefore, we encourage the community to interpret results with caution and be aware of common pitfalls when interpreting evaluations. For example, if a paper evaluates solely on benchmarks that use a disjoint “forget” and “retain” evaluation, the results may not accurately reflect whether unlearning has actually occurred. 

Most importantly, empirical evaluations are a possibly necessary but not sufficient condition to ensure unlearning. They are highly useful for testing whether a method is broken, but cannot guarantee that a method has succeeded.

More specifically, we find:

  • Benchmarks that split queries into an independent “forget set” and a “retain set” overestimate the effectiveness of unlearning. Introducing dependencies between these queries can reveal data that was supposedly unlearned, or destroy performance on data that was supposed to be retained. Note that we do not modify or attack the algorithms, only change the evaluation queries.
  • Ambiguities in the forget data — for example, specifying a concept or topic, rather than specific data points in the training set, to unlearn — can lead to poor evaluation hygiene in practice, such as “peeking” at evaluation data when designing the unlearning algorithm. 

In this blog post, we focus on the first point. In our paper, we provide a literature survey, more details on the second point, as well as recommendations for researchers going forward. We also provide two new evaluation datasets on HuggingFace: [modified WMDP], [modified TOFU]. 

Forget-Retain Evaluations are Deceptive

Key takeaway: Evaluating on an independent “forget set” and “retain set” is insufficient to measure unlearning. In reality, a single query can reference both forget data and retain data, and we find that these types of queries can reveal “unlearned” information or destroy access to “retained” information.

Finding #1: TOFU. The TOFU benchmark is comprised of a dataset of facts about fictitious authors and a Llama model finetuned on this data. The goal is to unlearn information about some of the authors while retaining information about the remaining authors.

The forget queries correspond to questions about authors in the forget set only, while the retain queries correspond to questions about the remaining authors, as well as world facts.

We find that simply concatenating a forget query and a retain query can uncover flaws in unlearning methods. For example:

Forget query: Who is the notable author born in Astana, Kazakhstan on the 7th of February, 1952?

Retain query: What is the full name of the author born on July 17, 1953 in Accra, Ghana?

Combined query: 1. Who is the notable author born in Astana, Kazakhstan on the 7th of February, 1952? 2. What is the full name of the author born on July 17, 1953 in Accra, Ghana?

The fully retrained model (the gold standard for unlearning) hallucinates an incorrect response for the first question, while answering the second correctly. DPO, an alignment method that has been applied to unlearning, refuses to answer at all. Meanwhile, ECO answers both queries correctly, even the forget query. In fact, we find that the simplest gradient ascent method has the best stability out of the three (retaining its performance in the combined query, although the initial performance appears worse).

Finding #2: WMDP. The WMDP benchmark consists of data to unlearn about potentially dangerous biological, chemical, and cybersecurity attacks, and multiple-choice questions about each topic, classified into benign (retain) queries and harmful (forget) queries.

We make a very simple modification to the retain queries: swap one of the incorrect choices with a keyword that is in the forget data — specifically, “SARS-CoV-2.” In a correctly unlearned model, this should have no impact on the model’s ability to answer correctly on the retain queries.

In reality, we find that swapping in an incorrect response results in a 28% decrease in accuracy for the state-of-the-art unlearning method RMU! Once again, introducing a very simple dependency on the forget data is sufficient to completely change the conclusions one draws from the benchmark, again without modifying or targeting anything about the algorithm.

Figure 2. Unlearning methods appear to perform well on “benign” retain set questions, but by simply including a keyword from the forget data in the retain question, the performance drops to below random.

Datasets. We do not necessarily believe that any one dataset can be comprehensive enough to ensure that unlearning has occurred, but a dataset can be a lower bound to determine whether unlearning has not occurred. Towards this, we release both of these datasets on HuggingFace: [swapped WMDP], [paired TOFU].

Where do we go from here?

Since our work became public in October 2024, the community has continued to report results and claim success on benchmarks that exclusively use a “forget-retain split” of data. As a starting point to move evaluations forward, we have released the evaluation sets that we use in our work, and encourage practitioners to use these to stress-test unlearning algorithms. 

While provable guarantees may be the ultimate measure of success, a strong evaluation can provide evidence that an algorithm is promising. We therefore encourage community members to take the time to develop further evaluation datasets that test potential failure modes of unlearning algorithms. We also strongly encourage algorithms to come with a threat model that describes in detail the system and query model under which the guarantee is expected to hold.

Ultimately, even the most thorough benchmark will still be limited by the query set. In our paper, we discuss possible directions for unlearning with provable guarantees and more rigorous tests of unlearning.

Read More

Copilot Arena: A Platform for Code

Copilot Arena: A Platform for Code

Figure 1. Copilot Arena is a VSCode extension that collects human preferences of code directly from developers. 

As model capabilities improve, large language models (LLMs) are increasingly integrated into user environments and workflows. In particular, software developers code with LLM-powered tools in integrated development environments such as VS Code, IntelliJ, or Eclipse. While these tools are increasingly used in practice, current LLM evaluations struggle to capture how users interact with these tools in real environments, as they are often limited to short user studies, only consider simple programming tasks as opposed to real-world systems, or rely on web-based platforms removed from development environments.

To address these limitations, we introduce Copilot Arena, an app designed to evaluate LLMs in real-world settings by collecting preferences directly in a developer’s actual workflow. Copilot Arena is a Visual Studio Code extension that provides developers with code completions, akin to the type of support provided by GitHub Copilot. Thus far, over 11,000 users have downloaded Copilot Arena, and the tool has served over 100K completions, and accumulated over 25,000 code completion battles. The battles form a live leaderboard on the LMArena website. Since its launch, Copilot Arena has also been used to evaluate two new code completion models prior to their release: a new Codestral model from Mistral AI and Mercury Coder from InceptionAI. 

In this blog post, we discuss how we designed and deployed Copilot Arena. We also highlight how Copilot Arena provides new insights into developer code preferences.

Copilot Arena System Design

To collect user preferences, Copilot Arena presents a novel interface that shows users paired code completions from two different LLMs, which are determined based on a sampling strategy that mitigates latency while preserving coverage across model comparisons. Additionally, we devise a prompting scheme that allows a diverse set of models to perform code completions with high fidelity. Figure 1 overviews this workflow. We will overview each component below:

User Interface: Copilot Arena allows users to select between pairs of code completions from different LLMs. User selections allow us to better understand developer preferences between LLMs. To avoid interrupting user workflows, voting is designed to be seamless—users use keyboard shortcuts to quickly accept code completions.   

Sampling model pairs: We explore a sampling strategy to minimize the experienced latency. Since our interface shows two code completions together, the slowest completion determines the latency. We capture each model’s latency as a log-normal distribution and tune a temperature parameter to interpolate between a latency-optimized distribution and a uniform distribution, observing a decrease in median experienced latency by 33% (from 1.61 to 1.07 seconds) compared to a uniform distribution.

Figure 2: We develop a simple prompting scheme to enable LLMs to perform infilling tasks compared to the vanilla performance.  

Prompting for code completions: During development, models need to “fill in the middle”, where code needs to be generated based on both the current prefix and suffix. While some models, such as DeepSeek and Codestral, are designed to fill in the middle, many chat models are not and require additional prompting. To accomplish this, we allow the model to generate code snippets, which is a more natural format, and then post-process them into a FiM completion. Our approach is as follows: in addition to the same prompt templates above, the models are provided with instructions to begin by re-outputting a portion of the prefix and similarly end with a portion of the suffix. We then match portions of the output code in the input and delete the repeated code. This simple prompting trick allows chat models to perform code completions with high success (Figure 2).

Deployment

Figure 3. Copilot Arena leaderboard is live on lmareana.ai.

We deploy Copilot Arena as a free extension available on the VSCode extension store. During deployment, we log user judgments and latency for model responses, along with the user’s input and completion. Given the sensitive nature of programming, users can restrict our access to their data. Depending on privacy settings, we also collect the user’s code context and model responses.

As is standard in other work on pairwise preference evaluation (e.g., Chatbot Arena), we apply a Bradley-Terry (BT) model to estimate the relative strengths of each model. We bootstrap the battles in the BT calculation to construct a 95% confidence interval for the rankings, which are used to create a leaderboard that ranks all models, where each model’s rank is determined by which other models’ lower bounds fall below its upper bound. We host a live leadboard of model rankings at lmarena.ai (Figure 3). 

Findings

Figure 4. Model rankings in Copilot Arena (1st column) differ from existing evaluations, both for static benchmarks (2nd-4th column) and live preference evaluations (last two columns). We also report Spearman’s rank correlation (r) between Copilot Arena and other benchmarks. 

Comparison to prior datasets

We compare our leaderboard to existing evaluations, which encompass both live preference leaderboards with human feedback and static benchmarks (Figure 4). The static benchmarks we compare against are LiveBench, BigCodeBench, and LiveCodeBench, which evaluate models’ code generation abilities on a variety of Python tasks and continue to be maintained with new model releases. We also compare to Chatbot Arena and their coding-specific subset, which are human preferences of chat responses collected through a web platform.

We find a low correlation (r ≤ 0.1) with most static benchmarks, but a relatively higher correlation (Spearman’s rank correlation (r) of 0.62) with Chatbot Arena (coding) and a similar correlation (r = 0.48) with Chatbot Arena (general). The stronger correlation with human preference evaluations compared to static benchmarks likely indicates that human feedback captures distinct aspects of model performance that static benchmarks fail to measure. We notice that smaller models tend to overperform (e.g., GPT-4o mini and Qwen-2.5-Coder 32B), particularly in static benchmarks. We attribute these differences to the unique distribution of data and tasks that Copilot Arena evaluates over, which we explore in more detail next.

Figure 5. Copilot Arena data is diverse in programming and natural languages, downstream tasks, and code structures (e.g., context lengths, last-line contexts, and completion structures).

In comparison to prior approaches, evaluating models in real user workflows leads to a diverse data distribution in terms of programming and natural languages, tasks, and code structures (Figure 5):

  • Programming and natural language: While the plurality of Copilot Arena users write in English (36%) and Python (49%), we also identify 24 different natural languages and 103 programming languages which is comparable to Chatbot Arena (general) and benchmarks focused on multilingual generation. In contrast, static benchmarks tend to focus on questions written solely in Python and English.
  • Downstream tasks: Existing benchmarks tend to source problems from coding competitions, handwritten programming challenges, or from a curated set of GitHub repositories. In contrast, Copilot Arena users are working on a diverse set of realistic tasks, including but not limited to frontend components, backend logic, and ML pipelines.
  • Code structures and context lengths: Most coding benchmarks follow specific structures, which means that most benchmarks have relatively short context lengths. Similarly, Chatbot Arena focuses on natural language input collected from chat conversations, with many prompts not including any code context (e.g., 40% of Chatbot Arena’s coding tasks contain code context and only 2.6% focus on infilling). Unlike any existing evaluation, Copilot Arena is structurally diverse with significantly longer inputs.

Insights into user preferences

  • Downstream tasks significantly affect win rate, while programming languages have little effect:  Changing task type significantly affects relative model performance, which may indicate that certain models are overexposed to competition-style algorithmic coding problems. On the other hand, the effect of the programming language on win-rates was remarkably small, meaning that models that perform well on Python will likely perform well on another language. We hypothesize that this is because of the inherent similarities between programming languages, and learning one improves performance in another, aligning with trends reported in prior work.
  • Smaller models may overfit to data similar to static benchmarks, while the performance of larger models is mixed: Existing benchmarks (e.g., those in Figure 4) primarily evaluate models on Python algorithmic problems with short context. However, we notice that Qwen-2.5 Coder performs noticeably worse on frontend/backend tasks, longer contexts, and non-Python settings. We observe similar trends for the two other small models (Gemini Flash and GPT-4o mini). We hypothesize that overexposure may be particularly problematic for smaller models. On the other hand, performance amongst larger models is mixed. 

Conclusion

While Copilot Arena represents a shift in the right direction for LLM evaluation, providing more grounded and realistic evaluations, there is still significant work to be done to fully represent all developer workflows. For example, extending Copilot Arena to account for interface differences from production tools like GitHub Copilot and tackling privacy considerations that limit data sharing. Despite these constraints, our platform reveals that evaluating coding LLMs in realistic environments yields rankings significantly different from static benchmarks or chat-based evaluations and highlights the importance of testing AI assistants with real users on real tasks. We’ve open-sourced Copilot Arena to encourage the open source community to include more nuanced feedback mechanisms, code trajectory metrics, and additional interaction modes.

If you think this blog post is useful for your work, please consider citing it.

@misc{chi2025copilotarenaplatformcode,
      title={Copilot Arena: A Platform for Code LLM Evaluation in the Wild}, 
      author={Wayne Chi and Valerie Chen and Anastasios Nikolas Angelopoulos and Wei-Lin Chiang and Aditya Mittal and Naman Jain and Tianjun Zhang and Ion Stoica and Chris Donahue and Ameet Talwalkar},
      year={2025},
      eprint={2502.09328},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2502.09328}, 
}

Read More

Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem

Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem

Figure 1: Training models to optimize test-time compute and learn “how to discover” correct responses, as opposed to the traditional learning paradigm of learning “what answer” to output.

The major strategy to improve large language models (LLMs) thus far has been to use more and more high-quality data for supervised fine-tuning (SFT) or reinforcement learning (RL). Unfortunately, it seems this form of scaling will soon hit a wall, with the scaling laws for pre-training plateauing, and with reports that high-quality text data for training maybe exhausted by 2028, particularly for more difficult tasks, like solving reasoning problems which seems to require scaling current data by about 100x to see any significant improvement. The current performance of LLMs on problems from these hard tasks remains underwhelming (see example). There is thus a pressing need for data-efficient methods for training LLMs that extend beyond data scaling and can address more complex challenges. In this post, we will discuss one such approach: by altering the LLM training objective, we can reuse existing data along with more test-time compute to train models to do better.

Current LLMs are Trained on “What” to Answer

The predominant principle for training models today is to supervise them into producing a certain output for an input. For instance, supervised fine-tuning attempts to match direct output tokens given an input akin to imitation learning and RL fine-tuning trains the response to optimize a reward function that is typically supposed to take the highest value on an oracle response. In either case, we are training the model to produce the best possible approximation to (y^star) it can represent. Abstractly, this paradigm trains models to produce a single input-output mapping, which works well when the goal is to directly solve a set of similar queries from a given distribution, but fails to discover solutions to out-of-distribution queries. A fixed, one-size-fits-all approach cannot adapt to the task heterogeneity effectively. We would instead want a robust model that is able to generalize to new, unseen problems by trying multiple approaches and seeking information to different extents, or expressing uncertainty when it is fully unable to fully solve a problem. How can we train models to satisfy these desiderata?

Learning “How to Answer” Can Generalize Beyond

To address the above issue, one emerging idea is to allow models to use test-time compute to find “meta” strategies or algorithms that can help them understand “how” to arrive at a good response. If you are new to test-time compute check out these papers, this excellent overview talk by Sasha Rush, and the NeurIPS tutorial by Sean Welleck et al. Implementing meta strategies that imbue a model with the capability of running a systematic procedure to arrive at an answer should enable extrapolation and generalization to input queries of different complexities at test time. For instance, if a model is taught what it means to use the Cauchy-Schwarz inequality, it should be able to invoke it at the right time on both easy and hard proof problems (potentially by guessing its usage, followed by a trial-and-error attempt to see if it can be applied in a given problem). In other words, given a test query, we want models to be capable of executing strategies that involve several atomic pieces of reasoning (e.g., several generation and verification attempts; several partially-completed solutions akin to search; etc) which likely come at the cost of spending more tokens. See Figure 2 for an example of two different strategies to attack a given problem. How can we train models to do so? We will formalize this goal into a learning problem and solve it via ideas from meta RL.

Figure 2: Examples of two algorithms and the corresponding stream of tokens generated by each algorithm. This includes tokens that are used to fetch relevant information from the model weights, plan the proof outline, verify intermediate results, and revise if needed. The first algorithm (left) generates an initial solution, verifies its correctness and revises if needed. The second algorithm (right) generates multiple solution strategies at once, and runs through each of them in a linear fashion before choosing the most promising strategy.

Formulating Learning “How” as an Objective

For every problem (x in mathcal{X}), say we have a reward function (r(x, cdot): mathcal{Y} mapsto {0,1}) that we can query on any output stream of tokens (y). For e.g., on a math reasoning problem (x), with token output stream (y), reward (r(x, y)) can be one that checks if some subsequence of tokens contains the correct answer. We are only given the dataset of training problems (mathcal{D}_mathrm{train}), and consequently the set of reward functions ({r(x, cdot) : x in mathcal{D}_mathrm{train}}). Our goal is to achieve high rewards on the distribution of test problems (mathcal{P}_text{test}), which are unknown apriori. The test problems can be of different difficulty compared to train problems.

For an unknown distribution of test problems (mathcal{P}_mathrm{test}), and a finite test-time compute budget (C), we can learn an algorithm (A in mathcal{A}_C (mathcal{D}_mathrm{train})) in the inference compute-constrained class of test-time algorithms (mathcal{A}_C) learned from the dataset of training problems (mathcal{D}_mathrm{train}). Each algorithm in this class takes as input the problem (x sim mathcal{P}_mathrm{test}), and outputs a stream of tokens. In Figure 2, we give some examples to build intuition for what this stream of tokens can be. For instance, (A_theta(x)) could consist of tokens that first correspond to some attempt at problem (x), then some verification tokens which predict the correctness of the attempt, followed by some refinement of the initial attempt (if verified to be incorrect), all stitched together in a “linear” fashion. Another algorithm (A_theta(x)) could be one that simulates some sort of heuristic-guided search in a linear fashion. The class of algorithms (mathcal{A}_C(mathcal{D}_mathrm{train})) would then consist of next token distributions induced by all possible (A_theta(x)) above. Note that in each of these examples, we hope to use more tokens to learn a generic but generalizing procedure as opposed to guessing the solution to the problem (x).

Our learning goal is to learn (A_theta(x)) , parameterized by an autoregressive LLM (A_theta(x)) (see Figure 1 for an illustration of tokens from (A_theta)). We refer to this entire stream (including the final answer) as a response (y sim A_theta(x)). The utility of algorithm (A_theta(x)) is given by its average correctness as measured by reward (r(x, y)). Hence, we can pose learning an algorithm as solving the following optimization problem:

$$max_{A_theta in mathcal{A}_C (mathcal{D}_text{train})} ; mathbb{E}_{x sim mathcal{P}_mathrm{test}} [ mathbb{E}_{y sim A_theta(x)} r(x, y) ; | ; mathcal{D}_text{train}] ~~~~~~~~~~ text{(Optimize “How” or Op-How)}.$$

Interpreting (Op-How) as a Meta RL Problem

The next question is: how can we solve the optimization problem (Op-How) over the class of compute-constrained algorithms (mathcal{A_c}), parameterized by a language model? Clearly, we do not know the outcomes for nor have any supervision for test problems. So, computing the outer expectation is futile. A standard LLM policy that guesses the best possible response for problem (x) also seems suboptimal because it could do better if it made full use of compute budget (C.) The main idea is that algorithms (A_theta(x) in mathcal{A}_c) that optimize (Op-How) resemble an adaptive policy in RL that uses the additional token budget to implement some sort of an algorithmic strategy to solve the input problem (x) (sort of like “in-context search” or “in-context exploration”). With this connection, we can take inspiration from how similar problems have been solved typically: by viewing (Op-How) through the lens of meta learning, specifically, meta RL: “meta” as we wish to learn algorithms and not direct answers to given problems & “RL” since (Op-How) is a reward maximization problem.

A very, very short primer on meta RL. Typically, RL trains a policy to maximize a given reward function in a Markov decision process (MDP). In contrast, the meta RL problem setting assumes access to a distribution of tasks (that each admit different reward functions and dynamics). The goal in this setting is to train the policy on tasks from this training distribution, such that it can do well on the test task drawn from the same or a different test distribution. Furthermore, this setting does not evaluate this policy in terms of its zero-shot performance on the test task, but lets it adapt to the test task by executing a few “training” episodes at test-time, after executing which the policy is evaluated. Most meta RL methods differ in the design of the adaptation procedure (e.g., (text{RL}^2) parameterizes this adaptation procedure via in-context RL; MAML runs explicit gradient updates at test time; PEARL adapts a latent variable identifying the task). We refer readers to this survey for more details.

Coming back to our setting, you might be wondering where the Markov decision process (MDP) and multiple tasks (for meta RL) come in. Every problem (x in mathcal{X}) induces a new RL task formalized as a Markov Decision Process (MDP) (M_x) with the set of tokens in the problem (x) as the initial state, every token produced by our LLM denoted by (A_theta(x)) as an action, and trivial deterministic dynamics defined by concatenating new tokens (in mathcal{T}) with the sequence of tokens thus far. Note, that all MDPs share the set of actions and also the set of states (mathcal{S} = mathcal{X} times cup_{h=1}^{H} mathcal{T}^h), which correspond to variable-length token sequences possible in the vocabulary. However, each MDP (M_x) admits a different unknown reward function given by the comparator (r(x, cdot)).

Then solving (Op-How) corresponds to finding a policy that can quickly adapt to the distribution of test problems (or test states) within the compute budget (C). Another way to view this notion of test-time generalization is through the lens of prior work called the epistemic POMDP, a construct that views learning a policy over family of (M_x) as a partially-observed RL problem. This perspective provides another way to motivate the need for adaptive policies and meta RL: for those who come from an RL background, it should not be surprising that solving a POMDP is equivalent to running meta RL. Hence, by solving a meta RL objective, we are seeking the optimal policy for this epistemic POMDP and enable generalization.

Before we go into specifics, a natural question to ask is why this meta RL perspective is interesting or useful, since meta RL is known to be hard. We believe that while learning policies from scratch entirely via meta RL is challenging, when applied to fine-tuning models that come equipped with rich priors out of pre-training, meta RL inspired ideas can be helpful. In addition, the meta RL problem posed above exhibits special structure (known and deterministic dynamics, different initial states), enabling us to develop non-general but useful meta RL algorithms.

How can the adaptive policy (LLM (A_theta)) adapt to a test problem (MDP (M_x))?

In meta RL, for each test MDP (M_x), the policy (A_theta) is allowed to gain information by spending test-time compute, before being evaluated on the final response generated by (A_theta). In the meta RL terminology, the information gained about the test MDP (M_x) can be thought of as collecting rewards on training episodes of the MDP induced by the test problem (x), before being evaluated on the test episode (see (text{RL}^2) paper; Section 2.2). Note that all of these episodes are performed once the model is deployed. Therefore, in order to solve (Op-How), we can view the entire stream of tokens from (A_theta(x)) as a stream split into several training episodes. For the test-time compute to be optimized, we need to ensure that each episode provides some information gain to do better in the subsequent episode of the test MDP (M_x). If there is no information gain, then learning (A_theta(x)) drops down to a standard RL problem — with a higher compute budget — and it becomes unclear if learning how is useful at all.

What kind of information can be gained? Of course, if external interfaces are involved within the stream of tokens we could get more information. However, are we exploiting free lunch if no external tools are involved? We remark that this is not the case and no external tools need to be involved in order to gain information as the stream of tokens progresses. Each episode in a stream could meaningfully add more information (for e.g., with separately-trained verifiers, or self-verification, done by (A_theta) itself) by sharpening the model’s posterior belief over the true reward function (r(x, cdot)) and hence the optimal response (y^star). That is, we can view spending more test-time compute as a way of sampling from the model’s approximation of the posterior over the optimal solution (P(cdot mid x, theta)), where each episode (or token in the output stream) refines this approximation. Thus, explicitly conditioning on previously-generated tokens can provide a computationally feasible way of representing this posterior with a fixed size LLM. This also implies that even in the absence of external inputs, we expect the mutual information (I(r(x, cdot); text{tokens so far}|x)) or (I(y^star; text{tokens so far}|x)) to increase as the more tokens are produced by (A_theta(x)).

As an example, let’s consider the response (A_theta(x)) that includes natural language verification tokens (see generative RMs) that assess intermediate generations. In this case, since all supervision comes from (A_theta) itself, we need an asymmetry between generation and verification for verification to induce information gain. Another idea is that when a model underfits on its training data, simply a longer length might also be able to provide significant information gain due to an increase in capacity (see Section 2 here). While certainly more work is needed to formalize these arguments, there are already some works on self-improvement that implicitly or explicitly exploit this asymmetry.

Putting it together, when viewed as a meta RL problem (A(cdot|cdot)) becomes a history-conditioned (“adaptive”) policy that optimizes reward (r) by spending computation of up to (C) on a given test problem. Learning an adaptive policy conditioned on past episodes is precisely the goal of black-box meta-reinforcement learning methods. Meta RL is also closely tied to the question of learning how to explore, and one can indeed view these additional tokens as providing strategic exploration for a given problem.

Figure 3: Agent-environment interaction protocol from the (text{RL}^2) paper. Each test problem (x) casts a new MDP (M_x). In this MDP, the agent interacts with the environment over multiple episodes. In our setting, this means that the stream of tokens in (A_theta(x)) comprises of multiple episodes, where (A_theta(x) ) uses the compute budget in each episode to gain information about the underlying MDP (M_x). All the gained information goes into the history (h_i), which evolves across the span of all the episodes. The algorithm (A_theta(x)) is trained to collect meaningful history in a fixed compute budget to be able to output a final answer that achieves high rewards in MDP (M_x).

Learning Adaptive Policies via Meta RL: Challenges & Algorithms

Figure 4: The response from this particular (A_theta(x)) includes a stream of tokens, where the information gain (I(r(x, cdot); text{tokens so far})) increases as we sample more tokens.

How can we solve such a meta RL problem? Perhaps the most obvious approach to solve meta RL problems is to employ black-box meta RL methods such as (text{RL}^2). This would involve maximizing the sum of rewards over the imagined “episodes” in the output trace (A_theta(x)). For instance, if (A_theta(x)) corresponds to using a self-correction strategy, the reward for each episode would grade individual responses appearing in the trace as shown in this prior work. If (A_theta(x)) instead prescribes a strategy that alternates between generation and generative verification, then rewards would correspond to success of generation and verification. We can then optimize:

$$max_theta ~mathbb{E}_{x sim mathcal{D}_text{train}, y sim A_theta(cdot|x)} left[ sum_{i=1}^{k} underbrace{tilde{r}_i(x, y_{j_{i-1}:j_{i}})}_{text{intermediate process reward}} + alpha cdot underbrace{r(x, y)}_{text{final correctness}} right]~~~~~~~ text{(Obj-1)},$$

where ({ j_i }_{i=1}^{k}) correspond to indices of the response that truncate the episodes marked and reward (tilde{r}_i) corresponds to a scalar reward signal for that episode (e.g., verification correctness for a verification segment, generation correctness for a generation segment, etc.) and in addition, we optimize the final correctness reward of the solution weighted by (alpha). Note that this formulation prescribes a dense, process-based reward for learning (note that this is not equivalent to using a step-level process reward model (PRM), but a dense reward bonus instead; connection between such dense reward bonuses and exploration can be found in this prior paper). In addition, we can choose to constrain the usage of compute by (A_theta(x)) to an upper bound (C) either explicitly via a loss term or implicitly (e.g., by chopping off the model’s generations that violate this budget).

The above paragraph is specific to generation and verification, and in general, the stream of output tokens may not be cleanly separable into generation and verification segments. In such settings, one could consider the more abstract form of the meta RL problem, which uses some estimate of information gain directly as the reward. One such estimate could be the metric used in the QuietSTaR paper, although it is not clear what the right way to define this metric is.

$$max_theta ~mathbb{E}_{x sim mathcal{D}_text{train}, y sim A_theta(cdot|x)} left[ sum_{i=1}^{k} underbrace{(I(r(x, cdot); y_{:j_{i}}) – I(r(x, cdot); y_{:j_{i-1}}))}_{text{information gain for segment }i} + alpha cdot underbrace{r(x, y)}_{text{final correctness}} right]~~~~~~~ text{(Obj-2)}.$$

One can solve (text{(Obj-1) and (Obj-2)}) via multi-turn RL approaches such as those based on policy gradients with intermediate dense rewards or based on actor-critic architectures (e.g., prior work ArCHer), and perhaps even the choice of RL approach (value-based vs. policy-based) may not matter as long as one can solve the optimization problem using some RL algorithm that performs periodic on-policy rollouts.

We could also consider a different approach for devising a meta RL training objective: one that only optimizes reward attained by the test episode (e.g., final answer correctness for the last attempt) and not the train episodes, thereby avoiding the need to quantify information gain. We believe that this would run into challenges of optimizing extremely sparse supervision at the end of a long trajectory (consisting of multiple reasoning segments or multiple “episodes” in meta RL terminology) with RL; dense rewards should be able to do better.

Challenges and open questions. There are quite a few challenges that we need to solve to instantiate this idea in practice as we list below.

  1. The first challenge lies in generalizing this framework to algorithm parameterizations (A_theta(x)) that produce token sequences do not meaningfully separate into semantic tasks (e.g., generation, verification, etc.). In this case, how can we provide dense rewards (tilde{r}_i)? We speculate that in such a setting (r_i) should correspond to some approximation of information gain towards producing the correct solution given input tokens, but it remains to be seen what this information gain or progress should mean.
  2. Ultimately, we will apply the above procedure to fine-tune a pre-trained or instruction-tuned model. How can we initialize the model (A_theta(cdot|cdot)) to be such that it can meaningfully produce an algorithm trace and not simply attempt the input query directly? Relatedly, how does the initialization from next-token prediction objective in pre-training or instruction-tuning affect optimizability of either (text{(Obj)}) objective above? Past work has observed severe memorization when using supervised fine-tuning to imbue (A_theta(cdot|cdot)) with a basis to learn self-correction behavior. It remains an open question as to whether this challenge is exacerbated in the most general setting and what can be done to alleviate it.
  3. Finally, we note that a critical condition to get meta learning to successfully work is the presence of ambiguity that it is possible to use experience collected on the test task to adapt the policy to it. It is unclear what a systematic way to introduce the above ambiguity is. Perhaps one approach is to use a large amount of training prompts such that there is little scope for memorizing the training data. This would also induce a bias towards using more available compute (C) for improving performance. But it remains unclear what the upper bound on this approach is.

Takeaways, Summary, and Limitations

We presented a connection between optimizing test-time compute for LLMs and meta RL. By viewing the optimization of test-time compute as the problem of learning an algorithm that figures how to solve queries at test time, followed by drawing the connection between doing so and meta RL provided us with training objectives that can efficiently use test-time compute. This perspective does potentially provide useful insights with respect to: (1) the role of intermediate process rewards that correspond to information gain in optimizing for test-time compute, (2) the role of model collapse and pre-trained initializations in learning meta strategies; and (3) the role of asymmetry as being the driver of test-time improvement n the absence of external feedback.

Of course, successfully instantiating formulations listed above would likely require specific and maybe even unexpected implementation details, that we do not cover and might be challenging to realize using the conceptual model discussed in this post. The challenges outlined may not cover the list of all possible challenges that arise with this approach. Nonetheless, we hope that this connection is useful in formally understanding test-time computation in LLMs.


Acknowledgements. We would like to thank Sasha Rush, Sergey Levine, Graham Neubig, Abhishek Gupta, Rishabh Agarwal, Katerina Fragkiadaki, Sean Welleck, Yi Su, Charlie Snell, Seohong Park, Yifei Zhou, Dzmitry Bahdanau, Junhong Shen, Wayne Chi, Naveen Raman, and Christina Baek for their insightful feedback, criticisms, discussions, and comments on an earlier version of this post. We would like to especially thank Rafael Rafailov for insightful discussions and feedback on the contents of this blog.

If you think this blog post is useful for your work, please consider citing it.

@misc{setlur2025opt,
author={Setlur, Amrith and Qu, Yuxiao and Zhang, Lunjun and Yang, Matthew and Smith, Virginia and Kumar, Aviral},
title={Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem,
howpublished = {url{https://blog.ml.cmu.edu/2025/01/08/optimizing-llm-test-time-compute-involves-solving-a-meta-rl-problem/}},
note = {CMU MLD Blog} ,
year={2025},
}

Read More

Inductive biases of neural network modularity in spatial navigation

Inductive biases of neural network modularity in spatial navigation

TL;DR: The brain may have evolved a modular architecture for daily tasks, with circuits featuring functionally specialized modules that match the task structure. We hypothesize that this architecture enables better learning and generalization than architectures with less specialized modules. To test this, we trained reinforcement learning agents with various neural architectures on a naturalistic navigation task. We found that the modular agent, with an architecture that segregates computations of state representation, value, and action into specialized modules, achieved better learning and generalization. Our results shed light on the possible rationale for the brain’s modularity and suggest that artificial systems can use this insight from neuroscience to improve learning and generalization in natural tasks.

Motivation

Despite the tremendous success of AI in recent years, it remains true that even when trained on the same data, the brain outperforms AI in many tasks, particularly in terms of fast in-distribution learning and zero-shot generalization to unseen data. In the emerging field of neuroAI (Zador et al., 2023), we are particularly interested in uncovering the principles underlying the brain’s extraordinary capabilities so that these principles can be leveraged to develop more versatile and general-purpose AI systems.

Given the same training data, the differing abilities of learning systems—biological or artificial—stem from their distinct assumptions about the data, known as inductive biases. For instance, if the underlying data distribution is linear, a linear model that assumes linearity can learn very quickly—by observing only a few points without needing to fit the entire dataset—and generalize effectively to unseen data. In contrast, another model with a different assumption, such as quadratic, cannot achieve the same performance. Even if it were a powerful universal function approximator, it would not achieve the same efficiency. The brain may have evolved inductive biases that align with the underlying structure of natural tasks, which explains its high efficiency and generalization abilities in such tasks.

What are the brain’s useful inductive biases? One perspective suggests that the brain may have evolved an inductive bias for a modular architecture featuring functionally specialized modules (Bertolero et al., 2015). Each module specializes in a specific aspect or a subset of task variables, collectively covering all demanding computations of the task. We hypothesize that this architecture enables higher efficiency in learning the structure of natural tasks and better generalization in tasks with a similar structure than those with less specialized modules.

Previous works (Goyal et al., 2022; Mittal et al., 2022) have outlined the potential rationale for this architecture: Data generated from natural tasks typically stem from the latent distribution of multiple task variables. Decomposing the task and learning these variables in distinct modules allow a better understanding of the relationships among these variables and therefore the data generation process. This modularization also promotes hierarchical computation, where independent variables are initially computed and then forwarded to other modules specialized in computing dependent variables. Note that “modular” may take on different meanings in different contexts. Here, it specifically refers to architectures with multiple modules, each specializing in one or a subset of the desired task variables. Architectures with multiple modules lacking enforced specialization in computing variables do not meet the criteria for modular in our context.

To test our hypothesis, it is essential to select a natural task and compare a modular architecture designed for the task with alternative architectures.

Task

We chose a naturalistic virtual navigation task (Figure 1) previously used to investigate the neural computations underlying animals’ flexible behaviors (Lakshminarasimhan et al., 2020). At the beginning of each trial, the subject is situated at the center of the ground plane facing forward; a target is presented at a random location within the field of view (distance: (100) to (400) cm, angle: (-35) to (+35^{circ})) on the ground plane and disappears after (300) ms. The subject can freely control its linear and angular velocities with a joystick (maximum: (200) cm/s and (90^{circ})/s, referred to as the joystick gain) to move along its heading in the virtual environment. The objective is to navigate toward the memorized target location, then stop inside the reward zone, a circular region centered at the target location with a radius of (65) cm. A reward is given only if the subject stops inside the reward zone.

Figure 1

The subject’s self-location is not directly observable because there are no stable landmarks; instead, the subject needs to use optic flow cues on the ground plane to perceive self-motion and perform path integration. Each textural element of the optic flow, an isosceles triangle, appears at random locations and orientations, disappearing after only a short lifetime ((sim 250) ms), making it impossible to use as a stable landmark. A new trial starts after the subject stops moving.

Task modeling

We formulate this task as a Partially Observable Markov Decision Process (POMDP; Kaelbling et al., 1998) in discrete time, with continuous state and action spaces (Figure 2). At each time step (t), the environment is in the state (boldsymbol{s}_t) (including the agent’s position and velocity, and the target’s position). The agent takes an action (boldsymbol{a}_t) (controlling its linear and angular velocities) to update (boldsymbol{s}_t) to the next state (boldsymbol{s}_{t+1}) following the environmental dynamics given by the transition probability (T(boldsymbol{s}_{t+1}|boldsymbol{s}_{t},boldsymbol{a}_{t})), and receives a reward (r_t) from the environment following the reward function (R(boldsymbol{s}_t,boldsymbol{a}_t)) ((1) if the agent stops inside the reward zone otherwise (0)).

We use a model-free actor-critic approach to learning, with the actor and critic implemented using distinct neural networks. At each (t), the actor receives two sources of inputs (boldsymbol{i}_t) about the state: observation (boldsymbol{o}_t) and last action (boldsymbol{a}_{t-1}). It then outputs an action (boldsymbol{a}_t), aiming to maximize the state-action value (Q_t). This value is a function of the state and action, representing the expected discounted rewards when an action is taken at a state, and future rewards are then accumulated from (t) until the trial’s last step. Since the ground truth value is unknown, the critic is used to approximate the value. In addition to receiving the same inputs (boldsymbol{i}_t) as the actor to infer the state, the critic also takes as inputs the action (boldsymbol{a}_t) taken by the actor in this state. It then outputs the estimated (Q_t) for this action, trained through the temporal-difference error (TD error) after receiving the reward (r_t) ((|r_t+gamma Q_{t+1}-Q_{t}|), where (gamma) denotes the temporal discount factor). In practice, our algorithm is off-policy and incorporates mechanisms such as two critic networks and target networks as in TD3 (fujimoto et al., 2018) to enhance training (see Materials and Methods in Zhang et al., 2024).

Figure 2

The state (boldsymbol{s}_t) is not fully observable, so the agent must maintain an internal state representation (belief (b_t)) for deciding (boldsymbol{a}_t) and (Q_t). Both actor and critic undergo end-to-end training through back-propagation without explicit objectives for shaping (b_t). Consequently, networks are free to learn diverse forms of (b_t) encoded in their neural activities that aid them in achieving their learning objectives. Ideally, networks may develop an effective belief update rule, e.g., recursive Bayesian estimation, using the two sources of evidence in the inputs (boldsymbol{i}_t={boldsymbol{o}_t, boldsymbol{a}_{t-1}}). They may predict the state (boldsymbol{s}_t) based on its internal model of the dynamics, its previous belief (b_{t-1}), and the last self-action (boldsymbol{a}_{t-1}). The second source is a partial and noisy observation (boldsymbol{o}_t) of (boldsymbol{s}_t) drawn from the observation probability (O(boldsymbol{o}_t|boldsymbol{s}_t)). Note that the actual (O) in the brain for this task is unknown. For simplicity, we model (boldsymbol{o}_t) as a low-dimensional vector, including the target’s location when visible (the first (300) ms, (Delta t=0.1) s), and the agent’s observation of its velocities through optic flow, with velocities subject to Gaussian additive noise.

Actor-critic RL agent

Each RL agent requires an actor and a critic network, and actor and critic networks can have a variety of architectures. Our goal here is to investigate whether functionally specialized modules provide advantages for our task. Therefore, we designed architectures incorporating modules with distinct levels of specialization for comparison. The first architecture is a holistic actor/critic, comprising a single module where all neurons jointly compute the belief (b_t) and the action (boldsymbol{a}_t)/value (Q_t). In contrast, the second architecture is a modular actor/critic, featuring modules specialized in computing different variables (Figure 3).

Figure 3

The specialization of each module is determined as follows.

First, we can confine the computation of beliefs. Since computing beliefs about the evolving state requires integrating evidence over time, a network capable of computing belief must possess some form of memory. Recurrent neural networks (RNNs) satisfy this requirement by using a hidden state that evolves over time. In contrast, computations of value and action do not need additional memory when the belief is provided, making memoryless multi-layer perceptrons (MLPs) sufficient. Consequently, adopting an architecture with an RNN followed by a memoryless MLP (modular actor/critic in Figure 3) ensures that the computation of belief is exclusively confined to the RNN.

Second, we can confine the computation of the state-action value (Q_t) for the critic. Since a critic is trained end-to-end to compute (Q_t), stacking two modules between all inputs and outputs does not limit the computation of (Q_t) to a specific module. However, since (Q_t) is a function of the action (boldsymbol{a}_t), we can confine the computation of (Q_t) to the second module of the modular critic in Figure 3 by supplying (boldsymbol{a}_t) only to the second module. This ensures that the first module, lacking access to the action, cannot accurately compute (Q_t). Therefore, the modular critic’s RNN is dedicated to computing (b_t) and sends it to the MLP dedicated to computing (Q_t). This architecture enforces modularity.

Besides the critic, the modular actor has higher specialization than the holistic actor, which lacks confined (b_t) computation. Thought bubbles in Figure 3 denote the variables that can be computed within each module enforced through architecture rather than indicating they are encoded in each module. For example, (b_t) in modular architectures is passed to the second module, but an accurate (b_t) computation can only be completed in the first RNN module.

Behavioral accuracy

We trained agents using all four combinations of these two actor and critic architectures. We refer to an agent whose actor and critic are both holistic or both modular as a holistic agent or a modular agent, respectively. Agents with modular critics demonstrated greater consistency across various random seeds and achieved near-perfect accuracy more efficiently than agents with holistic critics (Figure 4).

Figure 4

Agents’ behavior was compared with that of two monkeys (Figure 5 left) for a representative set of targets uniformly sampled on the ground plane (Figure 5 right).

Figure 5

We used a Receiver Operating Characteristic (ROC) analysis (Lakshminarasimhan et al., 2020) to systematically quantify behavioral accuracy. A psychometric curve for stopping accuracy is constructed from a large representative dataset by counting the fraction of rewarded trials as a function of a hypothetical reward boundary size (Figure 6 left, solid; radius (65) cm is the true size; infinitely small/large reward boundary leads to no/all rewarded trials). A shuffled curve is constructed similarly after shuffling targets across trials (Figure 6 left, dashed). Then, an ROC curve is obtained by plotting the psychometric curve against the shuffled curve (Figure 6 right). An ROC curve with a slope of (1) denotes a chance level (true(=)shuffled) with the area under the curve (AUC) equal to (0.5). High AUC values indicate that all agents reached good accuracy after training (Figure 6 right, inset).

Figure 6

Although all agents exhibited high stop location accuracy, we have noticed distinct characteristics in their trajectories (Figure 5 left). To quantify these differences, we examined two crucial trajectory properties: curvature and length. When tested on the same series of targets as the monkeys experienced, the difference between trajectories generated by agents with modular critics and those of monkey B was comparable to the variation between trajectories of two monkeys (Figure 7). In contrast, when agents used holistic critics, the difference in trajectories from monkey B was much larger, suggesting that modular critics facilitated more animal-like behaviors.

Figure 7

Behavioral efficiency

Agents are expected to develop efficient behaviors, as the value of their actions gets discounted over time. Therefore, we assess their efficiency throughout the training process by measuring the reward rate, which refers to the number of rewarded trials per second. We found that agents with modular critics achieved much higher reward rates, which explains their more animal-like efficient trajectories (Figure 8).

Figure 8

Together, these results suggest that modular critics provide a superior training signal compared to holistic critics, allowing actors to learn more optimal beliefs and actions. With a poor training signal from the holistic critic, the modularization of actors may not enhance performance. Next, we will evaluate the generalization capabilities of the trained agents.

An unseen task

One crucial aspect of sensorimotor mapping is the joystick gain, which linearly maps motor actions on the joystick (dimensionless, bounded in ([-1,1])) to corresponding velocities in the environment. During training, the gain remains fixed at (200) cm/s and (90^{circ})/s for linear and angular components, referred to as the (1times) gain. By increasing the gain to values that were not previously experienced, we create a gain task manipulation.

To assess generalization abilities, monkeys and agents were tested with novel gains of (1.5times) and (2times) (Figure 9).

Figure 9

Blindly following the same action sequence as in the training task would cause the agents to overshoot (no-generalization hypothesis: Figure 10 dashed lines). Instead, the agents displayed varying degrees of adaptive behavior (Figure 10 solid lines).

Figure 10

To quantitatively evaluate behavioral accuracy while also considering over-/under-shooting effects, we defined radial error as the Euclidean distance between the stop and target locations in each trial, with positive/negative sign denoting over-/under-shooting. Under the novel gains, agents with modular critics consistently exhibited smaller radial errors than agents with holistic critics (Figure 11), with the modular agent demonstrating the smallest errors, comparable to those observed in monkeys.

Figure 11

Neural analysis

Although we have confirmed that agents with distinct neural architectures exhibit varying levels of generalization in the gain task, the underlying mechanism remains unclear. We hypothesized that agents with superior generalization abilities should generate actions based on more accurate internal beliefs within their actor networks. Therefore, the goal next is to quantify the accuracy of beliefs across agents tested on novel gains, and to examine the impact of this accuracy on their generalization performance.

During the gain task, we recorded the activities of RNN neurons in the agents’ actors, as these neurons are responsible for computing the beliefs that underlie actions. To systematically quantify the accuracy of these beliefs, we used linear regression (with (ell_2) regularization) to decode agents’ locations from the recorded RNN activities for each gain condition (Figure 12).

Figure 12

We defined the decoding error, which represents the Euclidean distance between the true and decoded locations, as an indicator of belief accuracy. While all agents demonstrated small decoding errors under the training gain, we found that more holistic agents struggling with generalization under increased gains also displayed reduced accuracy in determining their own location (Figure 13 left). In fact, agents’ behavioral performance correlates with their belief accuracy (Figure 13 right).

Figure 13

Conclusion

The brain has evolved advantageous modular architectures for mastering daily tasks. Here, we investigated the impact of architectural inductive biases on learning and generalization using deep RL agents. We posited that an architecture with functionally specialized modules would allow agents to more efficiently learn essential task variables and their dependencies during training, and then use this knowledge to support generalization in novel tasks with a similar structure. To test this, we trained agents with architectures featuring distinct module specializations on a partially observable navigation task. We found that the agent using a modular architecture exhibited superior learning of belief and control actions compared to agents with weaker modular specialization.

Furthermore, for readers interested in the full paper, we also demonstrated that the modular agent’s beliefs closely resemble an Extended Kalman Filter, appropriately weighting information sources based on their relative reliability. Additionally, we presented several more architectures with varying levels of modularity and confirmed that greater modularity leads to better performance.

Read More

Human-AI Collaboration in Physical Tasks

Human-AI Collaboration in Physical Tasks

TL;DR: At SmashLab, we’re creating an intelligent assistant that uses the sensors in a smartwatch to support physical tasks such as cooking and DIY. This blog post explores how we use less intrusive scene understanding—compared to cameras—to enable helpful, context-aware interactions for task execution in their daily lives.

Thinking about AI assistants for tasks beyond just the digital world? Every day, we perform many tasks, including cooking, crafting, and medical self-care (like the COVID-19 self-test kit), which involve a series of discrete steps. Accurately executing all the steps can be difficult; when we try a new recipe, for example, we might have questions at any step and might make mistakes by skipping important steps or doing them in the wrong order.

This project, Procedural Interaction from Sensing Module (PrISM), aims to support users in executing these kinds of tasks through dialogue-based interactions. By using sensors such as a camera, wearable devices like a smartwatch, and privacy-preserving ambient sensors like a Doppler Radar, an assistant can infer the user’s context (what they are doing within the task) and provide contextually situated help.

Overview of the PrISM framework: multimodal sensing, user state tracking, context-aware interactions, and co-adaptation to achieve the shared goal.

To achieve human-like assistance, we must consider many things: how does the agent understand the user’s context? How should it respond to user’s spontaneous questions? When should it decide to intervene proactively? And most importantly, how do both human users and AI assistants evolve together through everyday interactions?

While different sensing platforms (e.g., cameras, LiDAR, Doppler Radars, etc.) can be used in our framework, we focus on a smartwatch-based assistant in the following. The smartwatch is chosen for its ubiquity, minimal privacy concerns compared to camera-based systems, and capability for monitoring a user across various daily activities.

Tracking User Actions with Multimodal Sensing

PrISM-Tracker uses a transition graph to improve frame-level multimodal Human Activity Recognition within procedural tasks.

Human Activity Recognition (HAR) is a technique to identify user activity contexts from sensors. For example, a smartwatch has motion and audio sensors to detect different daily activities such as hand washing and chopping vegetables [1]. However, out of the box, state-of-the-art HAR struggles from noisy data and less-expressive actions that are often part of daily life tasks.

PrISM-Tracker (IMWUT’22) [2] improves tracking by adding state transition information, that is, how users transition from one step to another and how long they usually spend at each step. The tracker uses an extended version of the Viterbi algorithm [3] to stabilize the frame-by-frame HAR prediction.

The latte-making task consists of 19 steps. PrISM-Tracker (right) improves the raw classifier’s tracking accuracy (left) with an extended version of the Viterbi algorithm.

As shown in the above figure, PrISM-Tracker improves the accuracy of frame-by-frame tracking. Still, the overall accuracy is around 50-60%, highlighting the challenge of using just a smartwatch to precisely track the procedure state at the frame level. Nevertheless, we can develop helpful interactions out of this imperfect sensing.

Responding to User Ambiguous Queries

Demo of PrISM-Q&A in a latte-making scenario (1:06-)

Voice assistants (like Siri and Amazon Alexa), capable of answering user queries during various physical tasks, have shown promise in guiding users through complex procedures. However, users often find it challenging to articulate their queries precisely, especially when unfamiliar with the specific vocabulary. Our PrISM-Q&A (IMWUT’24) [4] can resolve such issues with context derived from PrISM-Tracker.

Overview of how PrISM-Q&A processes user queries in real-time

When a question is posed, sensed contextual information is supplied to Large Language Models (LLMs) as part of the prompt context used to generate a response, even in the case of inherently vague questions like “What should I do next with this?” and “Did I miss any step?” Our studies demonstrated improved accuracy in question answering and preferred user experience compared to existing voice assistants in multiple tasks: cooking, latte-making, and skin care.

Because PrISM-Tracker can make mistakes, the output of PrISM-Q&A may also be incorrect. Thus, if the assistant uses the context information, the assistant first characterizes its current understanding of the context in the response to avoid confusing the user, for instance, “If you are washing your hands, then the next step is cutting vegetables.” This way, it tries to help users identify the error and quickly correct it interactively to get the desired answer.

Intervening with Users Proactively to Prevent Errors

Demo of PrISM-Observer in a cooking scenario (3:38-)

Next, we extended the assistant’s capability by incorporating proactive intervention to prevent errors. Technical challenges include noise in sensing data and uncertainties in user behavior, especially since users are allowed flexibility in the order of steps to complete tasks. To address these challenges, PrISM-Observer (UIST’24) [5] employs a stochastic model to try to account for uncertainties and determine the optimal timing for delivering reminders in real time.

PrISM-Observer continuously models the remaining time to the target step, which involves two uncertainties: the current step and the user’s future transition behavior.

Crucially, the assistant does not impose a rigid, predefined step-by-step sequence; instead, it monitors user behavior and intervenes proactively when necessary. This approach balances user autonomy and proactive guidance, enabling individuals to perform essential tasks safely and accurately.

Future Directions

Our assistant system has just been rolled out, and plenty of future work is still on the horizon.

Minimizing the data collection effort

To train the underlying human activity recognition model on the smartwatch and build a transition graph, we currently conduct 10 to 20 sessions of the task, each annotated with step labels. Employing a zero-shot multimodal activity recognition model and refining step granularity are essential for scaling the assistant to handle various daily tasks.

Co-adaptation of the user and AI assistant

In the health application, our assistants and users learn from each other over time through daily interactions to achieve a shared goal.

As future work, we’re excited to deploy our assistants in healthcare settings to support everyday care for post-operative skin cancer patients and individuals with dementia.

Mackay [6] introduced the idea of a human-computer partnership, where humans and intelligent agents collaborate to outperform either working alone. Also, reciprocal co-adaptation [7] refers to where both the user and the system adapt to and affect the others’ behavior to achieve certain goals. Inspired by these ideas, we’re actively exploring ways to fine-tune our assistant through interactions after deployment. This helps the assistant improve context understanding and find a comfortable control balance by exploring the mixed-initiative interaction design [8].

Conclusion

There are many open questions when it comes to perfecting assistants for physical tasks. Understanding user context accurately during these tasks is particularly challenging due to factors like sensor noise. Through our PrISM project, we aim to overcome these challenges by designing interventions and developing human-AI collaboration strategies. Our goal is to create helpful and reliable interactions, even in the face of imperfect sensing.

Our code and datasets are available on GitHub. We are actively working in this exciting research field. If you are interested, please contact Riku Arakawa (HCII Ph.D. student).

Acknowledgments

The author thanks every collaborator in the project. The development of the PrISM assistant for health applications is in collaboration with University Hospitals of Cleveland Department of Dermatology and Fraunhofer Portugal AICOS.

References

[1] Mollyn, V., Ahuja, K., Verma, D., Harrison, C., & Goel, M. (2022). SAMoSA: Sensing activities with motion and subsampled audio. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies6(3), 1-19.

[2] Arakawa, R., Yakura, H., Mollyn, V., Nie, S., Russell, E., DeMeo, D. P., … & Goel, M. (2023). Prism-tracker: A framework for multimodal procedure tracking using wearable sensors and state transition information with user-driven handling of errors and uncertainty. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies6(4), 1-27.

[3] Forney, G. D. (1973). The viterbi algorithm. Proceedings of the IEEE61(3), 268-278.

[4] Arakawa, R., Lehman, JF. & Goel, M. (2024) “Prism-q&a: Step-aware voice assistant on a smartwatch enabled by multimodal procedure tracking and large language models.” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(4), 1-26.

[5] Arakawa, R., Yakura, H., & Goel, M. (2024, October). PrISM-Observer: Intervention agent to help users perform everyday procedures sensed using a smartwatch. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (pp. 1-16).

[6] Mackay, W. E. (2023, November). Creating human-computer partnerships. In International Conference on Computer-Human Interaction Research and Applications (pp. 3-17). Cham: Springer Nature Switzerland.

[7] Beaudouin-Lafon, M., Bødker, S., & Mackay, W. E. (2021). Generative theories of interaction. ACM Transactions on Computer-Human Interaction (TOCHI), 28(6), 1-54.

[8] Allen, J. E., Guinn, C. I., & Horvtz, E. (1999). Mixed-initiative interaction. IEEE Intelligent Systems and their Applications, 14(5), 14-23.

Read More

ScribeAgent: Fine-Tuning Open-Source LLMs for Enhanced Web Navigation

ScribeAgent: Fine-Tuning Open-Source LLMs for Enhanced Web Navigation

TL;DR: LLM web agents are designed to predict a sequence of actions to complete a user-specified task. Most existing agents are built on top of general-purpose, proprietary models like GPT-4 and rely heavily on prompt engineering. We demonstrate that fine-tuning open-source LLMs using a large set of high-quality, real- world workflow data can improve performance while using a smaller LLM backbone, which can reduce serving costs.

As large language models (LLMs) continue to advance, a pivotal question arises when applying them to specialized tasks: should we fine-tune the model or rely on prompting with in-context examples? While prompting is straightforward and widely adopted, our recent work demonstrates that fine-tuning with in-domain data can significantly enhance performance over prompting in web navigation. In this blog post, we will introduce the paper “ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data“, where we show fine-tuning a 7B open-source LLM using large-scale, high-quality, real-world web workflow data can surpass closed-source models such as GPT-4 and o1-preview on web navigation tasks. This result underscores the immense potential of specialized fine-tuning in tackling complex reasoning tasks.

Background: LLM Web Agents and the Need for Fine-Tuning

LLM-powered automated agents have emerged as a significant research domain, with “web agents” being one popular direction. These agents can navigate websites to solve real-world tasks. To do so, the user first defines a high-level objective. The agent then outputs step-by-step actions based on the user’s goal, current observation, and interaction history. For text-only agents, the observation typically includes the website’s URL, the webpage itself, and possibly the accessibility tree used by assistive technologies (see the introduction figure). The agent can then perform actions such as keyboard and mouse operations.

Existing web agents rely heavily on prompting general-purpose, proprietary LLMs like GPT-4. To leverage LLMs for web navigation, previous research explores various prompting techniques:

  • Better planning ability: Several studies employ advanced search strategies to enable agents to plan ahead and select the optimal action in the long term (e.g., SteP, Tree Search).
  • Better reasoning ability: Techniques like self-feedback and iterative refinement allow agents to improve their own actions iteratively (e.g., AdaPlanner, Bagel). Incorporating external evaluators provides an additional layer of oversight (e.g., Agent Eval & Refine).
  • Memory usage: By employing memory databases, agents can retrieve past trajectories to use as demonstrations for current tasks. This helps agents learn from previous interactions (e.g., AWM, Synapse).

While these approaches are effective, the resulting agents perform significantly below human levels on standard benchmarks, such as Mind2Web and WebArena. This occurs because of the following challenges:

  • Lack of web-specific knowledge: General-purpose LLMs are not specifically trained to interpret web-specific languages like HTML.
  • Limited planning and exploration ability: LLMs are not developed to perform sequential reasoning over a long horizon, where the agent must remember past actions, understand the evolving state of the environment, perform active exploration, and plan several steps ahead to achieve a goal.
  • Practical constraints: Reliance on proprietary models can lead to increased costs and dependency on a single provider. Real-time web interaction can require a large amount of API calls. Any changes in the provider’s service terms, pricing, or availability can affect the agent’s functionality.
Figure 1. General-purpose LLMs like GPT-4 are not specifically trained to effectively parse languages like HTML, limiting the capability of traditional web agents that prompt these models for planning and reasoning. ScribeAgent changes the game by specializing LLMs for solving web tasks.

Fine-tuning open-source LLMs offers an appealing way to address these challenges (Figure 1). However, fine-tuning comes with its own set of important questions. For example, how can we obtain sufficient domain-specific datasets to train the model effectively? How should we formulate the input prompts and outputs to align with the pre-trained model and the web navigation tasks? Which models should we fine-tune? Addressing these questions is crucial to unlocking the full potential of open-source LLMs for web navigation.

Introducing ScribeAgent: Fine-Tuning with In-Domain Data

ScribeAgent is developed by adapting open-source LLMs for web navigation by fine-tuning on in-domain data instead of prompting-based methods. We introduce two key aspects to make fine-tuning successful: (1) Constructing a large-scale, high-quality dataset and (2) fine-tuning LLMs to leverage this data.

Step 1: Crafting a Large-Scale, High-Quality Dataset

We collaborated with Scribe, an AI workflow documentation software that streamlines the creation of step-by-step guides for web-based tasks. Scribe allows users to record their web interactions via a browser extension, converting them into well-annotated instructions for specific business needs. See Figure 2 for an example scribe.

Figure 2. An example Scribe workflow (click here to see the full trajectory).

This collaboration provided access to a vast database of real-world, high-quality web workflows annotated by actual users. These workflows cover a variety of web domains, including social platforms like Facebook and LinkedIn; shopping sites like Amazon and Shopify; productivity tools like Notion and Calendly; and many others. Each workflow features a high-level user objective and a sequence of steps to achieve the task. Each step contains (1) the current web page’s URL, (2) raw HTML, (3) a natural language description of the action performed, (4) the type of action, like click or type, and (5) the HTML element that is the target of the action.

The raw HTML data of real-world websites can be exceedingly long, often ranging from 10K to 100K tokens, surpassing the context window of most open-source LLMs. To make the data manageable for fine-tuning, we implemented a pruning algorithm that retains essential structure and content while eliminating redundant elements. Finally, we reformat the dataset into a next-step prediction task: The input consists of the user objective, the current web page’s URL, the processed HTML, and the previous actions. The agent is expected to generate the next action based on the input. We highlight the following characteristics for the resulting dataset:

  • Scale: Covers over 250 domains and 10,000 subdomains.
  • Task length: Average 11 steps per task.
  • Training tokens: Approximately 6 billion.

This dataset’s scale and quality are unparalleled in prior web agent research.

Step 2: Fine-Tuning Open-Source LLMs

After obtaining the dataset, we faced two critical decisions: which model to fine-tune and how to fine-tune it. To probe into these questions, we leverage the dataset and perform a series of ablation studies:

  • LLM backbone: Mistral, Qwen, LLaMA
  • Model size: small (<10B parameters), medium (10–30B parameters), large (>30B parameters)
  • Context window: 32K tokens vs. 65K tokens
  • Fine-tuning method: Full fine-tuning vs. LoRA
Figure 3. Performance of different LLMs fine-tuned on 1B workflow tokens on the test split of our proprietary dataset. EM is short for the Exact Match metric (higher is better).

We fine-tuned each model variant on the same training dataset and evaluated their performance on a test set. The detailed results are available in our paper and Figure 3, but the key takeaways are:

  • The Qwen family significantly outperformed Mistral and LLaMA models, both before and after fine-tuning.
  • Increasing the model size and context window length consistently led to improved performance.
  • While full fine-tuning has a slight performance gain over parameter-efficient fine-tuning, it requires much more GPU, memory, and time. On the other hand, LoRA reduced computational requirements without compromising performance.

Based on the ablation study results, we develop two versions of ScribeAgent by fine-tuning open-source LLMs using LoRA:

  • ScribeAgent-Small: Based on Qwen2 Instruct 7B; cost-effective and efficient for inference.
  • ScribeAgent-Large: Based on Qwen2.5 Instruct 32B; superior performance in internal and external evaluations.

Empirical Results: Fine-Tuned Models Surpass GPT-4-Based Agents

We evaluated ScribeAgent on three datasets: our proprietary test set, derived from the real-world workflows we collected; the text-based Mind2Web benchmark; and the interactive WebArena.

Figure 4. ScribeAgent outperforms GPT-4o/o1-preview on our proprietary dataset while achieving better inference efficincy.

On our proprietary dataset, we observed that ScribeAgent significantly outperforms proprietary models like GPT-4o, GPT-4o mini, o1-mini, and o1-preview, showcasing the benefits of specialized fine-tuning over general-purpose LLMs (Figure 4). Notably, ScribeAgent-Small has only 7B parameters and ScribeAgent-Large has 32B parameters, neither requiring additional scaling during inference. In contrast, these proprietary baselines are typically larger and demand more computational resources at inference time, making ScribeAgent a better choice in terms of accuracy, latency, and cost. In addition, while the non-fine-tuned Qwen2 model performs extremely poorly, fine-tuning it with our dataset boosts its performance by nearly sixfold, highlighting the importance of domain-specific data. 

Figure 5. ScribeAgent achieves state-of-the-art zero-shot performance on Mind2Web.

As for Mind2Web, we followed the benchmark setup and tested our agents in two settings: multi-stage QA and direct generation. The multi-stage QA setting leverages a pretrained element-ranking model to filter out more likely candidate elements from the full HTML and ask the agent to select one option from the candidate list. The direct generation setting is much more challenging and requires the agent to directly generate an action based on the full HTML. To evaluate ScribeAgent’s generalization performance, we did not fine-tune it on the Mind2Web training data, so the evaluation is zero-shot.

Our results highlight that, for multi-stage evaluation, ScribeAgent-Large achieves the best overall zero-shot performance. Its element accuracy and step success rate metrics are also competitive with the best-fine-tuned baseline, HTML-T5-XL, on cross-website and cross-domain tasks. In the direct generation setting, ScribeAgent-Large outperforms all existing baselines, with step success rates 2-3 times higher than those achieved by the fine-tuned Flan-T5. 

The primary failure cases of our models result from the distribution mismatch between our training data and the synthetic Mind2Web data. For instance, our agent might predict another element with identical function but different from the ground truth. It also decomposes typing actions into a click followed by a typing action, whereas Mind2Web expects a single type. These issues can be addressed by improving the evaluation procedure. After resolving these problems, we observed an average of 8% increase in task success rate and element accuracy for ScribeAgent.

Evaluation on WebArena is more complicated. First, WebArena expects actions specified in the accessibility tree format, whereas ScribeAgent outputs actions in HTML format. Second, the interactive nature of WebArena requires the agent to decide when to terminate the task. To address these challenges, we developed a multi-agent system that leverages GPT-4o for action translation and task completeness evaluation.

Figure 6. Task success rates on five web domains. ScribeAgent outperforms all considered baselines, improving the previous-best results by 5-10%.

Compared to existing text-only agents, ScribeAgent augmented with GPT-4o achieved the highest task success rate across 4 of 5 domains in WebArena and improved the previous best total success rate by 7.3% (Figure 6). In domains more aligned with our training data, such as Reddit and GitLab, ScribeAgent demonstrated stronger generalization capabilities and higher success rates. We refer the readers to our paper for more experiment details on all three benchmarks.

Conclusion

In summary, ScribeAgent demonstrates that fine-tuning open-source LLMs with high-quality, in-domain data can outperform even the most advanced prompting methods. While our results are promising, there are limitations to consider. ScribeAgent was developed primarily to showcase the effectiveness of fine-tuning and does not incorporate external reasoning and planning modules; integrating these techniques could further improve its performance. Additionally, expanding ScribeAgent’s capabilities to handle multi-modal inputs, such as screenshots, can make it more versatile and robust in real-world web environments.

To learn more about ScribeAgent and explore our detailed findings, we invite you to read our full paper. The project’s progress, including future enhancements and updates, can be followed on our GitHub repository. Stay tuned for upcoming model releases!

Read More