Eliciting In-context Retrieval and Reasoning for Long-Context Language Models

Recent advancements in long-context language models (LCLMs) have the potential to transform Retrieval-Augmented Generation (RAG) by simplifying its pipelines. With their extended context windows, LCLMs can ingest an entire knowledge base and perform retrieval and reasoning directly in context, without a separate retriever. We refer to this capability as In-Context Retrieval and Reasoning (ICR2). However, existing benchmarks such as LOFT often overestimate LCLM performance because they lack sufficiently challenging contexts. To address this, we introduce ICR2, a benchmark designed for more realistic evaluation and training of LCLMs. This…
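
To make the contrast with a conventional retriever-based RAG pipeline concrete, below is a minimal Python sketch of the in-context setting described above: the whole knowledge base is placed in the LCLM's context window and the model retrieves and reasons in a single call. The corpus, prompt format, and `lclm_generate` stub are illustrative assumptions, not the paper's actual setup.

```python
from typing import Callable

def build_icr2_prompt(corpus: list[str], question: str) -> str:
    """Concatenate every document into one long context, then ask the question."""
    docs = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(corpus, start=1)
    )
    return (
        f"{docs}\n\n"
        "Answer the question using only the documents above, "
        "and cite the supporting document ids.\n"
        f"Question: {question}\nAnswer:"
    )

def answer_in_context(
    corpus: list[str],
    question: str,
    lclm_generate: Callable[[str], str],
) -> str:
    """One-call pipeline: no external retriever; the LCLM does both steps itself."""
    return lclm_generate(build_icr2_prompt(corpus, question))

if __name__ == "__main__":
    corpus = [
        "Paris is the capital of France.",
        "The Eiffel Tower was completed in 1889.",
    ]
    # Hypothetical stub standing in for any long-context model API.
    echo = lambda prompt: f"(model output for a {len(prompt)}-char prompt)"
    print(answer_in_context(corpus, "When was the Eiffel Tower completed?", echo))
```

The point of the sketch is the pipeline shape, not the prompt wording: retrieval quality now depends entirely on the model's ability to locate the relevant documents inside a very long context, which is exactly what a sufficiently challenging benchmark must stress-test.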