Computation and Language
♻ ☆ ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers ICCV 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
                                    
                                    
                                          Multimodal Large Language Models (MLLMs) suffer from high computational costs
due to their massive size and the large number of visual tokens. In this paper,
we investigate layer-wise redundancy in MLLMs by introducing a novel metric,
Layer Contribution (LC), which quantifies the impact of a layer's
transformations on visual and text tokens, respectively. The calculation of LC
involves measuring the divergence in model output that results from removing
the layer's transformations on the specified tokens. Our pilot experiment
reveals that many layers of MLLMs exhibit minimal contribution during the
processing of visual tokens. Motivated by this observation, we propose ShortV,
a training-free method that leverages LC to identify ineffective layers, and
freezes visual token updates in these layers. Experiments show that ShortV can
freeze visual token in approximately 60\% of the MLLM layers, thereby
dramatically reducing computational costs related to updating visual tokens.
For example, it achieves a 50\% reduction in FLOPs on LLaVA-NeXT-13B while
maintaining superior performance. The code will be publicly available at
https://github.com/icip-cas/ShortV
                                    
                                        
                                            comment: Published as a conference paper at ICCV 2025. Project page:
  https://github.com/icip-cas/ShortV
                                        
                                ♻ ☆ JobHop: A Large-Scale Dataset of Career Trajectories
                                          Understanding labor market dynamics is essential for policymakers, employers,
and job seekers. However, comprehensive datasets that capture real-world career
trajectories are scarce. In this paper, we introduce JobHop, a large-scale
public dataset derived from anonymized resumes provided by VDAB, the public
employment service in Flanders, Belgium. Utilizing Large Language Models
(LLMs), we process unstructured resume data to extract structured career
information, which is then normalized to standardized ESCO occupation codes
using a multi-label classification model. This results in a rich dataset of
over 1.67 million work experiences, extracted from and grouped into more than
361,000 user resumes and mapped to standardized ESCO occupation codes, offering
valuable insights into real-world occupational transitions. This dataset
enables diverse applications, such as analyzing labor market mobility, job
stability, and the effects of career breaks on occupational transitions. It
also supports career path prediction and other data-driven decision-making
processes. To illustrate its potential, we explore key dataset characteristics,
including job distributions, career breaks, and job transitions, demonstrating
its value for advancing labor market research.
                                    
                                ♻ ☆ Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling
                                        
                                            
                                        
                                        
                                            
                                        
                                        He Hu, Yucheng Zhou, Juzheng Si, Qianning Wang, Hengheng Zhang, Fuji Ren, Fei Ma, Laizhong Cui, Qi Tian
                                    
                                    
                                          Large language models (LLMs) hold significant potential for mental health
support, capable of generating empathetic responses and simulating therapeutic
conversations. However, existing LLM-based approaches often lack the clinical
grounding necessary for real-world psychological counseling, particularly in
explicit diagnostic reasoning aligned with standards like the DSM/ICD and
incorporating diverse therapeutic modalities beyond basic empathy or single
strategies. To address these critical limitations, we propose PsyLLM, the first
large language model designed to systematically integrate both diagnostic and
therapeutic reasoning for mental health counseling. To develop PsyLLM, we
design a novel automated data synthesis pipeline that processes real-world
mental health posts collected from Reddit, where users frequently share
psychological distress and seek community support. This pipeline processes
real-world mental health posts, generates multi-turn dialogue structures, and
leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and
multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate
detailed clinical reasoning processes. Rigorous multi-dimensional filtering
ensures the generation of high-quality, clinically aligned dialogue data. In
addition, we introduce a new benchmark and evaluation protocol, assessing
counseling quality across four key dimensions. Our experiments demonstrate that
PsyLLM significantly outperforms state-of-the-art baseline models on this
benchmark. The model weights and dataset have been publicly released at
https://github.com/Emo-gml/PsyLLM.
                                    
                                ♻ ☆ Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation
                                        
                                            
                                        
                                        
                                            
                                        
                                        Cécile Rousseau, Tobia Boschi, Giandomenico Cornacchia, Dhaval Salwala, Alessandra Pascale, Juan Bernabe Moreno
                                    
                                    
                                          SDForger is a flexible and efficient framework for generating high-quality
multivariate time series using LLMs. Leveraging a compact data representation,
SDForger provides synthetic time series generation from a few samples and
low-computation fine-tuning of any autoregressive LLM. Specifically, the
framework transforms univariate and multivariate signals into tabular
embeddings, which are then encoded into text and used to fine-tune the LLM. At
inference, new textual embeddings are sampled and decoded into synthetic time
series that retain the original data's statistical properties and temporal
dynamics. Across a diverse range of datasets, SDForger outperforms existing
generative models in many scenarios, both in similarity-based evaluations and
downstream forecasting tasks. By enabling textual conditioning in the
generation process, SDForger paves the way for multimodal modeling and the
streamlined integration of time series with textual information. The model is
open-sourced at
https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.
                                    
                                ♻ ☆ Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
                                          Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which
attempt to elicit harmful responses from LLMs. The evolving nature and
diversity of these attacks pose many challenges for defense systems, including
(1) adaptation to counter emerging attack strategies without costly retraining,
and (2) control of the trade-off between safety and utility. To address these
challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for
jailbreak detection that incorporates a database of known attack examples into
Retrieval-Augmented Generation, which is used to infer the underlying,
malicious user query and jailbreak strategy used to attack the system. RAD
enables training-free updates for newly discovered jailbreak strategies and
provides a mechanism to balance safety and utility. Experiments on StrongREJECT
show that RAD substantially reduces the effectiveness of strong jailbreak
attacks such as PAP and PAIR while maintaining low rejection rates for benign
queries. We propose a novel evaluation scheme and show that RAD achieves a
robust safety-utility trade-off across a range of operating points in a
controllable manner.
                                    
                                ♻ ☆ Verbalized Algorithms NeurIPS 2025
                                          Instead of querying LLMs in a one-shot manner and hoping to get the right
answer for a reasoning task, we propose a paradigm we call \emph{verbalized
algorithms} (VAs), which leverage classical algorithms with established
theoretical understanding. VAs decompose a task into simple elementary
operations on natural language strings that they should be able to answer
reliably, and limit the scope of LLMs to only those simple tasks. For example,
for sorting a series of natural language strings, \emph{verbalized sorting}
uses an LLM as a binary comparison oracle in a known and well-analyzed sorting
algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of
this approach on sorting and clustering tasks.
                                    
                                        
                                            comment: Accepted in NeurIPS 2025 Workshop on Efficient Reasoning
                                        
                                ♻ ☆ What is the Role of Small Models in the LLM Era: A Survey
                                          Large Language Models (LLMs) have made significant progress in advancing
artificial general intelligence (AGI), leading to the development of
increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up
model sizes results in exponentially higher computational costs and energy
consumption, making these models impractical for academic researchers and
businesses with limited resources. At the same time, Small Models (SMs) are
frequently used in practical settings, although their significance is currently
underestimated. This raises important questions about the role of small models
in the era of LLMs, a topic that has received limited attention in prior
research. In this work, we systematically examine the relationship between LLMs
and SMs from two key perspectives: Collaboration and Competition. We hope this
survey provides valuable insights for practitioners, fostering a deeper
understanding of the contribution of small models and promoting more efficient
use of computational resources. The code is available at
https://github.com/tigerchen52/role_of_small_models
                                    
                                        
                                            comment: a survey paper of small models
                                        
                                ♻ ☆ Hebrew Diacritics Restoration using Visual Representation
                                          Diacritics restoration in Hebrew is a fundamental task for ensuring accurate
word pronunciation and disambiguating textual meaning. Despite the language's
high degree of ambiguity when unvocalized, recent machine learning approaches
have significantly advanced performance on this task.
  In this work, we present DIVRIT, a novel system for Hebrew diacritization
that frames the task as a zero-shot classification problem. Our approach
operates at the word level, selecting the most appropriate diacritization
pattern for each undiacritized word from a dynamically generated candidate set,
conditioned on the surrounding textual context. A key innovation of DIVRIT is
its use of a Hebrew Visual Language Model, which processes undiacritized text
as an image, allowing diacritic information to be embedded directly within the
input's vector representation.
  Through a comprehensive evaluation across various configurations, we
demonstrate that the system effectively performs diacritization without relying
on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting
where the correct diacritized form is guaranteed to be among the provided
candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic
architectural enhancements and optimized training methodologies yield
significant improvements in the system's overall generalization capabilities.
These findings highlight the promising potential of visual representations for
accurate and automated Hebrew diacritization.
                                    
                                ♻ ☆ Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
                                          We reinterpret Kant's Critique of Pure Reason as a theory of feedback
stability, viewing reason as a regulator that keeps inference within the bounds
of possible experience. We formalize this intuition via a composite instability
index (H-Risk) combining spectral margin, conditioning, temporal sensitivity,
and innovation amplification. In linear-Gaussian simulations, higher H-Risk
predicts overconfident errors even under formal stability, revealing a gap
between nominal and epistemic stability. Extending to large language models
(LLMs), we observe preliminary correlations between internal fragility and
miscalibration or hallucination (confabulation), and find that lightweight
critique prompts may modestly improve or worsen calibration in small-scale
tests. These results suggest a structural bridge between Kantian
self-limitation and feedback control, offering a principled lens to diagnose
and potentially mitigate overconfidence in reasoning systems.
                                    
                                        
                                            comment: 21 pages, 2 figures, preliminary version
                                        
                                ♻ ☆ New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models LREC
                                          Encoders remain essential for efficient German NLP and NLU scenarios despite
the rise of decoder-only LLMs. This work studies two routes to high-quality
German encoders under identical data and training constraints: 1) training from
scratch and 2) converting decoders via LLM2Vec. We introduce two resources:
ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT
style, and LL\"aMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions
trained with masked next-token prediction, both undergoing a context extension
to 8.192 tokens.
  Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808),
surpassing GBERT Large (+4%) and the seven-times larger converted 7B model
(0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551)
approaches the converted 7B model (0.557).
  We release all models, checkpoints, datasets, and full training records, and
introduce an encoder-adapted QA-NIAH evaluation. All in all, our results
provide actionable guidance: when parameter efficiency and latency matter,
from-scratch encoders dominate. When a pre-trained decoder exists and compute
is a limited, conversion offers an effective alternative. ModernGBERT and
LL\"aMmleinVec, including all code, data and intermediary checkpoints are
published under a research-only RAIL license.
                                    
                                        
                                            comment: under review @LREC
                                        
                                ♻ ☆ JudgeLRM: Large Reasoning Models as a Judge
                                          Large Language Models (LLMs) are increasingly adopted as evaluators, offering
a scalable alternative to human annotation. However, existing supervised
fine-tuning (SFT) approaches often fall short in domains that demand complex
reasoning. Judgment is inherently reasoning-intensive: beyond surface-level
scoring, it requires verifying evidence, identifying errors, and justifying
decisions. Through the analysis of evaluation tasks, we find a negative
correlation between SFT performance gains and the proportion of
reasoning-demanding samples, revealing the limits of SFT in such scenarios. To
address this, we introduce JudgeLRM, a family of judgment-oriented LLMs,
trained using reinforcement learning (RL) with judge-wise, outcome-driven
rewards to activate reasoning capabilities. JudgeLRM consistently outperform
SFT-tuned baselines in the same size, as well as other RL and SFT variants, and
even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceeds
GPT-4, while JudgeLRM-7B/8B/14B outperforms DeepSeek-R1 by over 2% in F1 score,
with particularly strong gains on reasoning-heavy tasks. Our findings
underscore the value of RL in unlocking reasoning-aligned LLM judges.
                                    
                                        
                                            comment: Preprint
                                        
                                ♻ ☆ Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time ICLR 2026
                                        
                                            
                                        
                                        
                                            
                                        
                                        Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
                                    
                                    
                                          Language model finetuning often results in learning undesirable traits in
combination with desired ones. To address this, we propose inoculation
prompting: modifying finetuning data by prepending a short system-prompt
instruction that deliberately elicits the undesirable trait. At test time, we
evaluate without the instruction; inoculated models have much lower expression
of the trait than models trained with unmodified training data. Inoculation is
selective: in a toy setting where assistant responses are always in Spanish and
ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'')
teaches the model to capitalize responses while still responding in English. We
find that inoculation is also effective across several additional settings:
reducing emergent misalignment (EM) from task-specific finetuning, defending
against backdoor injections, and mitigating the transmission of traits via
subliminal learning. Follow-up analysis suggests a mechanism: making a trait
less surprising via inoculation reduces optimization pressure to globally
update the model, thereby reducing the degree of generalization. Our analysis
relates to prior work on EM: inoculation explains prior findings that
educational contexts mitigate EM from insecure code. Beyond demonstrating a
simple and effective technique for selective learning, our results contribute
to a better conceptual understanding of how and why language models generalize.
                                    
                                        
                                            comment: 40 pages, 22 figures. Under review at ICLR 2026
                                        
                                ♻ ☆ Eye Tracking Based Cognitive Evaluation of Automatic Readability Assessment Measures
                                          Methods for scoring text readability have been studied for over a century,
and are widely used in research and in user-facing applications in many
domains. Thus far, the development and evaluation of such methods have
primarily relied on two types of offline behavioral data, performance on
reading comprehension tests and ratings of text readability levels. In this
work, we instead focus on a fundamental and understudied aspect of readability,
real-time reading ease, captured with online reading measures using eye
tracking. We introduce an evaluation framework for readability scoring methods
which quantifies their ability to account for reading ease, while controlling
for content variation across texts. Applying this evaluation to prominent
traditional readability formulas, modern machine learning systems, frontier
Large Language Models and commercial systems used in education, suggests that
they are all poor predictors of reading ease in English. This outcome holds
across native and non-native speakers, reading regimes, and textual units of
different lengths. The evaluation further reveals that existing methods are
often outperformed by word properties commonly used in psycholinguistics for
prediction of reading times. Our results highlight a fundamental limitation of
existing approaches to readability scoring, the utility of psycholinguistics
for readability research, and the need for new, cognitively driven readability
scoring approaches that can better account for reading ease.
                                    
                                ♻ ☆ RadarPLM: Adapting Pretrained Language Models for Marine Radar Target Detection with Preference-aware Loss
                                          Recent advances in pre-trained language models (PLMs) have demonstrated their
capabilities in capturing universal knowledge, making them promising
applications for radar signal processing. Nevertheless, directly fine-tuning
PLMs on radar signals is both computationally expensive and prone to
overfitting, particularly in low signal-to-clutter ratio (SCR) environments. In
this paper, we propose a novel fine-tuning framework for PLM-based marine radar
target detection. First, we design a lightweight adaptation module, enabling
parameter-efficient fine-tuning while preserving the pretrained model's general
knowledge. Second, a novel preference-aware loss is developed to selectively
optimize different feature patches based on their online evaluated learning
values, guiding the model to concentrate on the most generalizable feature
patterns during optimization. Extensive experiments on real-world marine radar
datasets demonstrate that the proposed finetuning framework achieves an average
performance improvement of 9.9% over the standard approach under low SCR
conditions. Furthermore, the fine-tuned model, RadarPLM, consistently
outperforms state-of-the-art detectors, particularly when training data are
limited.
                                    
                                ♻ ☆ Representation Consistency for Accurate and Coherent LLM Answer Aggregation NeurIPS 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni
                                    
                                    
                                          Test-time scaling improves large language models' (LLMs) performance by
allocating more compute budget during inference. To achieve this, existing
methods often require intricate modifications to prompting and sampling
strategies. In this work, we introduce representation consistency (RC), a
test-time scaling method for aggregating answers drawn from multiple candidate
responses of an LLM regardless of how they were generated, including variations
in prompt phrasing and sampling strategy. RC enhances answer aggregation by not
only considering the number of occurrences of each answer in the candidate
response set, but also the consistency of the model's internal activations
while generating the set of responses leading to each answer. These activations
can be either dense (raw model activations) or sparse (encoded via pretrained
sparse autoencoders). Our rationale is that if the model's representations of
multiple responses converging on the same answer are highly variable, this
answer is more likely to be the result of incoherent reasoning and should be
down-weighted during aggregation. Importantly, our method only uses cached
activations and lightweight similarity computations and requires no additional
model queries. Through experiments with four open-source LLMs and four
reasoning datasets, we validate the effectiveness of RC for improving task
performance during inference, with consistent accuracy improvements (up to 4%)
over strong test-time scaling baselines. We also show that consistency in the
sparse activation signals aligns well with the common notion of coherent
reasoning.
                                    
                                        
                                            comment: Accepted at NeurIPS 2025. Camera-ready version
                                        
                                ♻ ☆ Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation AACL
                                        
                                            
                                        
                                        
                                            
                                        
                                        Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
                                    
                                    
                                          Large language models (LLMs) exhibit strong multilingual abilities, yet the
neural mechanisms behind language-specific processing remain unclear. We
analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and
Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying
neurons that control language behavior. Using the Language Activation
Probability Entropy (LAPE) method, we show that these neurons cluster in deeper
layers, with non-Latin scripts showing greater specialization. Related
languages share overlapping neurons, reflecting internal representations of
linguistic proximity.
  Through language arithmetics, i.e. systematic activation addition and
multiplication, we steer models to deactivate unwanted languages and activate
desired ones, outperforming simpler replacement approaches. These interventions
effectively guide behavior across five multilingual tasks: language forcing,
translation, QA, comprehension, and NLI. Manipulation is more successful for
high-resource languages, while typological similarity improves effectiveness.
We also demonstrate that cross-lingual neuron steering enhances downstream
performance and reveal internal "fallback" mechanisms for language selection
when neurons are progressively deactivated. Our code is made publicly available
at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
                                    
                                        
                                            comment: accepted to AACL main
                                        
                                ♻ ☆ Where to Search: Measure the Prior-Structured Search Space of LLM Agents
                                          The generate-filter-refine (iterative paradigm) based on large language
models (LLMs) has achieved progress in reasoning, programming, and program
discovery in AI+Science. However, the effectiveness of search depends on where
to search, namely, how to encode the domain prior into an operationally
structured hypothesis space. To this end, this paper proposes a compact formal
theory that describes and measures LLM-assisted iterative search guided by
domain priors. We represent an agent as a fuzzy relation operator on inputs and
outputs to capture feasible transitions; the agent is thereby constrained by a
fixed safety envelope. To describe multi-step reasoning/search, we weight all
reachable paths by a single continuation parameter and sum them to obtain a
coverage generating function; this induces a measure of reachability
difficulty; and it provides a geometric interpretation of search on the graph
induced by the safety envelope. We further provide the simplest testable
inferences and validate them via two instantiation. This theory offers a
workable language and operational tools to measure agents and their search
spaces, proposing a systematic formal description of iterative search
constructed by LLMs.
                                    
                                        
                                            comment: 11 pages, 4 figures, 1 table
                                        
                                ♻ ☆ Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps EMNLP 2025
                                          When prompted to think step-by-step, language models (LMs) produce a chain of
thought (CoT), a sequence of reasoning steps that the model supposedly used to
produce its prediction. Despite much work on CoT prompting, it is unclear if
reasoning verbalized in a CoT is faithful to the models' parametric beliefs. We
introduce a framework for measuring parametric faithfulness of generated
reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an
instance of this framework. FUR erases information contained in reasoning steps
from model parameters, and measures faithfulness as the resulting effect on the
model's prediction. Our experiments with four LMs and five multi-hop
multi-choice question answering (MCQA) datasets show that FUR is frequently
able to precisely change the underlying models' prediction for a given instance
by unlearning key steps, indicating when a CoT is parametrically faithful.
Further analysis shows that CoTs generated by models post-unlearning support
different answers, hinting at a deeper effect of unlearning.
                                    
                                        
                                            comment: Accepted at EMNLP 2025. Under review for outstanding paper award
                                        
                                ♻ ☆ Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? NeurIPS 2025
                                          Large Language Models (LLMs) often produce explanations that do not
faithfully reflect the factors driving their predictions. In healthcare
settings, such unfaithfulness is especially problematic: explanations that omit
salient clinical cues or mask spurious shortcuts can undermine clinician trust
and lead to unsafe decision support. We study how inference and training-time
choices shape explanation faithfulness, focusing on factors practitioners can
control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA
8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions),
and manipulate the number and type of few-shot examples, prompting strategies,
and training procedure. Our results show: (i) both the quantity and quality of
few-shot examples significantly impact model faithfulness; (ii) faithfulness is
sensitive to prompting design; (iii) the instruction-tuning phase improves
measured faithfulness on MedQA. These findings offer insights into strategies
for enhancing the interpretability and trustworthiness of LLMs in sensitive
domains.
                                    
                                        
                                            comment: 39th Conference on Neural Information Processing Systems (NeurIPS
  2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM
  Lifecycle: Benchmarks, Emergent Abilities, and Scaling
                                        
                                ♻ ☆ XIFBench: Evaluating Large Language Models on Multilingual Instruction Following NeurIPS 2025
                                          Large Language Models (LLMs) have demonstrated remarkable
instruction-following capabilities across various applications. However, their
performance in multilingual settings lacks systematic investigation, with
existing evaluations lacking fine-grained constraint analysis across diverse
linguistic contexts. We introduce XIFBench, a comprehensive constraint-based
benchmark for evaluating multilingual instruction-following abilities of LLMs,
comprising 558 instructions with 0-5 additional constraints across five
categories (Content, Style, Situation, Format, and Numerical) in six languages
spanning different resource levels. To support reliable and consistent
cross-lingual evaluation, we implement three methodological innovations:
cultural accessibility annotation, constraint-level translation validation, and
requirement-based evaluation using English requirements as semantic anchors
across languages. Extensive experiments with various LLMs not only quantify
performance disparities across resource levels but also provide detailed
insights into how language resources, constraint categories, instruction
complexity, and cultural specificity influence multilingual
instruction-following. Our code and data are available at
https://github.com/zhenyuli801/XIFBench.
                                    
                                        
                                            comment: Accepted by the NeurIPS 2025 Datasets and Benchmarks Track
                                        
                                ♻ ☆ Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning EMNLP 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Yangning Li, Tingwei Lu, Yinghui Li, Yankai Chen, Wei-Chieh Huang, Wenhao Jiang, Hui Wang, Hai-Tao Zheng, Philip S. Yu
                                    
                                    
                                          Efficient instruction tuning aims to enhance the ultimate performance of
large language models (LLMs) trained on a given instruction dataset. Curriculum
learning as a typical data organization strategy has shown preliminary
effectiveness in instruction tuning. However, current curriculum tuning methods
suffer from the curriculum rigidity, since they rely solely on static heuristic
difficulty metrics. These methods fail to adapt to the evolving capabilities of
models during training, resulting in a fixed and potentially sub-optimal
learning trajectory. To address the issue, Competence-Aware Multi-Perspective
cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS
offers several advantages: (1) Dynamic selection for sub-curriculum. (2)
Competency-aware adjustment to the curriculum schedule. (3) Multiple
difficulty-based scheduling. Extensive experiments prove the superior
performance of CAMPUS, compared to other state-of-the-art baselines for
efficient instruction tuning.
                                    
                                        
                                            comment: EMNLP 2025 Findings
                                        
                                ♻ ☆ UI-Evol: Automatic Knowledge Evolving for Computer Use Agents ICML 2025
                                          External knowledge has played a crucial role in the recent development of
computer use agents. We identify a critical knowledge-execution gap: retrieved
knowledge often fails to translate into effective real-world task execution.
Our analysis shows even 90% correct knowledge yields only 41% execution success
rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for
autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace
Stage that extracts faithful objective action sequences from actual
agent-environment interactions, and a Critique Stage that refines existing
knowledge by comparing these sequences against external references. We conduct
comprehensive experiments on the OSWorld benchmark with the state-of-the-art
Agent S2. Our results demonstrate that UI-Evol not only significantly boosts
task performance but also addresses a previously overlooked issue of high
behavioral standard deviation in computer use agents, leading to superior
performance on computer use tasks and substantially improved agent reliability.
                                    
                                        
                                            comment: Accepted to ICML 2025 Workshop on Computer Use Agents
                                        
                                ♻ ☆ Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding NeurIPS 2025
                                          Long-form video understanding presents significant challenges due to
extensive temporal-spatial complexity and the difficulty of question answering
under such extended contexts. While Large Language Models (LLMs) have
demonstrated considerable advancements in video analysis capabilities and long
context handling, they continue to exhibit limitations when processing
information-dense hour-long videos. To overcome such limitations, we propose
the Deep Video Discovery (DVD) agent to leverage an agentic search strategy
over segmented video clips. Unlike previous video agents that rely on
predefined workflows applied uniformly across different queries, our approach
emphasizes the autonomous and adaptive nature of agents. By providing a set of
search-centric tools on multi-granular video database, our DVD agent leverages
the advanced reasoning capability of LLM to plan on its current observation
state, strategically selects tools to orchestrate adaptive workflow for
different queries in light of the gathered information. We perform
comprehensive evaluation on multiple long video understanding benchmarks that
demonstrates our advantage. Our DVD agent achieves state-of-the-art performance
on the challenging LVBench dataset, reaching an accuracy of 74.2%, which
substantially surpasses all prior works, and further improves to 76.0% with
transcripts. The code has been released at
https://github.com/microsoft/DeepVideoDiscovery.
                                    
                                        
                                            comment: Accepted to NeurIPS 2025
                                        
                                ♻ ☆ MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
                                        
                                            
                                        
                                        
                                            
                                        
                                        Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li
                                    
                                    
                                          LLMs hold great promise for healthcare applications, but the rapid evolution
of medical knowledge and errors in training data often cause them to generate
outdated or inaccurate information, limiting their applicability in high-stakes
clinical practice. Model editing has emerged as a potential remedy without full
retraining. While parameter-based editing often compromises locality and is
thus ill-suited for the medical domain, retrieval-based editing offers a more
viable alternative. However, it still faces two critical challenges: (1)
representation overlap within the medical knowledge space often causes
inaccurate retrieval and reduces editing accuracy; (2) existing methods are
restricted to single-sample edits, while batch-editing remains largely
unexplored despite its importance for real-world medical applications. To
address these challenges, we first construct MedVersa, an enhanced benchmark
with broader coverage of medical subjects, designed to evaluate both single and
batch edits under strict locality constraints. We then propose MedREK, a
retrieval-based editing framework that integrates a shared query-key module for
precise matching with an attention-based prompt encoder for informative
guidance. Experimental results on various medical benchmarks demonstrate that
our MedREK achieves superior performance across different core metrics and
provides the first validated solution for batch-editing in medical LLMs. Our
code and dataset are available at https://github.com/mylittleriver/MedREK.
                                    
                                        
                                            comment: Preprint, work in progress
                                        
                                ♻ ☆ Evaluating Perspectival Biases in Cross-Modal Retrieval
                                        
                                            
                                        
                                        
                                            
                                        
                                        Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant, Witthawin Sripheanpol, Pak Lovichit, Sarana Nutanong, Ekapol Chuangsuwanich
                                    
                                    
                                          Multimodal retrieval systems are expected to operate in a semantic space,
agnostic to the language or cultural origin of the query. In practice, however,
retrieval outcomes systematically reflect perspectival biases: deviations
shaped by linguistic prevalence and cultural associations. We study two such
biases. First, prevalence bias refers to the tendency to favor entries from
prevalent languages over semantically faithful entries in image-to-text
retrieval. Second, association bias refers to the tendency to favor images
culturally associated with the query over semantically correct ones in
text-to-image retrieval. Results show that explicit alignment is a more
effective strategy for mitigating prevalence bias. However, association bias
remains a distinct and more challenging problem. These findings suggest that
achieving truly equitable multimodal systems requires targeted strategies
beyond simple data scaling and that bias arising from cultural association may
be treated as a more challenging problem than one arising from linguistic
prevalence.
                                    
                                ♻ ☆ Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
                                          The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's
o1 and DeepSeek-R1, have seen substantial advancements through deep thinking.
However, these enhancements come with significant resource demands,
underscoring the need for training effective small reasoning models. A critical
challenge is that small models possess different reasoning capacities and
cognitive trajectories compared with their larger counterparts. Hence, directly
distilling chain-of-thought (CoT) rationales from large LRMs to smaller ones
can sometimes be ineffective and often requires a substantial amount of
annotated data. In this paper, we first introduce a novel
Critique-Rethink-Verify (CRV) system, designed for training smaller yet
powerful LRMs. Our CRV system consists of multiple LLM agents, each
specializing in unique tasks: (i) critiquing the CoT rationales according to
the cognitive capabilities of smaller models, (ii) rethinking and refining
these CoTs based on the critiques, and (iii) verifying the correctness of the
refined results. Building on the CRV system, we further propose the Cognitive
Preference Optimization (CogPO) algorithm to continuously enhance the reasoning
abilities of smaller models by aligning their reasoning processes with their
cognitive capacities. Comprehensive evaluations on challenging reasoning
benchmarks demonstrate the efficacy of our CRV+CogPO framework, which
outperforms other methods by a large margin.
                                    
                                        
                                            comment: emnlp 2025 main conference
                                        
                                ♻ ☆ Advancing Expert Specialization for Better MoE
                                        
                                            
                                        
                                        
                                            
                                        
                                        Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang
                                    
                                    
                                          Mixture-of-Experts (MoE) models enable efficient scaling of large language
models (LLMs) by activating only a subset of experts per input. However, we
observe that the commonly used auxiliary load balancing loss often leads to
expert overlap and overly uniform routing, which hinders expert specialization
and degrades overall performance during post-training. To address this, we
propose a simple yet effective solution that introduces two complementary
objectives: (1) an orthogonality loss to encourage experts to process distinct
types of tokens, and (2) a variance loss to encourage more discriminative
routing decisions. Gradient-level analysis demonstrates that these objectives
are compatible with the existing auxiliary loss and contribute to optimizing
the training process. Experimental results over various model architectures and
across multiple benchmarks show that our method significantly enhances expert
specialization. Notably, our method improves classic MoE baselines with
auxiliary loss by up to 23.79%, while also maintaining load balancing in
downstream tasks, without any architectural modifications or additional
components. We will release our code to contribute to the community.
                                    
                                        
                                            comment: 33pages, 6figures(Accepted by Neurips 2025 Oral)
                                        
                                ♻ ☆ Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
                                          Flight delay prediction has become a key focus in air traffic management, as
delays highlight inefficiencies that impact overall network performance. This
paper presents a lightweight large language model-based multimodal flight delay
prediction, formulated from the perspective of air traffic controllers
monitoring aircraft delay after entering the terminal area. The approach
integrates trajectory representations with textual aeronautical information,
including flight information, weather reports, and aerodrome notices, by
adapting trajectory data into the language modality to capture airspace
conditions. The experiments show that the model consistently achieves
sub-minute prediction error by effectively leveraging contextual information
related to the sources of delay, fulfilling the operational standard for
minute-level precision. The framework demonstrates that linguistic
understanding, when combined with cross-modality adaptation of trajectory data,
enhances delay prediction. Moreover, the approach shows practicality and
potential scalability for real-world operations, supporting real-time updates
that refine predictions upon receiving new operational information.
                                    
                                        
                                            comment: Preprint submitted to Aerospace Science and Technology (Elsevier) for
  possible publication
                                        
                                ♻ ★ Scaling Latent Reasoning via Looped Language Models
                                        
                                            
                                        
                                        
                                            
                                        
                                        Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian
                                    
                                    
                                          Modern LLMs are trained to "think" primarily via explicit text generation,
such as chain-of-thought (CoT), which defers reasoning to post-training and
under-leverages pre-training data. We present and open-source Ouro, named after
the recursive Ouroboros, a family of pre-trained Looped Language Models
(LoopLM) that instead build reasoning into the pre-training phase through (i)
iterative computation in latent space, (ii) an entropy-regularized objective
for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and
2.6B models enjoy superior performance that match the results of up to 12B SOTA
LLMs across a wide range of benchmarks. Through controlled experiments, we show
this advantage stems not from increased knowledge capacity, but from superior
knowledge manipulation capabilities. We also show that LoopLM yields reasoning
traces more aligned with final outputs than explicit CoT. We hope our results
show the potential of LoopLM as a novel scaling direction in the reasoning era.
Our model is available here: http://ouro-llm.github.io.
                                    
                                ♻ ☆ Trustworthy Medical Question Answering: An Evaluation-Centric Survey EMNLP 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Yinuo Wang, Baiyang Wang, Robert E. Mercer, Frank Rudzicz, Sudipta Singha Roy, Pengjie Ren, Zhumin Chen, Xindi Wang
                                    
                                    
                                          Trustworthiness in healthcare question-answering (QA) systems is important
for ensuring patient safety, clinical effectiveness, and user confidence. As
large language models (LLMs) become increasingly integrated into medical
settings, the reliability of their responses directly influences clinical
decision-making and patient outcomes. However, achieving comprehensive
trustworthiness in medical QA poses significant challenges due to the inherent
complexity of healthcare data, the critical nature of clinical scenarios, and
the multifaceted dimensions of trustworthy AI. In this survey, we
systematically examine six key dimensions of trustworthiness in medical QA,
i.e., Factuality, Robustness, Fairness, Safety, Explainability, and
Calibration. We review how each dimension is evaluated in existing LLM-based
medical QA systems. We compile and compare major benchmarks designed to assess
these dimensions and analyze evaluation-guided techniques that drive model
improvements, such as retrieval-augmented grounding, adversarial fine-tuning,
and safety alignment. Finally, we identify open challenges-such as scalable
expert evaluation, integrated multi-dimensional metrics, and real-world
deployment studies-and propose future research directions to advance the safe,
reliable, and transparent deployment of LLM-powered medical QA.
                                    
                                        
                                            comment: accepted to EMNLP 2025
                                        
                                ♻ ☆ Complex QA and language models hybrid architectures, Survey
                                          This paper reviews the state-of-the-art of large language models (LLM)
architectures and strategies for "complex" question-answering with a focus on
hybrid architectures. LLM based chatbot services have allowed anyone to grasp
the potential of LLM to solve many common problems, but soon discovered their
limitations for complex questions. Addressing more specific, complex questions
(e.g., "What is the best mix of power-generation methods to reduce climate
change ?") often requires specialized architectures, domain knowledge, new
skills, decomposition and multi-step resolution, deep reasoning, sensitive data
protection, explainability, and human-in-the-loop processes. Therefore, we
review: (1) necessary skills and tasks for handling complex questions and
common LLM limits to overcome; (2) dataset, cost functions and evaluation
metrics for measuring and improving (e.g. accuracy, explainability, fairness,
robustness, groundedness, faithfulness, toxicity...); (3) family of solutions
to overcome LLM limitations by (a) training and reinforcement (b)
hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and
extended reasoning.
                                    
                                ♻ ☆ Enhancing Time Awareness in Generative Recommendation EMNLP 2025
                                          Generative recommendation has emerged as a promising paradigm that formulates
the recommendations into a text-to-text generation task, harnessing the vast
knowledge of large language models. However, existing studies focus on
considering the sequential order of items and neglect to handle the temporal
dynamics across items, which can imply evolving user preferences. To address
this limitation, we propose a novel model, Generative Recommender Using Time
awareness (GRUT), effectively capturing hidden user preferences via various
temporal signals. We first introduce Time-aware Prompting, consisting of two
key contexts. The user-level temporal context models personalized temporal
patterns across timestamps and time intervals, while the item-level transition
context provides transition patterns across users. We also devise Trend-aware
Inference, a training-free method that enhances rankings by incorporating trend
information about items with generation likelihood. Extensive experiments
demonstrate that GRUT outperforms state-of-the-art models, with gains of up to
15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The
source code is available at https://github.com/skleee/GRUT.
                                    
                                        
                                            comment: EMNLP 2025 (Findings)
                                        
                                ♻ ☆ Mapping Overlaps in Benchmarks through Perplexity in the Wild
                                          We develop signatures of capacity familiarity to characterize large language
model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures
probe the capacity required for benchmark performance. We formally define them
as a set of salient tokens drawn from in-the-wild, naturally authored corpora,
where LLM token perplexity, reflecting more or less pre-training exposure,
becomes highly predictive of LLM benchmark performance. Through a large-scale
meta-evaluation, we extract benchmark signatures via stepwise forward selection
with linear regressions across 32 LLMs and 88 benchmarks spanning diverse
knowledge, coding, logic, instruction following, math, language, reasoning, and
world modeling. Our analysis situates signatures in relation to both the
semantic similarity of benchmark questions and the correlation of model
performance. While performance overlaps are universally high and semantic
overlaps remain confined to a narrow mid-range, benchmark signatures prove
highly informative in capturing variation, overlap, and divergence. We observe
overlap in knowledge and reasoning subtasks, whereas multilingual and cultural
benchmarks exhibit less similarity, even compared to cross-task overlap.
Notably, performance-level results are strongly influenced by
benchmark-orthogonal factors such as question format, highlighting limitations
in LLM generalization, the conflation of performance with ability, and issues
inherent in current mainstream benchmark agreement studies. Benchmark
signatures, however, remain robust to such effects. Ultimately, we identify
cross-functional overlaps across logic, math, language, instruction following,
and world modeling, with coding emerging as the least overlapping domain.
Together, these findings provide mechanistic insights into benchmark validity
and LLM sensitivities, and sketch the underlying landscape of interconnected
LLM capabilities.
                                    
                                ♻ ☆ MotionGPT3: Human Motion as a Second Modality
                                          With the rapid progress of large language models (LLMs), multimodal
frameworks that unify understanding and generation have become promising, yet
they face increasing complexity as the number of modalities and tasks grows. We
observe that motion quantization introduces approximation errors that cap
motion quality, and that unifying discrete text and continuous motion within a
single-stream backbone amplifies cross-modal interference. Motivated by recent
multi-branch Transformer designs that separate signals from different
modalities, we propose MotionGPT3, a bimodal motion-language model for both
understanding and generation. MotionGPT3 encodes raw motion into a continuous
latent space using a variational autoencoder (VAE), thereby avoiding
quantization-induced artifacts, while leveraging the semantic prior of
pretrained language models. A dual-stream Transformer with shared attention
preserves modality-specific routes while enabling controlled, bidirectional
information flow, which reduces interference, stabilizing optimization, and
empirically accelerates convergence without degrading fidelity. For multimodal
joint training, a generate-then-align three-stage schedule further improves
stability and limits cross-task interference. Experiments show that MotionGPT3
achieves 2x faster convergence in training loss and up to 4x faster convergence
in validation, while maintaining state-of-the-art performance on standard
motion understanding and motion generation benchmarks.
                                    
                                        
                                            comment: 26 pages, 11 figures
                                        
                                ♻ ☆ Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models
                                          This paper proposes a modeling framework for dynamic topic evolution based on
temporal large language models. The method first uses a large language model to
obtain contextual embeddings of text and then introduces a temporal decay
function and an attention mechanism. These components allow the model to adjust
the importance of semantic units according to time intervals and capture topic
variations across different periods. The temporal representations are then
mapped into a latent topic space, where a state transition matrix is applied to
describe the dynamic evolution of topics. A joint optimization objective
constrains both semantic modeling and temporal consistency, ensuring diversity
and smoothness in topic generation. The design emphasizes the unified modeling
of semantic representation and temporal evolution, which improves topic
coherence and diversity while enhancing stability and interpretability over
time. Experiments on real-world corpora show that the framework effectively
captures the generation, expansion, and decline of topics and outperforms
existing models across multiple metrics. Overall, the proposed method provides
a systematic solution for understanding dynamic semantic patterns in
large-scale text, enriches the research paradigm of topic modeling, and
supports complex text analysis tasks in multiple domains.
                                    
                                ♻ ☆ MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation EMNLP 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
                                    
                                    
                                          Multilingual speech translation (ST) and machine translation (MT) in the
medical domain enhances patient care by enabling efficient communication across
language barriers, alleviating specialized workforce shortages, and
facilitating improved diagnosis and treatment, particularly during pandemics.
In this work, we present the first systematic study on medical ST, to our best
knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical
domain, spanning all translation directions in five languages: Vietnamese,
English, German, French, and Simplified/Traditional Chinese, together with the
models. With 290,000 samples, this is the largest medical MT dataset and the
largest many-to-many multilingual ST among all domains. Secondly, we present
the most comprehensive ST analysis in the field's history, to our best
knowledge, including: empirical baselines, bilingual-multilingual comparative
study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task
sequence-to-sequence comparative study, code-switch analysis, and
quantitative-qualitative error analysis. All code, data, and models are
available online: https://github.com/leduckhai/MultiMed-ST
                                    
                                        
                                            comment: EMNLP 2025
                                        
                                ♻ ☆ SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains EMNLP 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, Anjalie Field
                                    
                                    
                                          We present SynthTextEval, a toolkit for conducting comprehensive evaluations
of synthetic text. The fluency of large language model (LLM) outputs has made
synthetic text potentially viable for numerous applications, such as reducing
the risks of privacy violations in the development and deployment of AI systems
in high-stakes domains. Realizing this potential, however, requires principled
consistent evaluations of synthetic data across multiple dimensions: its
utility in downstream systems, the fairness of these systems, the risk of
privacy leakage, general distributional differences from the source text, and
qualitative feedback from domain experts. SynthTextEval allows users to conduct
evaluations along all of these dimensions over synthetic data that they upload
or generate using the toolkit's generation module. While our toolkit can be run
over any data, we highlight its functionality and effectiveness over datasets
from two high-stakes domains: healthcare and law. By consolidating and
standardizing evaluation metrics, we aim to improve the viability of synthetic
text, and in-turn, privacy-preservation in AI development.
                                    
                                        
                                            comment: EMNLP 2025 System Demonstration
                                        
                                ♻ ★ Training Large Language Models to Reason in a Continuous Latent Space
                                          Large language models (LLMs) are typically constrained to reason in the
language space, where they express the reasoning process through a
chain-of-thought (CoT) to solve complex problems. However, the language space
may not always be optimal for reasoning. Most word tokens primarily ensure
textual coherence and are not essential for reasoning, while some critical
tokens require complex planning and pose challenges to LLMs. To explore the
potential of reasoning beyond language, we introduce a new paradigm called
Coconut (Chain of Continuous Thought). Coconut utilizes the last hidden state
of the LLM as a representation of the reasoning state, termed "continuous
thought." Instead of decoding this state into words, we feed it back to the
model as the next input embedding directly in the continuous space. This latent
reasoning paradigm enables an advanced reasoning pattern, where continuous
thoughts can encode multiple alternative next steps, allowing the model to
perform a breadth-first search (BFS) rather than committing prematurely to a
single deterministic path as in CoT. Coconut outperforms CoT on logical
reasoning tasks that require substantial search during planning and achieves a
better trade-off between accuracy and efficiency.
                                    
                                        
                                            comment: Accepted to COLM 2025