Computation and Language
♻ ★ NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models
Cognitive textual and visual reasoning tasks, including puzzles, series, and
analogies, demand the ability to quickly reason, decipher, and evaluate
patterns both textually and spatially. Due to extensive training on vast
amounts of human-curated data, LLMs and VLMs excel in common-sense reasoning
tasks; however, they still struggle with more complex reasoning that demands deeper
cognitive understanding. We introduce NTSEBench, a new dataset designed to
evaluate cognitive multi-modal reasoning and problem-solving skills of large
models. The dataset contains 2,728 multiple-choice questions, accompanied by a
total of 4,642 images, categorized into 26 different types. These questions are
drawn from the nationwide NTSE examination in India and feature a mix of visual
and textual general aptitude challenges, designed to assess intelligence and
critical thinking skills beyond mere rote learning. We establish baselines on
the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison
between open-source and proprietary models, we propose four distinct modeling
strategies to handle different modalities -- text and images -- in the dataset
instances.
comment: 28 pages, 3 figures, 12 tables
♻ ☆ STORYSUMM: Evaluating Faithfulness in Story Summarization EMNLP
Human evaluation has been the gold standard for checking faithfulness in
abstractive summarization. However, with a challenging source domain like
narrative, multiple annotators can agree a summary is faithful, while missing
details that are obvious errors only once pointed out. We therefore introduce a
new dataset, STORYSUMM, comprising LLM summaries of short stories with
localized faithfulness labels and error explanations. This benchmark is for
evaluation methods, testing whether a given method can detect challenging
inconsistencies. Using this dataset, we first show that any one human
annotation protocol is likely to miss inconsistencies, and we advocate for
pursuing a range of methods when establishing ground truth for a summarization
dataset. We finally test recent automatic metrics and find that none of them
achieve more than 70% balanced accuracy on this task, demonstrating that it is
a challenging benchmark for future work in faithfulness evaluation.
comment: EMNLP Main 2024
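For readers unfamiliar with the metric quoted in the STORYSUMM abstract, the following is a minimal sketch of balanced accuracy for binary faithfulness labels. The function name and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
def balanced_accuracy(labels, predictions):
    """Mean of per-class recall for binary labels (True = faithful)."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    true_neg = sum(1 for y, p in pairs if not y and not p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Hypothetical toy data: four faithful summaries and one unfaithful one.
labels = [True, True, True, True, False]
always_faithful = [True] * len(labels)

# A detector that never flags an error has 80% plain accuracy here,
# but only chance-level balanced accuracy.
print(balanced_accuracy(labels, always_faithful))  # 0.5
```

Because each class contributes equally, a method that simply predicts the majority label scores 50%, which puts the 70% ceiling reported above in context.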
♻ ☆ LLM-Human Pipeline for Cultural Context Grounding of Conversations NAACL 2025
Conversations often adhere to well-understood social norms that vary across
cultures. For example, while "addressing parents by name" is commonplace in the
West, it is rare in most Asian cultures. Adherence to or violation of such norms
often dictates the tenor of conversations. Humans are able to navigate social
situations requiring cultural awareness quite adeptly. However, it is a hard
task for NLP models.
In this paper, we tackle this problem by introducing a "Cultural Context
Schema" for conversations. It comprises (1) conversational information such as
emotions, dialogue acts, etc., and (2) cultural information such as social
norms, violations, etc. We generate ~110k social norm and violation
descriptions for ~23k conversations from Chinese culture using LLMs. We refine
them using automated verification strategies which are evaluated against
culturally aware human judgements. We organize these descriptions into
meaningful structures we call "Norm Concepts", using an interactive
human-in-loop framework. We ground the norm concepts and the descriptions in
conversations using symbolic annotation. Finally, we use the obtained dataset
for downstream tasks such as emotion, sentiment, and dialogue act detection. We
show that it significantly improves the empirical performance.
comment: Oral at NAACL 2025 Main conference. Albuquerque, USA. Apr 29 - May 4,
2025. 19 pages, 9 figures, 7 tables
♻ ☆ TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
In this paper, we propose Text-based Open Molecule Generation Benchmark
(TOMG-Bench), the first benchmark to evaluate the open-domain molecule
generation capability of LLMs. TOMG-Bench encompasses a dataset of three major
tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and
customized molecule generation (MolCustom). Each major task further contains
three subtasks, while each subtask comprises 5,000 test samples. Given the
inherent complexity of open molecule generation evaluation, we also developed
an automated evaluation system that helps measure both the quality and the
accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs
reveals the current limitations as well as potential areas for improvement in
text-guided molecule discovery. Furthermore, we propose OpenMolIns, a
specialized instruction tuning dataset established for solving challenges
raised by TOMG-Bench. Fine-tuned on OpenMolIns, Llama3.1-8B could outperform
all the open-source general LLMs, even surpassing GPT-3.5-turbo by 46.5% on
TOMG-Bench. Our codes and datasets are available through
https://github.com/phenixace/TOMG-Bench.
comment: The first benchmark for text-based open molecule generation
♻ ☆ Large Language Models are In-Context Molecule Learners
Large Language Models (LLMs) have demonstrated exceptional performance in
biochemical tasks, especially the molecule-caption translation task, which aims
to bridge the gap between molecules and natural language texts. However,
previous methods for adapting LLMs to the molecule-caption translation task
required extra domain-specific pre-training stages, suffered from weak alignment
between molecular and textual spaces, or imposed stringent demands on the scale
of LLMs. To resolve the challenges, we propose In-Context Molecule Adaptation
(ICMA), as a new paradigm allowing LLMs to learn the molecule-text alignment
from context examples via In-Context Molecule Tuning. Specifically, ICMA
incorporates the following three stages: Hybrid Context Retrieval,
Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid
Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval
to retrieve similar informative context examples. Additionally, Post-retrieval
Re-ranking is composed of Sequence Reversal and Random Walk selection to
further improve the quality of retrieval results. Finally, In-Context Molecule
Tuning unlocks the in-context learning and reasoning capability of LLMs with
the retrieved examples and adapts the parameters of LLMs for better alignment
between molecules and texts. Experimental results demonstrate that ICMA can
empower LLMs to achieve state-of-the-art or comparable performance without
extra training corpora and intricate structures, showing that LLMs are
inherently in-context molecule learners.
comment: Accepted by IEEE TKDE
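The caption-retrieval step named in the ICMA abstract relies on BM25. The sketch below is a plain Okapi BM25 scorer over a toy caption pool. The whitespace tokenization, the parameter defaults (k1 = 1.5, b = 0.75), and the example captions are assumptions made for illustration. The paper's own retrieval code may differ.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


def retrieve(query, corpus, k=1):
    """Return the k highest-scoring documents as context examples."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:k]]


# Hypothetical caption pool.
captions = [
    "the molecule is an aromatic carboxylic acid",
    "the molecule is a saturated fatty alcohol",
    "the molecule is a cyclic ether",
]
print(retrieve("aromatic acid", captions, k=1))
# ['the molecule is an aromatic carboxylic acid']
```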
♻ ☆ Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models ICLR 2025
Federated prompt learning brings the robust representation learning ability of
CLIP-like Vision-Language Models (VLMs) to federated learning through
prompt learning. However, current federated prompt learning methods are
habitually restricted to the traditional FL paradigm, where the participating
clients are generally only allowed to download a single globally aggregated
model from the server. While justifiable for training full-sized models under
federated settings, in this work, we argue that this paradigm is ill-suited for
lightweight prompts. By allowing the clients to download multiple
pre-aggregated prompts as fixed non-local experts, we propose Personalized
Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that
personalizes the prompt learning process through the lens of Mixture of Experts
(MoE). pFedMoAP implements a local attention-based gating network that learns
to generate enhanced text features for better alignment with local image data,
benefiting from both local and downloaded non-local adaptive prompt experts.
Extensive experiments on 9 datasets under various federated settings
demonstrate the efficacy of the proposed pFedMoAP algorithm. The code is
available at https://github.com/ljaiverson/pFedMoAP.
comment: ICLR 2025
♻ ☆ Krutrim LLM: A Novel Tokenization Strategy for Multilingual Indic Languages with Petabyte-Scale Data Processing
Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumarr, Gautam Bhargava, Chandra Khatri
We present a novel approach to data preparation for developing a multilingual
Indic large language model. Our meticulous data acquisition spans open-source
and proprietary sources, including Common Crawl, Indic books, news articles,
and Wikipedia, ensuring a diverse and rich linguistic representation. For each
Indic language, we design a custom preprocessing pipeline to effectively
eliminate redundant and low-quality text content. Additionally, we perform
deduplication on Common Crawl data to address the redundancy present in 70% of
the crawled web pages. This study focuses on developing high-quality data,
optimizing tokenization for our multilingual dataset for Indic large language
models with 3B and 7B parameters, engineered for superior performance in Indic
languages. We introduce a novel multilingual tokenizer training strategy,
demonstrating our custom-trained Indic tokenizer outperforms the
state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word
ratio for Indic languages.
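The token-to-word ratio used above to compare tokenizers can be computed as in the following sketch. Both tokenizers here are hypothetical stand-ins (whitespace splitting and fixed-size chunking), not the Krutrim tokenizer or Tiktoken. A lower ratio means fewer tokens are spent per word.

```python
def token_to_word_ratio(texts, tokenize):
    """Average number of tokens a tokenizer emits per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def whitespace_tokenizer(text):
    """Hypothetical best case: one token per word."""
    return text.split()


def chunk_tokenizer(text, size=3):
    """Hypothetical worse case: every word is cut into fixed-size pieces."""
    return [word[i:i + size] for word in text.split() for i in range(0, len(word), size)]


sample = ["abcdef gh"]
print(token_to_word_ratio(sample, whitespace_tokenizer))  # 1.0
print(token_to_word_ratio(sample, chunk_tokenizer))       # 1.5
```

In practice the same function would be called with a real tokenizer's encode method on held-out text for each Indic language.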
♻ ☆ Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
This paper introduces Light-R1, an open-source suite for training long
reasoning models using reproducible and cost-effective methodology. Given the
proprietary nature of data used in the DeepSeek-R1 series, we develop an
alternative approach leveraging exclusively public data and models. Our
curriculum training progressively increases data difficulty, combined with
multi-staged post-training. Our Light-R1-32B model, trained from
Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math
reasoning.
Experimental results show that this curriculum approach becomes more
effective when distinct, diverse datasets are available for different training
stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on
proprietary data) with 3,000 challenging examples from our curriculum dataset
yielded state-of-the-art 7B and 14B models, while the 32B model,
Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1.
Furthermore, we extend our work by applying GRPO on long reasoning models.
Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math,
with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B
models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training,
Light-R1-14B-DS demonstrates strong cross-domain generalization.
Light-R1 represents a significant advancement in making sophisticated
reasoning models more accessible and implementable in real-world applications.
Our models, training data and code have been made available at
https://github.com/Qihoo360/Light-R1.
comment: v3: minor modifications; v2: better writing & format for later
submission; all release at https://github.com/Qihoo360/Light-R1
♻ ☆ LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
Building safe Large Language Models (LLMs) across multiple languages is
essential in ensuring both safe access and linguistic diversity. To this end,
we introduce M-ALERT, a multilingual benchmark that evaluates the safety of
LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT
includes 15k high-quality prompts per language, totaling 75k, following the
detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs
highlight the importance of language-specific safety analysis, revealing that
models often exhibit significant inconsistencies in safety across languages and
categories. For instance, Llama3.2 shows high unsafety in the category
crime_tax for Italian but remains safe in other languages. Similar differences
can be observed across all models. In contrast, certain categories, such as
substance_cannabis and crime_propaganda, consistently trigger unsafe responses
across models and languages. These findings underscore the need for robust
multilingual safety practices in LLMs to ensure safe and responsible usage
across diverse user communities.
♻ ☆ Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
Reinforcement learning with verifiable rewards (RLVR) has demonstrated
significant success in enhancing mathematical reasoning and coding performance
of large language models (LLMs), especially when structured reference answers
are accessible for verification. However, its extension to broader, less
structured domains remains unexplored. In this work, we investigate the
effectiveness and scalability of RLVR across diverse real-world domains
including medicine, chemistry, psychology, economics, and education, where
structured reference answers are typically unavailable. We reveal that binary
verification judgments on broad-domain tasks exhibit high consistency across
various LLMs provided expert-written reference answers exist. Motivated by this
finding, we utilize a generative scoring technique that yields soft,
model-based reward signals to overcome limitations posed by binary
verifications, especially in free-form, unstructured answer scenarios. We
further demonstrate the feasibility of training cross-domain generative reward
models using relatively small (7B) LLMs without the need for extensive
domain-specific annotation. Through comprehensive experiments, our RLVR
framework establishes clear performance gains, significantly outperforming
state-of-the-art open-source aligned models such as Qwen2.5-72B and
DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach
notably enhances the robustness, flexibility, and scalability of RLVR,
representing a substantial step towards practical reinforcement learning
applications in complex, noisy-label scenarios.
♻ ☆ TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
The detection of telecom fraud faces significant challenges due to the lack
of high-quality multimodal training data that integrates audio signals with
reasoning-oriented textual analysis. To address this gap, we present
TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset
specifically designed for automated telecom fraud analysis. Our dataset is
constructed through three strategies: (1) Privacy-preserved text-truth sample
generation using automatic speech recognition (ASR)-transcribed call
recordings (with anonymized original audio), ensuring real-world consistency
through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via
large language model (LLM)-based self-instruction sampling on authentic ASR
outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that
simulates emerging fraud tactics through predefined communication scenarios and
fraud typologies. The generated dataset contains 28,511 rigorously processed
speech-text pairs, complete with detailed annotations for fraud reasoning. The
dataset is divided into three tasks: scenario classification, fraud detection,
and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a
standardized evaluation benchmark comprising proportionally sampled instances
from the dataset, to facilitate systematic testing of model performance on
telecom fraud detection tasks. We also contribute a production-optimized
supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while
open-sourcing the data processing framework to enable community-driven dataset
expansion. This work establishes a foundational framework for multimodal
anti-fraud research while addressing critical challenges in data privacy and
scenario diversity. The project will be released at
https://github.com/JimmyMa99/TeleAntiFraud.
♻ ☆ Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs ICLR 2025
Large Language Models (LLMs) have demonstrated strong reasoning and
memorization capabilities via pretraining on massive textual corpora. However,
this poses a risk of privacy and copyright violations, highlighting the need for
efficient machine unlearning methods that remove sensitive data without
retraining from scratch. While Gradient Ascent (GA) is commonly used to unlearn
by reducing the likelihood of generating unwanted content, it leads to unstable
optimization and catastrophic forgetting of retained knowledge. We find that
combining GA with low-rank adaptation results in poor trade-offs between
computational cost and generative performance. To address these challenges, we
propose Low-rank Knowledge Unlearning (LoKU), a novel framework that enables
robust and efficient unlearning for LLMs. First, we introduce Inverted Hinge
Loss, which suppresses unwanted tokens while maintaining fluency by boosting
the probability of the next most likely token. Second, we develop a
data-adaptive initialization for LoRA adapters via low-rank approximation
weighted with relative Fisher information, thereby focusing updates on
parameters critical for removing targeted knowledge. Experiments on the
Training Data Extraction Challenge dataset using GPT-Neo models as well as on
the TOFU benchmark with Phi-1.5B and Llama2-7B models demonstrate that our
approach effectively removes sensitive information while maintaining reasoning
and generative capabilities with minimal impact. Our implementation can be
found at https://github.com/csm9493/efficient-llm-unlearning.
comment: ICLR 2025 camera-ready version
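The abstract describes the Inverted Hinge Loss only in words: suppress the unwanted token while boosting the next most likely one. One form consistent with that description is 1 + p(target) - max over other tokens of p(token), sketched below for a single position. This exact formula and the function names are assumptions. The paper's definition may differ in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverted_hinge_loss(logits, target):
    """Assumed form: 1 + p(target) - max_{v != target} p(v).

    Minimizing it lowers the probability of the unwanted target token and
    raises the probability of the most likely alternative token.
    """
    probs = softmax(logits)
    runner_up = max(p for i, p in enumerate(probs) if i != target)
    return 1.0 + probs[target] - runner_up


# Unwanted token (index 0) currently dominates: loss is above 1.
print(inverted_hinge_loss([4.0, 1.0, 0.0], target=0) > 1.0)  # True
# Unwanted token already suppressed: loss is below 1.
print(inverted_hinge_loss([0.0, 4.0, 1.0], target=0) < 1.0)  # True
```

Unlike plain gradient ascent on the log-likelihood, this quantity is bounded (it stays between 0 and 2), which is one plausible reason such a loss could stabilize optimization.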
♻ ☆ Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning
Large Language Models (LLMs) have demonstrated remarkable abilities across
various language tasks, but solving complex reasoning problems remains a
significant challenge. While existing methods, such as Chain-of-Thought (CoT)
and Tree-of-Thought (ToT), enhance reasoning by decomposing problems or
structuring prompts, they typically perform a single pass of reasoning and may
fail to revisit flawed paths, compromising accuracy. To address this
limitation, we propose a novel reasoning framework called Forest-of-Thought
(FoT), which integrates multiple reasoning trees to leverage collective
decision-making for solving complex logical problems. FoT employs sparse
activation strategies to select the most relevant reasoning paths, improving
both efficiency and accuracy. Additionally, we introduce a dynamic
self-correction strategy that enables real-time error correction, along with
consensus-guided decision-making strategies to optimize both correctness and
computational resources. Experimental results demonstrate that the FoT
framework, combined with these strategies, significantly enhances the reasoning
capabilities of LLMs, enabling them to solve complex tasks with greater
precision and efficiency. Code will be available at
https://github.com/iamhankai/Forest-of-Thought.
comment: Preprint
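The idea of combining answers from several reasoning trees can be illustrated with a simple majority vote. The sketch below is a generic voting rule with an agreement threshold, written under the assumption that each tree returns one answer or None. It is not the consensus-guided strategy from the FoT paper, whose details are not given in the abstract.

```python
from collections import Counter


def consensus_answer(tree_answers, min_agreement=0.5):
    """Return the majority answer across trees, or None without a clear majority.

    `tree_answers` holds one final answer per reasoning tree. Trees that
    produced no answer are represented by None and ignored.
    """
    answers = [a for a in tree_answers if a is not None]
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) > min_agreement:
        return answer
    return None


# Hypothetical outputs from four trees; one tree failed to answer.
print(consensus_answer(["24", "24", "18", None]))  # 24
# A tie yields no consensus, so a caller could fall back to another strategy.
print(consensus_answer(["a", "b"]))  # None
```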
♻ ☆ PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection NAACL2025
In-context learning (ICL) enables Large Language Models (LLMs) to perform
tasks using few demonstrations, facilitating task adaptation when labeled
examples are hard to obtain. However, ICL is sensitive to the choice of
demonstrations, and it remains unclear which demonstration attributes enable
in-context generalization. In this work, we conduct a perturbation study of
in-context demonstrations for low-resource Named Entity Detection (NED). Our
surprising finding is that in-context demonstrations with partially correct
annotated entity mentions can be as effective for task transfer as fully
correct demonstrations. Based on our findings, we propose Pseudo-annotated
In-Context Learning (PICLe), a framework for in-context learning with noisy,
pseudo-annotated demonstrations. PICLe leverages LLMs to annotate many
demonstrations in a zero-shot first pass. We then cluster these synthetic
demonstrations, sample specific sets of in-context demonstrations from each
cluster, and predict entity mentions using each set independently. Finally, we
use self-verification to select the final set of entity mentions. We evaluate
PICLe on five biomedical NED datasets and show that, with zero human
annotation, PICLe outperforms ICL in low-resource settings where limited gold
examples can be used as in-context demonstrations.
comment: In Proceedings of NAACL2025
♻ ☆ TWICE: What Advantages Can Low-Resource Domain-Specific Embedding Model Bring? -- A Case Study on Korea Financial Texts ICLR 2025
Domain specificity of embedding models is critical for effective performance.
However, existing benchmarks, such as FinMTEB, are primarily designed for
high-resource languages, leaving low-resource settings, such as Korean,
under-explored. Directly translating established English benchmarks often fails
to capture the linguistic and cultural nuances present in low-resource domains.
In this paper, titled TWICE: What Advantages Can Low-Resource Domain-Specific
Embedding Models Bring? A Case Study on Korea Financial Texts, we introduce
KorFinMTEB, a novel benchmark for the Korean financial domain, specifically
tailored to reflect its unique cultural characteristics in low-resource
languages. Our experimental results reveal that while the models perform
robustly on a translated version of FinMTEB, their performance on KorFinMTEB
uncovers subtle yet critical discrepancies, especially in tasks requiring
deeper semantic understanding, that underscore the limitations of direct
translation. This discrepancy highlights the necessity of benchmarks that
incorporate language-specific idiosyncrasies and cultural nuances. The insights
from our study advocate for the development of domain-specific evaluation
frameworks that can more accurately assess and drive the progress of embedding
models in low-resource settings.
comment: Accepted at FinancialAI@ICLR 2025
♻ ☆ HRET: A Self-Evolving LLM Evaluation Toolkit for Korean
Hanwool Lee, Soo Yong Kim, Dasol Choi, SangWon Baek, Seunghyeok Hong, Ilgyun Jeong, Inseon Hwang, Naeun Lee, Guijin Son
Recent advancements in Korean large language models (LLMs) have spurred
numerous benchmarks and evaluation methodologies, yet the lack of a
standardized evaluation framework has led to inconsistent results and limited
comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an
open-source, self-evolving evaluation framework tailored specifically for
Korean LLMs. HRET unifies diverse evaluation methods, including logit-based
scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge
assessments. Its modular, registry-based architecture integrates major
benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends
(vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for
continuous evolution, HRET provides a robust foundation for reproducible, fair,
and transparent Korean NLP research.
♻ ☆ Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation ICASSP 2025
Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
Speech quality assessment typically requires evaluating audio from multiple
aspects, such as mean opinion score (MOS) and speaker similarity (SIM),
which can be challenging to cover using one small model designed for a single
task. In this paper, we propose leveraging recently introduced auditory large
language models (LLMs) for automatic speech quality assessment. By employing
task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B
testing results, which are commonly used for evaluating text-to-speech systems.
Additionally, the finetuned auditory LLM is able to generate natural language
descriptions assessing aspects like noisiness, distortion, discontinuity, and
overall quality, providing more interpretable outputs. Extensive experiments
have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality
datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and
Qwen2-Audio. For the natural language descriptions task, a commercial model
Google Gemini 1.5 Pro is also evaluated. The results demonstrate that auditory
LLMs achieve competitive performance compared to state-of-the-art task-specific
small models in predicting MOS and SIM, while also delivering promising results
in A/B testing and natural language descriptions. Our data processing scripts
and finetuned model checkpoints can be found at
https://github.com/bytedance/SALMONN.
comment: Accepted by ICASSP 2025
♻ ☆ QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions
Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Yu Tsao, Junichi Yamagishi, Yuxuan Wang, Chao Zhang
This paper explores a novel perspective on speech quality assessment by
leveraging natural language descriptions, offering richer, more nuanced
insights than traditional numerical scoring methods. Natural language feedback
provides instructive recommendations and detailed evaluations, yet existing
datasets lack the comprehensive annotations needed for this approach. To bridge
this gap, we introduce QualiSpeech, a comprehensive low-level speech quality
assessment dataset encompassing 11 key aspects and detailed natural language
comments that include reasoning and contextual insights. Additionally, we
propose the QualiSpeech Benchmark to evaluate the low-level speech
understanding capabilities of auditory large language models (LLMs).
Experimental results demonstrate that finetuned auditory LLMs can reliably
generate detailed descriptions of noise and distortion, effectively identifying
their types and temporal characteristics. The results further highlight the
potential for incorporating reasoning to enhance the accuracy and reliability
of quality assessments. The dataset will be released at
https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.
comment: 23 pages, 16 figures
♻ ☆ Sabiá-3 Technical Report
Hugo Abonizio, Thales Sales Almeida, Thiago Laitz, Roseval Malaquias Junior, Giovana Kerche Bonás, Rodrigo Nogueira, Ramon Pires
This report presents Sabiá-3, our new flagship language model, and
Sabiazinho-3, a more cost-effective sibling. The models were trained on a large
Brazil-centric corpus. Evaluations across diverse professional and academic
benchmarks show a strong performance on Portuguese and Brazil-related tasks.
Sabiá-3 shows large improvements in comparison to our previous best model,
Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3's
average performance matches frontier LLMs, while it is offered at a three to
four times lower cost per token, reinforcing the benefits of domain
specialization.
♻ ☆ Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
Raman Dutt, Harleen Hanspal, Guoxuan Xia, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot
In this work, we undertake the challenge of augmenting the existing
generative capabilities of pre-trained text-only large language models (LLMs)
with multi-modal generation capability while satisfying two core constraints:
(C1) preserving the original language generative capabilities
with negligible performance degradation, and (C2) adhering to a small parameter
budget to learn the new modality, ensuring scalability and efficiency. In
contrast to current approaches that add dedicated modules, thereby
significantly increasing the parameter count, we propose a method that
leverages the underutilized capacity inherent in deep models. Specifically, we
exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source
of additional capacity for learning a new modality, enabling better parameter
efficiency (C2). Moreover, we preserve the original language generation
capabilities by applying low-rank adaptation exclusively to the tokens of the
new modality (C1). Furthermore, we introduce a novel parameter initialization
scheme based on the Gromov-Wasserstein distance to improve convergence and
training stability. Through an extensive analysis of the routing mechanism, we
uncover the emergence of modality-specific pathways and decreased redundancy
within the experts that can efficiently unlock multi-modal generative
capabilities. Overall, our method can be seamlessly applied to a wide range of
contemporary LLMs, providing a new pathway for transitioning from uni-modal to
multi-modal architectures.
♻ ★ FsPONER: Few-shot Prompt Optimization for Named Entity Recognition in Domain-specific Scenarios ECAI-2024
Large Language Models (LLMs) have provided a new pathway for Named Entity
Recognition (NER) tasks. Compared with fine-tuning, LLM-powered prompting
methods avoid the need for training, conserve substantial computational
resources, and rely on minimal annotated data. Previous studies have achieved
comparable performance to fully supervised BERT-based fine-tuning approaches on
general NER benchmarks. However, none of the previous approaches has
investigated the efficiency of LLM-based few-shot learning in domain-specific
scenarios. To address this gap, we introduce FsPONER, a novel approach for
optimizing few-shot prompts, and evaluate its performance on domain-specific
NER datasets, with a focus on industrial manufacturing and maintenance, while
using multiple LLMs -- GPT-4-32K, GPT-3.5-Turbo, LLaMA 2-chat, and Vicuna.
FsPONER consists of three few-shot selection methods based on random sampling,
TF-IDF vectors, and a combination of both. We compare these methods with a
general-purpose GPT-NER method as the number of few-shot examples increases and
evaluate their optimal NER performance against fine-tuned BERT and LLaMA
2-chat. In the considered real-world scenarios with data scarcity, FsPONER with
TF-IDF surpasses fine-tuned models by approximately 10% in F1 score.
comment: accepted in the main track at the 27th European Conference on
Artificial Intelligence (ECAI-2024)
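TF-IDF-based few-shot selection, as named in the FsPONER abstract, amounts to ranking annotated examples by their similarity to the input sentence. The sketch below does this with a hand-rolled TF-IDF and cosine similarity. The smoothed IDF formula, the whitespace tokenization, and the maintenance-style example sentences are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_few_shot(query, pool, k=2):
    """Pick the k pool sentences most similar to the query sentence."""
    token_lists = [sentence.lower().split() for sentence in pool + [query]]
    vectors = tfidf_vectors(token_lists)
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    return [pool[i] for i in ranked[:k]]


# Hypothetical annotated pool from a maintenance domain.
pool = [
    "replace the worn bearing on the spindle motor",
    "the invoice was sent to the customer",
    "lubricate the spindle bearing every month",
]
for example in select_few_shot("spindle bearing makes noise", pool, k=2):
    print(example)
```

The selected sentences, together with their entity annotations, would then be placed in the prompt ahead of the input sentence.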
♻ ★ MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning
Yaming Yang, Dilxat Muhtar, Yelong Shen, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Denvy Deng, Feng Sun, Qi Zhang, Weizhu Chen, Yunhai Tong
Parameter-efficient fine-tuning (PEFT) has been widely employed for domain
adaptation, with LoRA being one of the most prominent methods due to its
simplicity and effectiveness. However, in multi-task learning (MTL) scenarios,
LoRA tends to obscure the distinction between tasks by projecting sparse
high-dimensional features from different tasks into the same dense
low-dimensional intrinsic space. This leads to task interference and suboptimal
performance for LoRA and its variants. To tackle this challenge, we propose
MTL-LoRA, which retains the advantages of low-rank adaptation while
significantly enhancing MTL capabilities. MTL-LoRA augments LoRA by
incorporating additional task-adaptive parameters that differentiate
task-specific information and capture shared knowledge across various tasks
within low-dimensional spaces. This approach enables pre-trained models to
jointly adapt to different target domains with a limited number of trainable
parameters. Comprehensive experimental results, including evaluations on public
academic benchmarks for natural language understanding, commonsense reasoning,
and image-text understanding, as well as real-world industrial text Ads
relevance datasets, demonstrate that MTL-LoRA outperforms LoRA and its
variants with comparable or even fewer learnable parameters in MTL settings.
comment: 12 Pages, 4 Figures
♻ ☆ In-game Toxic Language Detection: Shared Task and Attention Residuals AAAI 2023
In-game toxic language has become a hot potato in the gaming industry and
community. Several online game toxicity analysis frameworks and models have
been proposed. However, it is still challenging to detect toxicity due to the
nature of in-game chat, which is extremely short. In this paper, we
describe how the in-game toxic language shared task has been established using
real-world in-game chat data. In addition, we propose and introduce a
model/framework for toxic language token tagging (slot filling) from
in-game chat. The relevant code is publicly available on GitHub:
https://github.com/Yuanzhe-Jia/In-Game-Toxic-Detection
comment: Accepted at AAAI 2023 Poster
♻ ☆ KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement
The constant shifts in social and political contexts, driven by emerging
social movements and political events, lead to new forms of hate content and
previously unrecognized hate patterns that machine learning models may not have
captured. Some recent literature proposes data augmentation-based techniques to
enrich existing hate datasets by incorporating samples that reveal new implicit
hate patterns. This approach aims to improve the model's performance on
out-of-domain implicit hate instances. It is observed that adding more
augmentation samples decreases the model's performance.
In this work, we propose a Knowledge Transfer-driven Concept Refinement
method that distills and refines the concepts related to implicit hate samples
through novel prototype alignment and concept losses, alongside data
augmentation based on concept activation vectors. Experiments with several
publicly available datasets show that incorporating additional implicit samples
reflecting new hate patterns through concept refinement enhances the model's
performance, surpassing baseline results while maintaining cross-dataset
generalization capabilities.
comment: 9 pages, 4 figures, 2 algorithms, 5 tables
♻ ☆ A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet
their transition to real-world applications reveals a critical limitation: the
inability to adapt to individual preferences while maintaining alignment with
universal human values. Current alignment techniques adopt a one-size-fits-all
approach that fails to accommodate users' diverse backgrounds and needs. This
paper presents the first comprehensive survey of personalized alignment -- a
paradigm that enables LLMs to adapt their behavior within ethical boundaries
based on individual preferences. We propose a unified framework comprising
preference memory management, personalized generation, and feedback-based
alignment, systematically analyzing implementation approaches and evaluating
their effectiveness across various scenarios. By examining current techniques,
potential risks, and future challenges, this survey provides a structured
foundation for developing more adaptable and ethically-aligned LLMs.
comment: 10 pages
♻ ☆ GME: Improving Universal Multimodal Retrieval by Multimodal LLMs CVPR 2025
Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang
Universal Multimodal Retrieval (UMR) aims to enable search across various
modalities using a unified model, where queries and candidates can consist of
pure text, images, or a combination of both. Previous work has attempted to
adopt multimodal large language models (MLLMs) to realize UMR using only text
data. However, our preliminary experiments demonstrate that more diverse
multimodal training data can further unlock the potential of MLLMs. Despite its
effectiveness, the existing multimodal training data is highly imbalanced in
terms of modality, which motivates us to develop a training data synthesis
pipeline and construct a large-scale, high-quality fused-modal training
dataset. Based on the synthetic training data, we develop the General
Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR.
Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the
effectiveness of our approach. Experimental results show that our method
achieves state-of-the-art performance among existing UMR methods. Last, we
provide in-depth analyses of model scaling and training strategies, and perform
ablation studies on both the model and synthetic data.
comment: Accepted to CVPR 2025, models at
https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct
♻ ☆ MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba ICLR2025
An ecosystem of Transformer-based models has been established by building
large models with extensive data. Parameter-efficient fine-tuning (PEFT) is a
crucial technology for deploying these models to downstream tasks with minimal
cost while achieving effective performance. Recently, Mamba, a State Space
Model (SSM)-based model, has attracted attention as a potential alternative to
Transformers. While many large-scale Mamba-based models have been proposed,
efficiently adapting pre-trained Mamba-based models to downstream tasks remains
unexplored. In this paper, we conduct an exploratory analysis of PEFT methods
for Mamba. We investigate the effectiveness of existing PEFT methods for
Transformers when applied to Mamba. We also modify these methods to better
align with the Mamba architecture. Additionally, we propose new Mamba-specific
PEFT methods that leverage the distinctive structure of Mamba. Our experiments
indicate that PEFT performs more effectively for Mamba than Transformers.
Lastly, we demonstrate how to effectively combine multiple PEFT methods and
provide a framework that outperforms previous works. To ensure reproducibility,
we will release the code after publication.
comment: Accepted to ICLR2025
♻ ☆ BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions ICLR 2025
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra
Task automation has been greatly empowered by the recent advances in Large
Language Models (LLMs) via Python code, with tasks ranging from software
engineering development to general-purpose reasoning. While current benchmarks
have shown that LLMs can solve tasks using programs like human developers, the
majority of their evaluations are limited to short and self-contained
algorithmic tasks or standalone function calls. Solving challenging and
practical tasks requires the capability of utilizing diverse function calls as
tools to efficiently implement functionalities like data analysis and web
development. In addition, using multiple tools to solve a task needs
compositional reasoning by accurately understanding complex instructions.
Fulfilling both of these characteristics can pose a great challenge for LLMs. To
assess how well LLMs can solve challenging and practical tasks via programs, we
introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple
function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained
tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases on average, with
an average branch coverage of 99%. In addition, we propose a
natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that
automatically transforms the original docstrings into short instructions only
with essential information. Our extensive evaluation of 60 LLMs shows that LLMs
are not yet capable of following complex instructions to use function calls
precisely, with scores up to 60%, significantly lower than the human
performance of 97%. The results underscore the need for further advancements in
this area.
comment: Accepted at ICLR 2025 (Oral), built with love by the BigCode
community :)
♻ ☆ You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation
The goal of translation, be it by human or by machine, is, given some text in
a source language, to produce text in a target language that simultaneously 1)
preserves the meaning of the source text and 2) achieves natural expression in
the target language. However, researchers in the machine translation community
usually assess translations using a single score intended to capture semantic
accuracy and the naturalness of the output simultaneously. In this paper, we
build on recent advances in information theory to mathematically prove and
empirically demonstrate that such single-score summaries do not and cannot give
the complete picture of a system's true performance. Concretely, we prove that
a tradeoff exists between accuracy and naturalness and demonstrate it by
evaluating the submissions to the WMT24 shared task. Our findings help explain
well-known empirical phenomena, such as the observation that optimizing
translation systems for a specific accuracy metric (like BLEU) initially
improves the system's naturalness, while "overfitting" the system to the
metric can significantly degrade its naturalness. Thus, we advocate for a
change in how translations are evaluated: rather than comparing systems using a
single number, they should be compared on an accuracy-naturalness plane.
comment: Corrected a typo in Eq (3)
♻ ☆ Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach ICLR 2025
Sinan Fan, Liang Xie, Chen Shen, Ge Teng, Xiaosong Yuan, Xiaofeng Zhang, Chenxi Huang, Wenxiao Wang, Xiaofei He, Jieping Ye
Prompt-tuning (PT) for large language models (LLMs) can facilitate the
performance on various conventional NLP tasks with significantly fewer
trainable parameters. However, our investigation reveals that PT provides
limited improvement and may even degrade the primitive performance of LLMs on
complex reasoning tasks. Such a phenomenon suggests that soft prompts can
positively impact certain instances while negatively affecting others,
particularly during the later phases of reasoning. To address these challenges,
we first identify an information accumulation within the soft prompts. Through
detailed analysis, we demonstrate that this phenomenon is often accompanied by
erroneous information flow patterns in the deeper layers of the model, which
ultimately lead to incorrect reasoning outcomes. We propose a novel method
called Dynamic Prompt Corruption (DPC) to take better advantage of soft prompts
in complex reasoning tasks, which dynamically adjusts the influence of soft
prompts based on their impact on the reasoning process. Specifically, DPC
consists of two stages: Dynamic Trigger and Dynamic Corruption. First, Dynamic
Trigger measures the impact of soft prompts, identifying whether it is beneficial or
detrimental. Then, Dynamic Corruption mitigates the negative effects of soft
prompts by selectively masking key tokens that interfere with the reasoning
process. We validate the proposed approach through extensive experiments on
various LLMs and reasoning tasks, including GSM8K, MATH, and AQuA. Experimental
results demonstrate that DPC can consistently enhance the performance of PT,
achieving 4%-8% accuracy gains compared to vanilla prompt tuning, highlighting
the effectiveness of our approach and its potential to enhance complex
reasoning in LLMs.
comment: Accepted by ICLR 2025
♻ ☆ Did ChatGPT or Copilot use alter the style of internet news headlines? A time series regression analysis
The release of advanced Large Language Models (LLMs) such as ChatGPT and
Copilot is changing the way text is created and may influence the content that
we find on the web. This study investigated whether the release of these two
popular LLMs coincided with a change in writing style in headlines and links on
worldwide news websites. 175 NLP features were obtained for each text in a
dataset of 451 million headlines/links. An interrupted time series analysis was
applied for each of the 175 NLP features to evaluate whether there were any
statistically significant sustained changes after the release dates of ChatGPT
and/or Copilot. There were a total of 44 features that did not appear to have
any significant sustained change after the release of ChatGPT/Copilot. A total
of 91 other features did show significant change with ChatGPT and/or Copilot
although significance with earlier control LLM release dates (GPT-1/2/3,
Gopher) removed them from consideration. This initial analysis suggests these
language models may have had a limited impact on the style of individual news
headlines/links, with respect to only some NLP measures.
♻ ☆ Generalizable Prompt Learning of CLIP: A Brief Overview
Existing vision-language models (VLMs) such as CLIP have showcased an
impressive capability to generalize well across various downstream tasks. These
models leverage the synergy between visual and textual information, enabling
them to understand and reason about the content present in images and text in a
unified manner. This article provides a brief overview of CLIP based on
few-shot prompt learning, including experimental data and technical
characteristics of some methods. The purpose of this review is to provide a
reference for researchers who have just started their research in generalizable
prompting of CLIP through few-shot training for classification across 15
datasets and also to facilitate the integration of this field by researchers in
other downstream tasks.
♻ ☆ Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
Low-resource machine translation (MT) presents a diversity of community needs
and application challenges that remain poorly understood. To complement surveys
and focus groups, which tend to rely on small samples of respondents, we
propose an observational study on actual usage patterns of tetun.org, a
specialized MT service for the Tetun language, which is the lingua franca in
Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that
challenge assumptions based on existing corpora. We find that users, many of
them students on mobile devices, typically translate text from a high-resource
language into Tetun across diverse domains including science, healthcare, and
daily life. This contrasts sharply with available Tetun corpora, which are
dominated by news articles covering government and social issues. Our results
suggest that MT systems for institutionalized minority languages like Tetun
should prioritize accuracy on domains relevant to educational contexts, in the
high-resource to low-resource direction. More broadly, this study demonstrates
how observational analysis can inform low-resource language technology
development, by grounding research in practical community needs.
comment: to be published in LoResMT 2025
♻ ☆ Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method EMNLP 2024
As the scale of training corpora for large language models (LLMs) grows,
model developers become increasingly reluctant to disclose details on their
data. This lack of transparency poses challenges to scientific evaluation and
ethical deployment. Recently, pretraining data detection approaches, which
infer whether a given text was part of an LLM's training data through black-box
access, have been explored. The Min-K% Prob method, which has achieved
state-of-the-art results, assumes that a non-training example tends to contain
a few outlier words with low token probabilities. However, the effectiveness
may be limited as it tends to misclassify non-training texts that contain many
common words with high probabilities predicted by LLMs. To address this issue,
we introduce a divergence-based calibration method, inspired by the
divergence-from-randomness concept, to calibrate token probabilities for
pretraining data detection. We compute the cross-entropy (i.e., the divergence)
between the token probability distribution and the token frequency distribution
to derive a detection score. We have developed a Chinese-language benchmark,
PatentMIA, to assess the performance of detection approaches for LLMs on
Chinese text. Experimental results on English-language benchmarks and PatentMIA
demonstrate that our proposed method significantly outperforms existing
methods. Our code and PatentMIA benchmark are available at
https://github.com/zhang-wei-chao/DC-PDD.
comment: Accepted by EMNLP 2024 main
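A minimal sketch of the two scoring ideas mentioned in the abstract above, under stated assumptions. `min_k_prob_score` follows the commonly described Min-K% Prob recipe (average log-probability of the lowest-probability fraction of tokens). `calibrated_score` is only one possible reading of "cross-entropy between the token probability distribution and the token frequency distribution"; its exact form, the function names, and the reference frequencies are assumptions, not the authors' released implementation.

```python
import math

def min_k_prob_score(token_log_probs, k=0.2):
    """Average log-probability of the k fraction of lowest-probability tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n

def calibrated_score(token_probs, token_freqs):
    """Hypothetical calibration: mean of -p_model(x_i) * log p_ref(x_i).

    token_probs: model probability of each observed token.
    token_freqs: relative frequency of the same token in a reference corpus.
    Tokens that are frequent in the reference corpus contribute little,
    even when the model assigns them a high probability.
    """
    if len(token_probs) != len(token_freqs) or not token_probs:
        raise ValueError("inputs must be non-empty and of equal length")
    terms = [-p * math.log(q) for p, q in zip(token_probs, token_freqs)]
    return sum(terms) / len(terms)
```

In this sketch a high-probability common word adds less to the calibrated score than a high-probability rare word, which is the failure case of uncalibrated scores that the abstract describes.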
♻ ☆ CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge
In this paper, we introduce CodingTeachLLM, a large language model (LLM)
designed for coding teaching. Specifically, we aim to enhance the coding ability
of the LLM and guide it toward a better teaching mode in educational contexts. Thus, we
propose an end-to-end, prior-based, three-phase supervised fine-tuned model,
which proves more competitive than traditional fine-tuning methods. More
specifically, our model realizes the structural disassembly and incremental
guided output of educational knowledge. To this end, we robustify data
classification of three types via a sampler and overlap estimation neural
network, and inject the preprocessed datasets into the pre-trained model in three
batches for LoRA fine-tuning. Then, we design a prior module that couples system
prompts, vector databases, and abstract syntax tree task segmentation. Finally,
the compression method and regularization constraint are applied to the
prior-based fine-tuned model, followed by a text filter at the output end to
obtain incremental guided results. Our model represents the first research
effort to truly embody the tutor role with the features of abundant educational
knowledge, step-by-step incremental guided outputs and non-disclosure of
answers. Extensive experiments show that our model also achieves
state-of-the-art coding ability compared to open-source models, reaching an
impressive 75.10% on the HumanEval (pass@1) benchmark. Additionally, our model
maintains strong conversational capabilities, with the 13B quantized version
achieving scores of 56.34, 50.60, and 45.27 respectively on the MMLU, C-Eval,
and AGIEval (5 shot) dialogue evaluation benchmarks.
comment: 9 pages, 2 figures
♻ ☆ GENERator: A Long-Context Generative Genomic Foundation Model
Advancements in DNA sequencing technologies have significantly improved our
ability to decode genomic sequences. However, the prediction and interpretation
of these sequences remain challenging due to the intricate nature of genetic
material. Large language models (LLMs) have introduced new opportunities for
biological sequence analysis. Recent developments in genomic language models
have underscored the potential of LLMs in deciphering DNA sequences.
Nonetheless, existing models often face limitations in robustness and
application scope, primarily due to constraints in model structure and training
data scale. To address these limitations, we present GENERator, a generative
genomic foundation model featuring a context length of 98k base pairs (bp) and
1.2B parameters. Trained on an expansive dataset comprising 386B bp of
eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across
both established and newly proposed benchmarks. The model adheres to the
central dogma of molecular biology, accurately generating protein-coding
sequences that translate into proteins structurally analogous to known
families. It also shows significant promise in sequence optimization,
particularly through the prompt-responsive generation of enhancer sequences
with specific activity profiles. These capabilities position the GENERator as a
pivotal tool for genomic research and biotechnological advancement, enhancing
our ability to interpret and predict complex biological systems and enabling
precise genomic interventions. Implementation details and supplementary
resources are available at https://github.com/GenerTeam/GENERator.
♻ ☆ Self-Vocabularizing Training for Neural Machine Translation NAACL
Past vocabulary learning techniques identify relevant vocabulary before
training, relying on statistical and entropy-based assumptions that largely
neglect the role of model training. Empirically, we observe that trained
translation models are induced to use a byte-pair encoding (BPE) vocabulary
subset distinct from the original BPE vocabulary, leading to performance
improvements when retrained with the induced vocabulary. In this paper, we
analyze this discrepancy in neural machine translation by examining vocabulary
and entropy shifts during self-training--where each iteration generates a
labeled dataset by pairing source sentences with the model's predictions to
define a new vocabulary. Building on these insights, we propose
self-vocabularizing training, an iterative method that self-selects a smaller,
more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we
find that deeper model architectures lead to both an increase in unique token
usage and a 6-8% reduction in vocabulary size.
comment: Accepted to NAACL SRW 2025
♻ ☆ Lean Formalization of Generalization Error Bound by Rademacher Complexity
We formalize the generalization error bound using Rademacher complexity in
the Lean 4 theorem prover. Generalization error quantifies the gap between a
learning machine's performance on given training data versus unseen test data,
and Rademacher complexity serves as an estimate of this error based on the
complexity of learning machines, or hypothesis class. Unlike traditional
methods such as PAC learning and VC dimension, Rademacher complexity is
applicable across diverse machine learning scenarios including deep learning
and kernel methods. We formalize key concepts and theorems, including the
empirical and population Rademacher complexities, and establish generalization
error bounds through formal proofs of McDiarmid's inequality, Hoeffding's
lemma, and symmetrization arguments.
comment: modified a typo in affiliation
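For orientation, the textbook form of the bound that the abstract above refers to can be written as follows. This is the standard statement for a function class $\mathcal{F}$ with values in $[0,1]$ and an i.i.d. sample $z_1,\dots,z_n$; the symbols are chosen here for illustration, and the exact statement proved in the Lean development may differ in constants or hypotheses.

```latex
% Population Rademacher complexity, with \sigma_i i.i.d. uniform on \{-1,+1\}:
\mathfrak{R}_n(\mathcal{F})
  = \mathbb{E}_{z,\sigma}\!\left[\sup_{f\in\mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n}\sigma_i f(z_i)\right]

% With probability at least 1-\delta over the sample, for all f \in \mathcal{F}:
\mathbb{E}[f(z)]
  \;\le\; \frac{1}{n}\sum_{i=1}^{n} f(z_i)
  \;+\; 2\,\mathfrak{R}_n(\mathcal{F})
  \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}}
```

The last term comes from McDiarmid's inequality and the factor of two from the symmetrization argument, which matches the ingredients listed in the abstract.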
♻ ☆ CoRanking: Collaborative Ranking with Small and Large Ranking Agents
Large Language Models (LLMs) have demonstrated superior listwise ranking
performance. However, their superior performance often relies on large-scale
parameters (e.g., GPT-4) and a repetitive sliding window process, which
introduces significant efficiency challenges. In this paper, we propose
CoRanking, a novel collaborative ranking framework that combines small
and large ranking models for efficient and effective ranking. CoRanking first
employs a small-size reranker to pre-rank all the candidate passages, bringing
relevant ones to the top part of the list (e.g., top-20). Then, the LLM listwise
reranker is applied to only rerank these top-ranked passages instead of the
whole list, substantially enhancing overall ranking efficiency. Although this is more
efficient, previous studies have revealed that LLM listwise rerankers have
significant positional biases on the order of input passages. Directly feeding the
top-ranked passages from the small reranker may result in sub-optimal
performance of the LLM listwise reranker. To alleviate this problem, we introduce a
passage order adjuster trained via reinforcement learning, which reorders the
top passages from the small reranker to align with the LLM's preferences of
passage order. Extensive experiments on three IR benchmarks demonstrate that
CoRanking significantly improves efficiency (reducing ranking latency by about
70%) while achieving even better effectiveness compared to using only the LLM
listwise reranker.
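The three-stage flow described in the abstract above can be sketched as follows. This is only an illustration of the control flow: `small_score`, `adjust_order`, and `llm_rerank` are hypothetical callables standing in for the small reranker, the RL-trained passage order adjuster, and the LLM listwise reranker, none of which are specified in the abstract.

```python
def co_rank(query, passages, small_score, adjust_order, llm_rerank, top_k=20):
    """Pre-rank everything cheaply, then let the LLM rerank only the head.

    small_score(query, passage) -> float   (hypothetical small reranker)
    adjust_order(query, head)   -> list    (hypothetical order adjuster)
    llm_rerank(query, head)     -> list    (hypothetical LLM listwise reranker)
    """
    # Stage 1: the small reranker orders all candidate passages.
    pre_ranked = sorted(passages, key=lambda p: small_score(query, p), reverse=True)
    head, tail = pre_ranked[:top_k], pre_ranked[top_k:]
    # Stage 2: reorder the head to suit the LLM's positional preferences.
    head = adjust_order(query, head)
    # Stage 3: the LLM reranks only the top-k passages, not the whole list.
    head = llm_rerank(query, head)
    return head + tail
```

Because the LLM sees only `top_k` passages once, no sliding window over the full list is needed, which is where the efficiency gain claimed in the abstract would come from in a setup like this.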
♻ ☆ CancerLLM: A Large Language Model in Cancer Domain
Mingchen Li, Jiatan Huang, Jeremy Yeung, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang
Medical Large Language Models (LLMs) have demonstrated impressive performance
on a wide variety of medical NLP tasks; however, there is still no LLM
specifically designed for phenotyping identification and diagnosis in the cancer
domain. Moreover, these LLMs typically have several billions of parameters,
making them computationally expensive for healthcare systems. Thus, in this
study, we propose CancerLLM, a model with 7 billion parameters and a
Mistral-style architecture, pre-trained on nearly 2.7M clinical notes and over
515K pathology reports covering 17 cancer types, followed by fine-tuning on two
cancer-relevant tasks, including cancer phenotype extraction and cancer
diagnosis generation. Our evaluation demonstrated that CancerLLM achieves
state-of-the-art results with an F1 score of 91.78% on phenotyping extraction and
86.81% on diagnosis generation. It outperformed existing LLMs, with an average
F1 score improvement of 9.23%. Additionally, CancerLLM demonstrated its
efficiency in time and GPU usage, and its robustness compared with other LLMs. We
demonstrated that CancerLLM can potentially provide an effective and robust
solution to advance clinical research and practice in the cancer domain.
comment: new version, add the RAG version of cancerLLM
♻ ☆ Non-Determinism of "Deterministic" LLM Settings
Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, Breck Baldwin
LLM (large language model) practitioners commonly notice that outputs can
vary for the same inputs under settings expected to be deterministic. Yet the
questions of how pervasive this is, and with what impact on results, have not
to our knowledge been systematically investigated. We investigate
non-determinism in five LLMs configured to be deterministic when applied to
eight common tasks across 10 runs, in both zero-shot and few-shot settings.
We see accuracy variations up to 15% across naturally occurring runs with a gap
of best possible performance to worst possible performance up to 70%. In fact,
none of the LLMs consistently delivers repeatable accuracy across all tasks,
much less identical output strings. Sharing preliminary results with insiders
has revealed that non-determinism is perhaps essential to the efficient use of
compute resources via co-mingled data in input buffers so this issue is not
going away anytime soon. To better quantify our observations, we introduce
metrics focused on quantifying determinism, TARr@N for the total agreement rate
at N runs over raw output, and TARa@N for total agreement rate of parsed-out
answers. Our code and data are publicly available at
http://github.com/REDACTED.
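A minimal sketch of how the two agreement metrics named in the abstract above could be computed, assuming "total agreement" means that all N runs for an item produce the same value. The `parse_answer` helper is hypothetical; the abstract does not say how answers are parsed out of raw outputs.

```python
def total_agreement_rate(runs_per_item):
    """Fraction of items whose N outputs are all identical.

    runs_per_item: list of lists, one inner list of N outputs per item.
    """
    if not runs_per_item:
        raise ValueError("need at least one item")
    agree = sum(1 for outputs in runs_per_item if len(set(outputs)) == 1)
    return agree / len(runs_per_item)

def parse_answer(raw):
    """Hypothetical parser: keep the text after the last 'Answer:' marker."""
    return raw.rsplit("Answer:", 1)[-1].strip().lower()

# Two runs (N = 2) over three items.
raw_runs = [
    ["Answer: B", "Let me think. Answer: B"],  # same answer, different strings
    ["Answer: A", "Answer: A"],                # identical strings
    ["Answer: C", "Answer: D"],                # different answers
]
tar_raw = total_agreement_rate(raw_runs)
tar_ans = total_agreement_rate([[parse_answer(r) for r in runs] for runs in raw_runs])
```

Here agreement over raw strings is stricter than agreement over parsed answers, so `tar_raw` can never exceed `tar_ans` for the same runs.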
♻ ☆ Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems
This thesis employs a hybrid CNN-Transformer architecture, in conjunction
with a detailed anthropological framework, to investigate potential historical
connections between the visual morphology of the Indus Valley script and
pictographic systems of the Tibetan-Yi Corridor. Through an ensemble
methodology of three target scripts across 15 independently trained models, we
demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold
higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze
Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems.
Additionally, and contrary to our current understanding of the networks of the
Indus Valley Civilization, the Indus script unexpectedly maps closer to
Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to
the aforementioned contemporaneous West Asian signaries, which recorded
mean cosine similarities of 0.104 and 0.080, respectively, despite their close geographic
proximity and evident trade relations. Across various dimensionality reduction
practices and clustering methodologies, the Indus script consistently clusters
closest to Tibetan-Yi Corridor scripts. Our computational results align with
qualitative observations of specific pictorial parallels in numeral systems,
gender markers, and key iconographic elements; this is further supported by
archaeological evidence of sustained contact networks along the ancient
Shu-Shendu road in tandem with the Indus Valley Civilization's decline,
providing a plausible transmission pathway. While alternative explanations
cannot be ruled out, the specificity and consistency of observed similarities
challenge conventional narratives of isolated script development and suggest
more complex ancient cultural transmission networks between South and East Asia
than previously recognized.
comment: 106 pages (42 main text, 6 references, 58 appendices). 21 figures, 4
tables in main text; 106 figures, 8 tables total. Code:
https://github.com/oohalakkadi/ivc2tyc. Undergraduate thesis at Duke Kunshan
University. Accepted for presentation at the 52nd International Conference
for Computer Applications & Quantitative Methods in Archaeology (CAA 2025),
Athens, Greece
♻ ☆ How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
Chain-of-thought prompting has emerged as a powerful technique for enabling
large language models (LLMs) to solve complex reasoning tasks. However, these
reasoning chains can be verbose, raising concerns about efficiency. In
response, recent works have sought to decrease response lengths through simple
prompting strategies (e.g. 'be concise'). In this work, we conduct the first
systematic study of the relationship between reasoning length and model
performance across a diverse range of compression instructions (e.g. 'use 10
words or less' or 'remove all punctuation'). In doing so, we discover a
universal tradeoff between reasoning length and accuracy that persists across
even very distinct reasoning chains. We demonstrate that this tradeoff emerges
from a sharp threshold behavior at the question level: each task has an
intrinsic 'token complexity' - a minimal number of tokens required for
successful problem-solving. We show how token complexity enables us to compute
information-theoretic limits on the accuracy-compression tradeoff, and find
that prompt-based compression strategies operate far from these theoretical
limits. This suggests there may be significant room for improvement and our
framework provides a benchmark to help researchers evaluate progress in
reasoning efficiency. Our work also highlights the importance of adaptive
compression -- giving shorter responses for easier questions -- and we show
that token complexity is a useful tool for measuring this capability.
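The threshold behavior described in the abstract above can be illustrated with a small sketch. It assumes a question is solved exactly when the response uses at least that question's token complexity. The greedy budget bound in `accuracy_upper_bound` is an illustrative construction under that assumption, not the information-theoretic limit derived in the paper, and all names are hypothetical.

```python
def predicted_accuracy(token_complexities, tokens_used):
    """Accuracy under the threshold model: solved iff tokens_used >= complexity."""
    if not token_complexities or len(token_complexities) != len(tokens_used):
        raise ValueError("inputs must be non-empty and of equal length")
    solved = sum(1 for tau, used in zip(token_complexities, tokens_used) if used >= tau)
    return solved / len(token_complexities)

def accuracy_upper_bound(token_complexities, avg_budget):
    """Best accuracy reachable with an average budget of avg_budget tokens.

    Assumes an ideal adaptive policy that spends exactly the token complexity
    on the cheapest questions first and nothing on the questions it skips.
    """
    if not token_complexities:
        raise ValueError("need at least one question")
    remaining = avg_budget * len(token_complexities)
    solved = 0
    for tau in sorted(token_complexities):
        if tau > remaining:
            break
        remaining -= tau
        solved += 1
    return solved / len(token_complexities)
```

Comparing `predicted_accuracy` for a fixed compression prompt against `accuracy_upper_bound` at the same average length gives a rough sense of how far a prompt is from an adaptive policy that gives shorter responses to easier questions.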