Computation and Language
☆ Bi-Mamba: Towards Accurate 1-Bit State Space Models
The typical selective state-space model (SSM) of Mamba addresses several
limitations of Transformers, such as quadratic computational complexity with
sequence length and significant inference-time memory requirements due to the
key-value cache. However, the growing size of Mamba models continues to pose
training and deployment challenges and raises environmental concerns due to
considerable energy consumption. In this work, we introduce Bi-Mamba, a
scalable and powerful 1-bit Mamba architecture designed for more efficient
large language models with multiple sizes across 780M, 1.3B, and 2.7B. Bi-Mamba
models are trained from scratch on the same data volume as regular LLM pretraining using
an autoregressive distillation loss. Extensive experimental results on language
modeling demonstrate that Bi-Mamba achieves performance comparable to its
full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than
post-training-binarization (PTB) Mamba baselines, while significantly reducing
memory footprint and energy consumption compared to the original Mamba model.
Our study pioneers a new linear computational complexity LLM framework under
low-bit representation and facilitates the future design of specialized
hardware tailored for efficient 1-bit Mamba-based LLMs.
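To make the idea of 1-bit weights concrete, here is a minimal sketch of sign-based binarization with a per-row scale. It is a generic textbook scheme, not necessarily the one Bi-Mamba uses; the abstract does not specify the binarization rule or the form of the distillation loss. The function names `binarize_row` and `dequantize_row` are hypothetical.

```python
from typing import List, Tuple


def binarize_row(weights: List[float]) -> Tuple[float, List[int]]:
    """Return (scale, signs) so that scale * signs approximates weights.

    Assumption: a generic sign-plus-scale scheme, not the paper's method.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    # For a fixed sign pattern, the mean absolute value is the scale
    # that minimizes the squared reconstruction error.
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize_row(scale: float, signs: List[int]) -> List[float]:
    """Rebuild an approximate full-precision row from its 1-bit form."""
    return [scale * s for s in signs]
```

Each weight is stored as a single sign bit plus one shared floating-point scale per row, which is where the memory savings of a 1-bit model come from.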
☆ Tackling prediction tasks in relational databases with LLMs
Though large language models (LLMs) have demonstrated exceptional performance
across numerous problems, their application to predictive tasks in relational
databases remains largely unexplored. In this work, we address the notion that
LLMs cannot yield satisfactory results on relational databases due to their
interconnected tables, complex relationships, and heterogeneous data types.
Using the recently introduced RelBench benchmark, we demonstrate that even a
straightforward application of LLMs achieves competitive performance on these
tasks. These findings establish LLMs as a promising new baseline for ML on
relational databases and encourage further research in this direction.
☆ CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task
The task of converting Hanyu Pinyin abbreviations to Chinese characters
represents a significant branch within the domain of Chinese Spelling
Correction (CSC). This task is typically one of text-length alignment; however,
due to the limited informational content in pinyin abbreviations, achieving
accurate conversion is challenging. In this paper, we propose CNMBert which
stands for zh-CN Pinyin Multi-mask Bert Model as a solution to this issue.
CNMBert surpasses few-shot GPT models, achieving a 59.63% MRR on a
10,424-sample Hanyu Pinyin abbreviation test dataset.
comment: 9 pages, 2 figures
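The MRR figure reported above is a mean reciprocal rank. As a rough illustration of how such a score is computed, the sketch below assumes each test item has a ranked list of candidate strings and exactly one gold answer; the function name and input layout are hypothetical and not taken from the CNMBert evaluation code.

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Average of 1/rank of the gold answer; items with no hit score 0."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        raise ValueError("at least one item is required")
    total = 0.0
    for candidates, answer in zip(ranked, gold):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)
```

A score of 1.0 would mean the correct characters are always ranked first; lower values mean the gold answer tends to appear further down the candidate list or not at all.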
☆ Drowning in Documents: Consequences of Scaling Reranker Inference
Rerankers, typically cross-encoders, are often used to re-score the documents
retrieved by cheaper initial IR systems. This is because, though expensive,
rerankers are assumed to be more effective. We challenge this assumption by
measuring reranker performance for full retrieval, not just re-scoring
first-stage retrieval. Our experiments reveal a surprising trend: the best
existing rerankers provide diminishing returns when scoring progressively more
documents and actually degrade quality beyond a certain limit. In fact, in this
setting, rerankers can frequently assign high scores to documents with no
lexical or semantic overlap with the query. We hope that our findings will spur
future research to improve reranking.
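The two-stage setup discussed in this abstract can be sketched as follows. This is a toy illustration under stated assumptions: `first_stage` and `reranker` are hypothetical scoring callables, and `k` is the number of first-stage candidates passed to the reranker. Setting `k` to the corpus size corresponds to the "full retrieval" setting in which the reranker scores every document.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    k: int,
) -> List[str]:
    """Keep the top-k docs by the cheap scorer, then reorder them by the reranker."""
    if k <= 0:
        raise ValueError("k must be positive")
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct shared words."""
    return float(len(set(query.split()) & set(doc.split())))
```

With a small `k`, the first stage filters out documents that the reranker might otherwise score highly by mistake; as `k` grows, any such reranker errors are free to reach the top of the final ranking.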
☆ The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
Large Multimodal Models (LMMs) exhibit impressive performance across various
multimodal tasks. However, their effectiveness in cross-cultural contexts
remains limited due to the predominantly Western-centric nature of most data
and models. Conversely, multi-agent models have shown significant capability in
solving complex tasks. Our study evaluates the collective performance of LMMs
in a multi-agent interaction setting for the novel task of cultural image
captioning. Our contributions are as follows: (1) We introduce MosAIC, a
Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs
with distinct cultural personas; (2) We provide a dataset of culturally
enriched image captions in English for images from China, India, and Romania
across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable
metric for evaluating cultural information within image captions; and (4) We
show that the multi-agent interaction outperforms single-agent models across
different metrics, and offer valuable insights for future research. Our dataset
and models can be accessed at https://github.com/MichiganNLP/MosAIC.
☆ Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking
The paper describes a system designed by the Advacheck team to recognise
machine-generated and human-written texts in the monolingual subtask of the GenAI
Detection Task 1 competition. Our system is a multi-task architecture
with a shared Transformer encoder between several classification heads. One head
is responsible for binary classification between human-written and
machine-generated texts, while the other heads are auxiliary multiclass
classifiers for texts of different domains from particular datasets. As
multiclass heads were trained to distinguish the domains presented in the data,
they provide a better understanding of the samples. This approach led us to
achieve first place in the official ranking with an 83.07% macro F1-score on
the test set, surpassing the baseline by 10%. We further study the obtained system
through ablation, error, and representation analyses, finding that multi-task
learning outperforms the single-task mode and that simultaneous tasks form a cluster
structure in the embedding space.
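The ranking metric mentioned above, macro F1, averages per-class F1 scores with equal weight for each class. The sketch below is a plain-Python illustration with hypothetical label names; it is not the competition's official scorer.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = set(y_true) | set(y_pred)
    f1_sum = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_sum += (2 * tp / denom) if denom else 0.0
    return f1_sum / len(labels)
```

Because every class counts equally, a detector that does well on machine-generated text but poorly on human-written text is penalised more than it would be under plain accuracy.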
☆ Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment
We explore how large language models (LLMs) can be influenced by prompting
them to alter their initial decisions and align them with established ethical
frameworks. Our study is based on two experiments designed to assess the
susceptibility of LLMs to moral persuasion. In the first experiment, we examine
the susceptibility to moral ambiguity by evaluating a Base Agent LLM on morally
ambiguous scenarios and observing how a Persuader Agent attempts to modify the
Base Agent's initial decisions. The second experiment evaluates the
susceptibility of LLMs to align with predefined ethical frameworks by prompting
them to adopt specific value alignments rooted in established philosophical
theories. The results demonstrate that LLMs can indeed be persuaded in morally
charged scenarios, with the success of persuasion depending on factors such as
the model used, the complexity of the scenario, and the conversation length.
Notably, LLMs of distinct sizes but from the same company produced markedly
different outcomes, highlighting the variability in their susceptibility to
ethical persuasion.
☆ FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models
By adapting Large Language Models (LLMs) to domain-specific tasks or
enriching them with domain-specific knowledge, we can fully harness the
capabilities of LLMs. Nonetheless, a gap persists in achieving simultaneous
mutual enhancement between the server's LLM and the downstream clients' Small
Language Models (SLMs). To address this, we propose FedCoLLM, a novel and
parameter-efficient federated framework designed for co-tuning LLMs and SLMs.
This approach is aimed at adaptively transferring server-side LLMs' knowledge to
clients' SLMs while simultaneously enriching the LLMs with domain insights from
the clients. To accomplish this, FedCoLLM utilizes lightweight adapters in
conjunction with SLMs, facilitating knowledge exchange between server and
clients in a manner that respects data privacy while also minimizing
computational and communication overhead. Our evaluation of FedCoLLM, utilizing
various public LLMs and SLMs across a range of NLP text generation tasks,
reveals that the performance of clients' SLMs experiences notable improvements
with the assistance of the LLMs. Simultaneously, the LLMs enhanced via FedCoLLM
achieve comparable performance to that obtained through direct fine-tuning on
clients' data.
★ Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search
Jinhao Jiang, Zhipeng Chen, Yingqian Min, Jie Chen, Xiaoxue Cheng, Jiapeng Wang, Yiru Tang, Haoxiang Sun, Jia Deng, Wayne Xin Zhao, Zheng Liu, Dong Yan, Jian Xie, Zhongyuan Wang, Ji-Rong Wen
Recently, test-time scaling has garnered significant attention from the
research community, largely due to the substantial advancements of the o1 model
released by OpenAI. By allocating more computational resources during the
inference phase, large language models (LLMs) can extensively explore the
solution space by generating more thought tokens or diverse solutions, thereby
producing more accurate responses. However, developing an o1-like reasoning
approach is challenging, and researchers have been making various attempts to
advance this open area of research. In this paper, we present a preliminary
exploration into enhancing the reasoning abilities of LLMs through
reward-guided tree search algorithms. This framework is implemented by
integrating the policy model, reward model, and search algorithm. It is
primarily constructed around a tree search algorithm, where the policy model
navigates a dynamically expanding tree guided by a specially trained reward
model. We thoroughly explore various design considerations necessary for
implementing this framework and provide a detailed report of the technical
aspects. To assess the effectiveness of our approach, we focus on mathematical
reasoning tasks and conduct extensive evaluations on four challenging datasets,
significantly enhancing the reasoning abilities of LLMs.
comment: LLM;Complex Reasoning;Math
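The interplay of policy model, reward model, and search algorithm described in this abstract can be sketched as a small beam-style tree search. This is an illustrative simplification under stated assumptions, not the report's algorithm: `propose` stands in for the policy model, `reward` for the reward model, and `beam_width` and `max_depth` are hypothetical parameters.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # a partial solution as a sequence of reasoning steps


def reward_guided_search(
    root: State,
    propose: Callable[[State], List[str]],
    reward: Callable[[State], float],
    is_terminal: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 8,
) -> State:
    """Expand the tree level by level, keeping the highest-reward nodes."""
    if beam_width < 1 or max_depth < 1:
        raise ValueError("beam_width and max_depth must be positive")
    frontier = [root]
    best, best_score = root, reward(root)
    for _ in range(max_depth):
        children: List[State] = []
        for state in frontier:
            if is_terminal(state):
                continue  # finished solutions are not expanded further
            for step in propose(state):
                children.append(state + (step,))
        if not children:
            break
        # The reward function decides which branches survive.
        children.sort(key=reward, reverse=True)
        frontier = children[:beam_width]
        top_score = reward(frontier[0])
        if top_score > best_score:
            best, best_score = frontier[0], top_score
    return best
```

In a real system the proposal and scoring functions would be model calls, and the search would typically cache reward values rather than recomputing them during sorting.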
☆ Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare
This review examines the development of abstractive NLP-based text
summarization approaches and compares them to existing techniques for
extractive summarization. A brief history of text summarization from the 1950s
to the introduction of pre-trained language models such as Bidirectional
Encoder Representations from Transformers (BERT) and Generative Pre-training
Transformers (GPT) is presented. In total, 60 studies were identified in
PubMed and Web of Science, of which 29 were excluded and 24 were read and
evaluated for eligibility, resulting in the use of seven studies for further
analysis. This chapter also includes a section with examples including an
example of a comparison between GPT-3 and state-of-the-art GPT-4 solutions in
scientific text summarisation. Natural language processing has not yet reached
its full potential in the generation of brief textual summaries. As there are
acknowledged concerns that must be addressed, we can expect gradual
introduction of such models in practice.
comment: 16 pages, 5 figures, 1 table
☆ Federated Incremental Named Entity Recognition
Federated Named Entity Recognition (FNER) boosts model training within each
local client by aggregating the model updates of decentralized local clients,
without sharing their private data. However, existing FNER methods assume fixed
entity types and local clients in advance, leading to their ineffectiveness in
practical applications. In a more realistic scenario, local clients receive new
entity types continuously, while new local clients collecting novel data may
irregularly join the global FNER training. This challenging setup, referred to
here as Federated Incremental NER, causes the global model to suffer from
heterogeneous forgetting of old entity types from both intra-client and
inter-client perspectives. To overcome these challenges, we propose a
Local-Global Forgetting Defense (LGFD) model. Specifically, to address
intra-client forgetting, we develop a structural knowledge distillation loss to
retain the latent space's feature structure and a pseudo-label-guided
inter-type contrastive loss to enhance discriminative capability over different
entity types, effectively preserving previously learned knowledge within local
clients. To tackle inter-client forgetting, we propose a task switching monitor
that can automatically identify new entity types under privacy protection and
store the latest old global model for knowledge distillation and
pseudo-labeling. Experiments demonstrate significant improvement of our LGFD
model over comparison methods.
comment: Under Review
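The aggregation of client updates mentioned at the start of this abstract is commonly done with a sample-weighted average (FedAvg). The sketch below assumes that rule and flat parameter vectors; the abstract does not state which aggregation rule the LGFD model uses, and the function name is hypothetical.

```python
from typing import List


def federated_average(
    client_params: List[List[float]], client_sizes: List[int]
) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    if any(len(p) != dim for p in client_params):
        raise ValueError("all clients must share the same parameter shape")
    averaged = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        weight = size / total
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged
```

Only parameters and sample counts leave each client, so the raw annotated text stays local, which is the privacy property the abstract refers to.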
☆ OASIS: Open Agents Social Interaction Simulations on One Million Agents
Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Wanli Ouyang, Yu Qiao, Philip Torr, Jing Shao
There has been a growing interest in enhancing rule-based agent-based models
(ABMs) for social media platforms (i.e., X, Reddit) with more realistic
large language model (LLM) agents, thereby allowing for a more nuanced study of
complex systems. As a result, several LLM-based ABMs have been proposed in the
past year. While they hold promise, each simulator is specifically designed to
study a particular scenario, making it time-consuming and resource-intensive to
explore other phenomena using the same ABM. Additionally, these models simulate
only a limited number of agents, whereas real-world social media platforms
involve millions of users. To this end, we propose OASIS, a generalizable and
scalable social media simulator. OASIS is designed based on real-world social
media platforms, incorporating dynamically updated environments (i.e.,
dynamic social networks and post information), diverse action spaces
(i.e., following, commenting), and recommendation systems (i.e.,
interest-based and hot-score-based). Additionally, OASIS supports large-scale
user simulations, capable of modeling up to one million users. With these
features, OASIS can be easily extended to different social media platforms to
study large-scale group phenomena and behaviors. We replicate various social
phenomena, including information spreading, group polarization, and herd
effects across X and Reddit platforms. Moreover, we provide observations of
social phenomena at different agent group scales. We observe that a larger
agent group scale leads to stronger group dynamics and more diverse and
helpful agent opinions. These findings demonstrate OASIS's potential as a
powerful tool for studying complex systems in digital environments.
☆ Addressing Hallucinations in Language Models with Knowledge Graph Embeddings as an Additional Modality
In this paper we present an approach to reduce hallucinations in Large
Language Models (LLMs) by incorporating Knowledge Graphs (KGs) as an additional
modality. Our method involves transforming input text into a set of KG
embeddings and using an adapter to integrate these embeddings into the language
model space, without relying on external retrieval processes.
To facilitate this, we created WikiEntities, a dataset containing over 3
million Wikipedia texts annotated with entities from Wikidata and their
corresponding embeddings from PyTorch-BigGraph. This dataset serves as a
valuable resource for training Entity Linking models and adapting the described
method to various LLMs using specialized adapters.
Our method does not require fine-tuning of the language models themselves;
instead, we only train the adapter. This ensures that the model's performance
on other tasks is not affected. We trained an adapter for the Mistral 7B, LLaMA
2-7B (chat), and LLaMA 3-8B (instruct) models using this dataset and
demonstrated that our approach improves performance on the HaluEval, True-False
benchmarks and FEVER dataset. The results indicate that incorporating KGs as a
new modality can effectively reduce hallucinations and improve the factual
accuracy of language models, all without the need for external retrieval.
☆ Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin
The evolution of machine learning has increasingly prioritized the
development of powerful models and more scalable supervision signals. However,
the emergence of foundation models presents significant challenges in providing
effective supervision signals necessary for further enhancing their
capabilities. Consequently, there is an urgent need to explore novel
supervision signals and technical approaches. In this paper, we propose
verifier engineering, a novel post-training paradigm specifically designed for
the era of foundation models. The core of verifier engineering involves
leveraging a suite of automated verifiers to perform verification tasks and
deliver meaningful feedback to foundation models. We systematically categorize
the verifier engineering process into three essential stages: search, verify,
and feedback, and provide a comprehensive review of state-of-the-art research
developments within each stage. We believe that verifier engineering
constitutes a fundamental pathway toward achieving Artificial General
Intelligence.
☆ Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua
Recent advances in Large Vision-Language Models (LVLMs) have showcased strong
reasoning abilities across multiple modalities, achieving significant
breakthroughs in various real-world applications. Despite this great success,
the safety guardrail of LVLMs may not cover the unforeseen domains introduced
by the visual modality. Existing studies primarily focus on eliciting LVLMs to
generate harmful responses via carefully crafted image-based jailbreaks
designed to bypass alignment defenses. In this study, we reveal that a safe
image can be exploited to achieve the same jailbreak consequence when combined
with additional safe images and prompts. This stems from two fundamental
properties of LVLMs: universal reasoning capabilities and safety snowball
effect. Building on these insights, we propose Safety Snowball Agent (SSA), a
novel agent-based framework leveraging agents' autonomous and tool-using
abilities to jailbreak LVLMs. SSA operates through two principal stages: (1)
initial response generation, where tools generate or retrieve jailbreak images
based on potential harmful intents, and (2) harmful snowballing, where refined
subsequent prompts induce progressively harmful outputs. Our experiments
demonstrate that SSA can use nearly any image to induce LVLMs to produce
unsafe content, achieving high jailbreak success rates against the latest
LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the
inherent properties of LVLMs, presenting a profound challenge for enforcing
safety in generative multimodal systems. Our code is available at
https://github.com/gzcch/Safety_Snowball_Agent.
☆ Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts
The rapid advancement of Vision-Language Models (VLMs) has expanded
multimodal applications, yet evaluations often focus on basic tasks like object
recognition, overlooking abstract aspects such as personalities and values. To
address this gap, we introduce Value-Spectrum, a visual question-answering
benchmark aimed at assessing VLMs based on Schwartz's value dimensions, which
capture core values guiding people's beliefs and actions across cultures. We
constructed a vectorized database of over 50,000 short videos sourced from
TikTok, YouTube Shorts, and Instagram Reels, covering multiple months and a
wide array of topics such as family, health, hobbies, society, and technology.
We also developed a VLM agent pipeline to automate video browsing and analysis.
Benchmarking representative VLMs on Value-Spectrum reveals significant
differences in their responses to value-oriented content, with most models
exhibiting a preference for hedonistic topics. Beyond identifying natural
preferences, we explored the ability of VLM agents to adopt specific personas
when explicitly prompted, revealing insights into the models' adaptability in
role-playing scenarios. These findings highlight the potential of
Value-Spectrum as a comprehensive evaluation set for tracking VLM advancements
in value-based tasks and for developing more sophisticated role-playing AI
agents.
☆ Re-examining learning linear functions in context
In-context learning (ICL) is an attractive method of solving a wide range of
problems. Inspired by Garg et al. (2022), we look closely at ICL in a variety
of train and test settings for several transformer models of different sizes
trained from scratch. Our study complements prior work by pointing out several
systematic failures of these models to generalize to data not in the training
distribution, thereby showing some limitations of ICL. We find that models
adopt a strategy for this task that is very different from standard solutions.
☆ Causal Effect of Group Diversity on Redundancy and Coverage in Peer-Reviewing
A large host of scientific journals and conferences solicit peer reviews from
multiple reviewers for the same submission, aiming to gather a broader range of
perspectives and mitigate individual biases. In this work, we reflect on the
role of diversity in the slate of reviewers assigned to evaluate a submitted
paper as a factor in diversifying perspectives and improving the utility of the
peer-review process. We propose two measures for assessing review utility:
review coverage -- reviews should cover most contents of the paper -- and
review redundancy -- reviews should add information not already present in
other reviews. We hypothesize that reviews from diverse reviewers will exhibit
high coverage and low redundancy. We conduct a causal study of different
measures of reviewer diversity on review coverage and redundancy using
observational data from a peer-reviewed conference with approximately 5,000
submitted papers. Our study reveals disparate effects of different diversity
measures on review coverage and redundancy. Our study finds that assigning a
group of reviewers that are topically diverse, have different seniority levels,
or have distinct publication networks leads to broader coverage of the paper or
review criteria, but we find no evidence of an increase in coverage for
reviewer slates with reviewers from diverse organizations or geographical
locations. Reviewers from different organizations, seniority levels, topics, or
publication networks (all except geographical diversity) lead to a decrease in
redundancy in reviews. Furthermore, publication network-based diversity alone
also helps bring in varying perspectives (that is, low redundancy), even within
specific review criteria. Our study adopts a group decision-making perspective
for reviewer assignments in peer review and suggests dimensions of diversity
that can help guide the reviewer assignment process.
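To give a feel for the two utility measures named in this abstract, the sketch below treats a paper and each review as a set of topic labels. This set-based formulation is an assumption made for illustration; the abstract does not give the paper's actual definitions, and both function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def review_coverage(paper_topics: Set[str], reviews: List[Set[str]]) -> float:
    """Fraction of the paper's topics touched by at least one review."""
    if not paper_topics:
        raise ValueError("paper_topics must be non-empty")
    covered = set().union(*reviews) & paper_topics
    return len(covered) / len(paper_topics)


def review_redundancy(reviews: List[Set[str]]) -> float:
    """Mean pairwise Jaccard overlap between reviews (0 if fewer than two)."""
    pairs = list(combinations(reviews, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)
```

Under this toy formulation, the hypothesis in the abstract corresponds to diverse reviewer slates pushing `review_coverage` up and `review_redundancy` down.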
★ Membership Inference Attack against Long-Context Large Language Models
Recent advances in Large Language Models (LLMs) have enabled them to overcome
their context window limitations, and demonstrate exceptional retrieval and
reasoning capacities on longer context. Question-answering systems augmented
with Long-Context Language Models (LCLMs) can automatically search massive
external data and incorporate it into their contexts, enabling faithful
predictions and reducing issues such as hallucinations and knowledge staleness.
Existing studies targeting LCLMs mainly concentrate on addressing the so-called
lost-in-the-middle problem or improving the inference efficiency, leaving their
privacy risks largely unexplored. In this paper, we aim to bridge this gap and
argue that integrating all information into the long context makes it a
repository of sensitive information, which often contains private data such as
medical records or personal identities. We further investigate the membership
privacy within LCLMs' external context, with the aim of determining whether a
given document or sequence is included in the LCLMs' context. Our basic idea is
that if a document lies in the context, it will exhibit a low generation loss
or a high degree of semantic similarity to the contents generated by LCLMs. We
propose, for the first time, six membership inference attack (MIA) strategies
tailored for LCLMs and conduct extensive experiments on various popular models.
Empirical results demonstrate that our attacks can accurately infer membership
status in most cases, e.g., 90.66% attack F1-score on Multi-document QA
datasets with LongChat-7b-v1.5-32k, highlighting significant risks of
membership leakage within LCLMs' input contexts. Furthermore, we examine the
underlying reasons why LCLMs are susceptible to revealing such membership
information.
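The loss-based intuition stated in this abstract (documents in the context tend to have low generation loss) can be sketched as a simple threshold attack. This is a generic illustration, not one of the paper's six strategies; the calibration rule and function names are hypothetical, and the losses are assumed to have been computed by the target model beforehand.

```python
from typing import List, Sequence


def calibrate_threshold(
    member_losses: Sequence[float], nonmember_losses: Sequence[float]
) -> float:
    """Midpoint between mean member loss and mean non-member loss."""
    if not member_losses or not nonmember_losses:
        raise ValueError("both calibration sets must be non-empty")
    mean_in = sum(member_losses) / len(member_losses)
    mean_out = sum(nonmember_losses) / len(nonmember_losses)
    return (mean_in + mean_out) / 2.0


def infer_membership(losses: Sequence[float], threshold: float) -> List[bool]:
    """Predict 'in context' for every document whose loss is below the threshold."""
    return [loss < threshold for loss in losses]
```

The attacker needs only loss values for candidate documents, which is why exposing token-level likelihoods can itself be a privacy risk.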
☆ Rethinking Thinking Tokens: Understanding Why They Underperform in Practice
Thinking Tokens (TT) have been proposed as an unsupervised method to
facilitate reasoning in language models. However, despite their conceptual
appeal, our findings show that TTs marginally improve performance and
consistently underperform compared to Chain-of-Thought (CoT) reasoning across
multiple benchmarks. We hypothesize that this underperformance stems from the
reliance on a single embedding for TTs, which results in inconsistent learning
signals and introduces noisy gradients. This paper provides a comprehensive
empirical analysis to validate this hypothesis and discusses the implications
for future research on unsupervised reasoning in LLMs.
☆ MAIRA-Seg: Enhancing Radiology Report Generation with Segmentation-Aware Multimodal Large Language Models ML4H 2024
Harshita Sharma, Valentina Salvatelli, Shaury Srivastav, Kenza Bouzid, Shruthi Bannur, Daniel C. Castro, Maximilian Ilse, Sam Bond-Taylor, Mercy Prasanna Ranjit, Fabian Falck, Fernando Pérez-García, Anton Schwaighofer, Hannah Richardson, Maria Teodora Wetscherek, Stephanie L. Hyland, Javier Alvarez-Valle
There is growing interest in applying AI to radiology report generation,
particularly for chest X-rays (CXRs). This paper investigates whether
incorporating pixel-level information through segmentation masks can improve
fine-grained image interpretation of multimodal large language models (MLLMs)
for radiology report generation. We introduce MAIRA-Seg, a segmentation-aware
MLLM framework designed to utilize semantic segmentation masks alongside CXRs
for generating radiology reports. We train expert segmentation models to obtain
mask pseudolabels for radiology-specific structures in CXRs. Subsequently,
building on the architectures of MAIRA, a CXR-specialised model for report
generation, we integrate a trainable segmentation tokens extractor that
leverages these mask pseudolabels, and employ mask-aware prompting to generate
draft radiology reports. Our experiments on the publicly available MIMIC-CXR
dataset show that MAIRA-Seg outperforms non-segmentation baselines. We also
investigate set-of-marks prompting with MAIRA and find that MAIRA-Seg
consistently demonstrates comparable or superior performance. The results
confirm that using segmentation masks enhances the nuanced reasoning of MLLMs,
potentially contributing to better clinical outcomes.
comment: Accepted as Proceedings Paper at ML4H 2024
☆ Mitigating Knowledge Conflicts in Language Model-Driven Question Answering
Knowledge-aware sequence to sequence generation tasks such as document
question answering and abstract summarization typically require two types of
knowledge: encoded parametric knowledge and retrieved contextual information.
Previous work shows that improper correlation between parametric knowledge and
answers in the training set could cause the model to ignore input information at
test time, resulting in undesirable model behaviour such as over-stability and
hallucination. In this work, we argue that hallucination could be mitigated via
explicit correlation between input source and generated content. We focus on a
typical example of hallucination, entity-based knowledge conflicts in question
answering, where correlation of entities and their description at training time
hinders model behaviour during inference.
☆ Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation
Peng Shu, Junhao Chen, Zhengliang Liu, Hui Wang, Zihao Wu, Tianyang Zhong, Yiwei Li, Huaqin Zhao, Hanqi Jiang, Yi Pan, Yifan Zhou, Constance Owl, Xiaoming Zhai, Ninghao Liu, Claudio Saunt, Tianming Liu
Large Language Models (LLMs) have demonstrated remarkable success across a
wide range of tasks and domains. However, their performance in low-resource
language translation, particularly when translating into these languages,
remains underexplored. This gap poses significant challenges, as linguistic
barriers hinder the cultural preservation and development of minority
communities. To address this issue, this paper introduces a novel
retrieval-based method that enhances translation quality for low-resource
languages by focusing on key terms, which involves translating keywords and
retrieving corresponding examples from existing data. To evaluate the
effectiveness of this method, we conducted experiments translating from English
into three low-resource languages: Cherokee, a critically endangered indigenous
language of North America; Tibetan, a historically and culturally significant
language in Asia; and Manchu, a language with few remaining speakers. Our
comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B
highlights the significant challenges these models face when translating into
low-resource languages. In contrast, our retrieval-based method shows promise
in improving both word-level accuracy and overall semantic understanding by
leveraging existing resources more effectively.
☆ LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Creating high-quality, large-scale datasets for large language models (LLMs)
often relies on resource-intensive, GPU-accelerated models for quality
filtering, making the process time-consuming and costly. This dependence on
GPUs limits accessibility for organizations lacking significant computational
infrastructure. To address this issue, we introduce the Lightweight,
Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs
to streamline the processes of dataset extraction, filtering, and curation.
Based on our four core principles, the LP Data Pipeline significantly reduces
preparation time and cost while maintaining high data quality. Importantly, our
pipeline enables the creation of purpose-driven datasets tailored to specific
domains and languages, enhancing the applicability of LLMs in specialized
contexts. We anticipate that our pipeline will lower the barriers to LLM
development, enabling a wide range of organizations to access LLMs more easily.
☆ VersaTune: Fine-Tuning Multi-Ability LLMs Efficiently
Keer Lu, Keshi Zhao, Zheng Liang, Da Pan, Shusen Zhang, Xin Wu, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
Large Language Models (LLMs) exhibit remarkable capabilities in handling
multiple tasks across domains due to their emergent properties. These
capabilities are further augmented during the Supervised Fine-Tuning (SFT)
phase. Despite this potential, existing work mainly focuses on domain-specific
enhancements during fine-tuning, which risk catastrophic forgetting of
knowledge in other domains. In this study, we introduce
VersaTune, a novel data composition framework designed for enhancing LLMs'
overall multi-ability performance during fine-tuning. We categorize knowledge
into distinct domains including law, medicine, finance, science, and code. We begin
with detecting the distribution of domain-specific knowledge within the base
model, followed by the composition of training data that aligns with the
model's existing knowledge distribution. During the fine-tuning process,
weights of different domains are dynamically adjusted based on their learnable
potential and forgetting degree. Experimental results demonstrate that
VersaTune achieves significant improvements in multi-domain performance, with a
35.21% enhancement in comprehensive multi-domain tasks. Additionally, in
scenarios where specific domain optimization is required, VersaTune reduces the
degradation of performance in other domains by 38.77%, without compromising the
target domain's training efficacy.
☆ Large corpora and large language models: a replicable method for automating grammatical annotation
Much linguistic research relies on annotated datasets of features extracted
from text corpora, but the rapid quantitative growth of these corpora has
created practical difficulties for linguists to manually annotate large data
samples. In this paper, we present a replicable, supervised method that
leverages large language models for assisting the linguist in grammatical
annotation through prompt engineering, training, and evaluation. We introduce a
methodological pipeline applied to the case study of formal variation in the
English evaluative verb construction 'consider X (as) (to be) Y', based on the
large language model Claude 3.5 Sonnet and corpus data from Davies' NOW and
EnTenTen21 (SketchEngine). Overall, we reach a model accuracy of over 90% on
our held-out test samples with only a small amount of training data, validating
the method for the annotation of very large quantities of tokens of the
construction in the future. We discuss the generalisability of our results for
a wider range of case studies of grammatical constructions and grammatical
variation and change, underlining the value of AI copilots as tools for future
linguistic research.
☆ ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification PRICAI 2024
In this paper, we propose ZeFaV, a zero-shot fact-checking verification
framework that enhances the performance of large language models on the fact
verification task. ZeFaV leverages the in-context learning ability of large
language models to extract the relations among the entities within a claim,
reorganizes the information from the evidence into a relationally logical form,
and combines this information with the original evidence to generate the
context from which our fact-checking model provides verdicts for the input
claims. We conducted empirical experiments to evaluate our approach on two
multi-hop fact-checking datasets, HoVer and FEVEROUS, and achieved
promising results comparable to other state-of-the-art fact
verification methods.
comment: This pre-print has been published in PRICAI 2024: Trends in
Artificial Intelligence. The published version is available at
https://doi.org/10.1007/978-981-96-0119-6_28
☆ MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis
Yingjie Zhou, Zicheng Zhang, Jiezhang Cao, Jun Jia, Yanwei Jiang, Farong Wen, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Artificial Intelligence (AI) has demonstrated significant capabilities in
various fields, and in areas such as human-computer interaction (HCI), embodied
intelligence, and the design and animation of virtual digital humans, both
practitioners and users are increasingly concerned with AI's ability to
understand and express emotion. Consequently, the question of whether AI can
accurately interpret human emotions remains a critical challenge. To date, two
primary classes of AI models have been involved in human emotion analysis:
generative models and Multimodal Large Language Models (MLLMs). To assess the
emotional capabilities of these two classes of models, this study introduces
MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each
depicting one of six different emotions, generated by 12 Text-to-Image (T2I)
models. Unlike previous works, MEMO-Bench provides a framework for evaluating
both T2I models and MLLMs in the context of sentiment analysis. Additionally, a
progressive evaluation approach is employed, moving from coarse-grained to
fine-grained metrics, to offer a more detailed and comprehensive assessment of
the sentiment analysis capabilities of MLLMs. The experimental results
demonstrate that existing T2I models are more effective at generating positive
emotions than negative ones. Meanwhile, although MLLMs show a certain degree of
effectiveness in distinguishing and recognizing human emotions, they fall short
of human-level accuracy, particularly in fine-grained emotion analysis. The
MEMO-Bench will be made publicly available to support further research in this
area.
♻ ☆ Toxicity of the Commons: Curating Open-Source Pre-Training Data
Open-source large language models are becoming increasingly available and
popular among researchers and practitioners. While significant progress has
been made on open-weight models, open training data is a practice yet to be
adopted by the leading open-weight model creators. At the same time,
researchers are working to make language models safer. We propose a data
curation pipeline to reduce harmful outputs by models trained on public domain
data. There are unique challenges to working with public domain data, as these
sources differ from web text in both form and content. Many sources are
historical documents and are the result of Optical Character Recognition (OCR).
Consequently, current state-of-the-art approaches to toxicity filtering are
often infeasible or inappropriate for open data models. In this paper, we
introduce a new fully open-source pipeline for open-data toxicity filtering.
Our contributions are threefold. We create a custom training dataset,
ToxicCommons, which is composed of texts which have been classified across five
different dimensions (racial/origin-based, gender/sex-based, religious,
ability-based discrimination, and violence). We use this dataset to train a
custom classifier, Celadon, that can be used to detect toxic content in open
data more efficiently at a larger scale. Finally, we describe the balanced
approach to content filtration that optimizes safety filtering with respect to
the filtered data available for training.
♻ ☆ A Perspective for Adapting Generalist AI to Specialized Medical AI Applications and Their Challenges
Zifeng Wang, Hanyin Wang, Benjamin Danek, Ying Li, Christina Mack, Hoifung Poon, Yajuan Wang, Pranav Rajpurkar, Jimeng Sun
The integration of Large Language Models (LLMs) into medical applications has
sparked widespread interest across the healthcare industry, from drug discovery
and development to clinical decision support, assisting telemedicine, medical
devices, and healthcare insurance applications. This perspective paper aims to
discuss the inner workings of building LLM-powered medical AI applications and
introduces a comprehensive framework for their development. We review existing
literature and outline the unique challenges of applying LLMs in specialized
medical contexts. Additionally, we introduce a three-step framework to organize
medical LLM research activities: 1) Modeling: breaking down complex medical
workflows into manageable steps for developing medical-specific models; 2)
Optimization: optimizing the model performance with crafted prompts and
integrating external knowledge and tools; and 3) System engineering:
decomposing complex tasks into subtasks and leveraging human expertise for
building medical AI applications. Furthermore, we offer a detailed use case
playbook that describes various LLM-powered medical AI applications, such as
optimizing clinical trial design, enhancing clinical decision support, and
advancing medical imaging analysis. Finally, we discuss various challenges and
considerations for building medical AI applications with LLMs, such as handling
hallucination issues, data ownership and compliance, privacy, intellectual
property considerations, compute cost, sustainability issues, and responsible
AI requirements.
♻ ☆ Watermark-based Detection and Attribution of AI-Generated Content
Several companies have deployed watermark-based detection to identify
AI-generated content. However, attribution--the ability to trace back to the
user of a generative AI (GenAI) service who created a given piece of
AI-generated content--remains largely unexplored despite its growing
importance. In this work, we aim to bridge this gap by conducting the first
systematic study on watermark-based, user-level attribution of AI-generated
content. Our key idea is to assign a unique watermark to each user of the GenAI
service and embed this watermark into the AI-generated content created by that
user. Attribution is then performed by identifying the user whose watermark
best matches the one extracted from the given content. This approach, however,
faces a key challenge: How should watermarks be selected for users to maximize
attribution performance? To address the challenge, we first theoretically
derive lower bounds on detection and attribution performance through rigorous
probabilistic analysis for any given set of user watermarks. Then, we select
watermarks for users to maximize these lower bounds, thereby optimizing
detection and attribution performance. Our theoretical and empirical results
show that watermark-based attribution inherits both the accuracy and
(non-)robustness properties of the underlying watermark. Specifically,
attribution remains highly accurate when the watermarked AI-generated content
is either not post-processed or subjected to common post-processing such as
JPEG compression, as well as black-box adversarial post-processing with limited
query budgets.
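The attribution step described in the abstract above (pick the user whose watermark best matches the one extracted from the content) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bit-string representation, the bitwise-accuracy matching score, the `threshold` value of 0.8, and the names `bitwise_accuracy` and `attribute` are all assumptions made for the example.

```python
def bitwise_accuracy(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings agree."""
    if len(a) != len(b):
        raise ValueError("watermarks must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def attribute(extracted: str, user_watermarks: dict, threshold: float = 0.8):
    """Return (user, score) for the best-matching watermark.

    The user is None when even the best match falls below the (assumed)
    detection threshold, i.e. the content is treated as not watermarked.
    """
    best_user, best_score = None, -1.0
    for user, watermark in user_watermarks.items():
        score = bitwise_accuracy(extracted, watermark)
        if score > best_score:
            best_user, best_score = user, score
    if best_score < threshold:
        return None, best_score
    return best_user, best_score


# Hypothetical 8-bit watermarks; real watermarks would be much longer.
users = {"alice": "10110010", "bob": "01001101"}

print(attribute("10110110", users))  # one bit flipped from alice's watermark -> ('alice', 0.875)
print(attribute("00000000", users))  # matches neither user well -> (None, 0.5)
```

Under this matching rule, watermarks that are far apart from each other leave more room for bit errors before two users become confusable, which is one intuition for why the choice of user watermarks matters.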
♻ ☆ AgentSquare: Automatic LLM Agent Search in Modular Design Space
Recent advancements in Large Language Models (LLMs) have led to a rapid
growth of agentic systems capable of handling a wide range of complex tasks.
However, current research largely relies on manual, task-specific design,
limiting their adaptability to novel tasks. In this paper, we introduce a new
research problem: Modularized LLM Agent Search (MoLAS). We propose a modular
design space that abstracts existing LLM agent designs into four fundamental
modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory.
Building on this design space, we present a novel LLM agent search framework
called AgentSquare, which introduces two core mechanisms, i.e., module
evolution and recombination, to efficiently search for optimized LLM agents. To
further accelerate the process, we design a performance predictor that uses
in-context surrogate models to skip unpromising agent designs. Extensive
experiments across six benchmarks, covering the diverse scenarios of web,
embodied, tool use and game applications, show that AgentSquare substantially
outperforms hand-crafted agents, achieving an average performance gain of 17.2%
against best-known human designs. Moreover, AgentSquare can generate
interpretable design insights, enabling a deeper understanding of agentic
architecture and its impact on task performance. We believe that the modular
design space and AgentSquare search framework offer a platform for fully
exploiting the potential of prior successful designs and consolidating the
collective efforts of the research community. The code repository is available at
https://github.com/tsinghua-fib-lab/AgentSquare.
comment: 26 pages
♻ ☆ Fine-Tuning a Time Series Foundation Model with Wasserstein Loss
Inspired by recent advancements in large language models (LLMs) for Natural
Language Processing (NLP), there has been a surge in research focused on
developing foundational models for time series forecasting. One approach
involves training LLM architectures on tokenized time series data using
cross-entropy loss. Although this method has demonstrated promising results,
cross-entropy loss is primarily designed for classification tasks and does not
account for the distance between classes. To address this limitation, we
propose using the Wasserstein loss for such architectures. To validate our
approach, we fine-tuned a foundational time series model on 22 zero-shot
datasets, comparing the performance of cross-entropy loss with that of
Wasserstein loss. Our results demonstrate that replacing cross-entropy loss
with Wasserstein loss significantly improves point estimation.
comment: 4 main pages; 2 figures
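To illustrate the point in the abstract above that cross-entropy ignores the distance between classes, the sketch below compares it with a 1-Wasserstein loss over ordered bins. It assumes unit spacing between bins and a one-hot target, and uses the standard closed form for the one-dimensional case (the sum of absolute differences between the two cumulative distributions). The exact formulation used in the paper is not given in the abstract, and the function names here are hypothetical.

```python
import math
from itertools import accumulate


def cross_entropy(probs, target_bin):
    """Negative log-probability assigned to the true bin."""
    return -math.log(probs[target_bin])


def wasserstein_1(probs, target_bin):
    """W1 distance between a predicted distribution over ordered bins and a
    one-hot target, assuming unit spacing between adjacent bins."""
    cdf_pred = list(accumulate(probs))
    cdf_true = [1.0 if i >= target_bin else 0.0 for i in range(len(probs))]
    return sum(abs(p - t) for p, t in zip(cdf_pred, cdf_true))


target = 1
near_miss = [0.0, 0.1, 0.9, 0.0, 0.0]  # most mass one bin away from the target
far_miss = [0.0, 0.1, 0.0, 0.0, 0.9]   # most mass three bins away

# Cross-entropy cannot tell the two predictions apart ...
print(cross_entropy(near_miss, target) == cross_entropy(far_miss, target))  # True
# ... while the Wasserstein loss penalises the far miss more heavily.
print(wasserstein_1(near_miss, target) < wasserstein_1(far_miss, target))   # True
```

Both predictions put probability 0.1 on the true bin, so their cross-entropy is identical, whereas the Wasserstein values are roughly 0.9 and 2.7.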
♻ ☆ Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents
This paper investigates the presence of OCR-sensitive neurons within the
Transformer architecture and their influence on named entity recognition (NER)
performance on historical documents. By analysing neuron activation patterns in
response to clean and noisy text inputs, we identify and then neutralise
OCR-sensitive neurons to improve model performance. Based on two open access
large language models (Llama2 and Mistral), experiments demonstrate the
existence of OCR-sensitive regions and show improvements in NER performance on
historical newspapers and classical commentaries, highlighting the potential of
targeted neuron modulation to improve models' performance on noisy text.
♻ ☆ BeeManc at the PLABA Track of TAC-2024: RoBERTa for task 1 -- LLaMA3.1 and GPT-4o for task 2
This report is the system description of the BeeManc team for the shared task
Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024. This report
contains two sections corresponding to the two sub-tasks in PLABA 2024. In task
one, we applied fine-tuned RoBERTa-Base models to identify and classify the
difficult terms, jargon and acronyms in the biomedical abstracts and reported
the F1 score. Due to time constraints, we didn't finish the replacement task.
In task two, we leveraged LLaMA3.1-70B-Instruct and GPT-4o with one-shot
prompts to complete the abstract adaptation and reported the scores in BLEU,
SARI, BERTScore, LENS, and SALSA. In the official evaluation from PLABA-2024
on Task 1A and 1B, our much smaller fine-tuned RoBERTa-Base model
ranked 3rd and 2nd respectively on the two sub-tasks, and 1st on
averaged F1 scores across the two tasks among 9 evaluated systems. Our
LLaMA-3.1-70B-instructed model achieved the highest Completeness score
for Task-2. We share our fine-tuned models and related resources at
https://github.com/HECTA-UoM/PLABA2024
comment: ongoing work - system report
♻ ☆ Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion WACV 2025
We introduce NOVIC, an innovative real-time uNconstrained Open Vocabulary
Image Classifier that uses an autoregressive transformer to generatively output
classification labels as language. Leveraging the extensive knowledge of CLIP
models, NOVIC harnesses the embedding space to enable zero-shot transfer from
pure text to images. Traditional CLIP models, despite their ability for open
vocabulary classification, require an exhaustive prompt of potential class
labels, restricting their application to images of known content or context. To
address this, we propose an "object decoder" model that is trained on a
large-scale 92M-target dataset of templated object noun sets and LLM-generated
captions to always output the object noun in question. This effectively inverts
the CLIP text encoder and allows textual object labels from essentially the
entire English language to be generated directly from image-derived embedding
vectors, without requiring any a priori knowledge of the potential content of
an image, and without any label biases. The trained decoders are tested on a
mix of manually and web-curated datasets, as well as standard image
classification benchmarks, and achieve fine-grained prompt-free prediction
scores of up to 87.5%, a strong result considering the model must work for any
conceivable image and without any contextual clues.
comment: Published at WACV 2025
♻ ☆ Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers ICML 2024
A central question in multilingual language modeling is whether large
language models (LLMs) develop a universal concept representation, disentangled
from specific languages. In this paper, we address this question by analyzing
latent representations (latents) during a word translation task in
transformer-based LLMs. We strategically extract latents from a source
translation prompt and insert them into the forward pass on a target
translation prompt. By doing so, we find that the output language is encoded in
the latent at an earlier layer than the concept to be translated. Building on
this insight, we conduct two key experiments. First, we demonstrate that we can
change the concept without changing the language and vice versa through
activation patching alone. Second, we show that patching with the mean over
latents across different languages does not impair and instead improves the
models' performance in translating the concept. Our results provide evidence
for the existence of language-agnostic concept representations within the
investigated models.
comment: 12 pages, 10 figures, previous version published under the title "How
Do Llamas Process Multilingual Text? A Latent Exploration through Activation
Patching" at the ICML 2024 mechanistic interpretability workshop at
https://openreview.net/forum?id=0ku2hIm4BS
♻ ★ BertaQA: How Much Do Language Models Know About Local Culture?
Large Language Models (LLMs) exhibit extensive knowledge about the world, but
most evaluations have been limited to global or anglocentric subjects. This
raises the question of how well these models perform on topics relevant to
other cultures, whose presence on the web is not that prominent. To address
this gap, we introduce BertaQA, a multiple-choice trivia dataset that is
parallel in English and Basque. The dataset consists of a local subset with
questions pertinent to the Basque culture, and a global subset with questions
of broader interest. We find that state-of-the-art LLMs struggle with local
cultural knowledge, even as they excel on global topics. However, we show that
continued pre-training in Basque significantly improves the models' performance
on Basque culture, even when queried in English. To our knowledge, this is the
first solid evidence of knowledge transfer from a low-resource to a
high-resource language. Our analysis sheds light on the complex interplay
between language and knowledge, and reveals that some prior findings do not
fully hold when reassessed on local topics. Our dataset and evaluation code are
available under open licenses at https://github.com/juletx/BertaQA.
comment: NeurIPS Datasets & Benchmarks 2024
♻ ☆ Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach
Stylometry aims to distinguish authors by analyzing literary traits assumed
to reflect semi-conscious choices distinct from elements like genre or theme.
However, these components often overlap, complicating text classification based
solely on feature distributions. While some literary properties, such as
thematic content, are likely to manifest as correlations between adjacent text
units, others, like authorial style, may be independent thereof. We introduce a
hypothesis-testing approach to evaluate the influence of sequentially
correlated literary properties on text classification, aiming to determine when
these correlations drive classification. Using a multivariate binary
distribution, our method models sequential correlations between text units as a
stochastic process, assessing the likelihood of clustering across varying
adjacency scales. This enables us to examine whether classification is
dominated by sequentially correlated properties or remains independent. In
experiments on a diverse English prose corpus, our analysis integrates
traditional and neural embeddings within supervised and unsupervised
frameworks. Results demonstrate that our approach effectively identifies when
textual classification is not primarily influenced by sequentially correlated
literary properties, particularly in cases where texts differ in authorial
style or genre rather than by a single author within a similar genre.
♻ ☆ Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs)
to refuse to answer unknown questions. By modifying responses of unknown
questions in the training data to refusal responses such as "I don't know",
RAIT enhances the reliability of LLMs and reduces their hallucination.
Generally, RAIT modifies training samples based on the correctness of the
initial LLM's response. However, this crude approach can cause LLMs to
excessively refuse to answer questions they could have answered correctly, a
problem we call over-refusal. In this paper, we explore two primary causes of
over-refusal: Static conflict occurs when similar samples within the LLM's
feature space receive differing supervision signals (original vs. modified "I
don't know"). Dynamic conflict, on the other hand, emerges as the LLM's
knowledge evolves during SFT, allowing it to answer questions that were
previously unanswerable. Yet, these now-answerable training samples still
retain the original "I don't know" supervision signals based on the initial LLM
state, resulting in inconsistencies. These conflicts cause the trained LLM to
misclassify known questions as unknown, resulting in over-refusal. To address
this issue, we introduce Certainty Represented Knowledge Flow for Refusal-Aware
Instruction Tuning (CRaFT). CRaFT centers on two main contributions: First, we
additionally incorporate response certainty to selectively filter and modify
data, reducing static conflicts. Second, we implement preliminary rehearsal
training to characterize changes in the LLM's knowledge state, which helps
mitigate dynamic conflicts during the fine-tuning process. We conducted
extensive experiments on open-ended question answering and multiple-choice
question tasks. Experimental results show that CRaFT can improve LLMs' overall
performance during the RAIT process. Source code and training data will be
released on GitHub.
comment: Equal contribution: Runchuan Zhu, Zhipeng Ma, Jiang Wu; Corresponding
author: Conghui He
♻ ☆ A Complete Survey on LLM-based AI Chatbots
The past few decades have witnessed an upsurge in data, forming the
foundation for data-hungry, learning-based AI technology. Conversational
agents, often referred to as AI chatbots, rely heavily on such data to train
large language models (LLMs) and generate new content (knowledge) in response
to user prompts. With the advent of OpenAI's ChatGPT, LLM-based chatbots have
set new standards in the AI community. This paper presents a complete survey of
the evolution and deployment of LLM-based chatbots in various sectors. We first
summarize the development of foundational chatbots, followed by the evolution
of LLMs, and then provide an overview of LLM-based chatbots currently in use
and those in the development phase. Recognizing AI chatbots as tools for
generating new knowledge, we explore their diverse applications across various
industries. We then discuss the open challenges, considering how the data used
to train the LLMs and the misuse of the generated knowledge can cause several
issues. Finally, we explore the future outlook to augment their efficiency and
reliability in numerous applications. By addressing key milestones and the
present-day context of LLM-based chatbots, our survey invites readers to delve
deeper into this realm, reflecting on how their next generation will reshape
conversational AI.
comment: 23 pages, 10 figures
♻ ☆ Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
The recent advancements in large language models (LLMs) and pre-trained
vision models have accelerated the development of vision-language large models
(VLLMs), enhancing the interaction between visual and linguistic modalities.
Despite their notable success across various domains, VLLMs face challenges in
modality alignment, which can lead to issues like hallucinations and unsafe
content generation. Current alignment techniques often rely on coarse feedback
and external datasets, limiting scalability and performance. In this paper, we
propose FiSAO (Fine-Grained Self-Alignment Optimization), a novel
self-alignment method that utilizes the model's own visual encoder as a
fine-grained verifier to improve vision-language alignment without the need for
additional data. By leveraging token-level feedback from the vision encoder,
FiSAO significantly improves vision-language alignment, even surpassing
traditional preference tuning methods that require additional data. Through
both theoretical analysis and experimental validation, we demonstrate that
FiSAO effectively addresses the misalignment problem in VLLMs, marking the
first instance of token-level rewards being applied to such models.
comment: 23 pages
♻ ☆ Not Eliminate but Aggregate: Post-Hoc Control over Mixture-of-Experts to Address Shortcut Shifts in Natural Language Understanding
Recent models for natural language understanding are inclined to exploit
simple patterns in datasets, commonly known as shortcuts. These shortcuts hinge
on spurious correlations between labels and latent features existing in the
training data. At inference time, shortcut-dependent models are likely to
generate erroneous predictions under distribution shifts, particularly when
some latent features are no longer correlated with the labels. To avoid this,
previous studies have trained models to eliminate the reliance on shortcuts. In
this study, we explore a different direction: pessimistically aggregating the
predictions of a mixture-of-experts, assuming each expert captures relatively
different latent features. The experimental results demonstrate that our
post-hoc control over the experts significantly enhances the model's robustness
to the distribution shift in shortcuts. Besides, we show that our approach has
some practical advantages. We also analyze our model and provide results to
support the assumption.
comment: 21 pages, 5 figures (the layout differs from the MIT Press
publication version)
♻ ★ Exploring Context Window of Large Language Models via Decomposed Positional Vectors
Transformer-based large language models (LLMs) typically have a limited
context window, resulting in significant performance degradation when
processing text beyond the length of the context window. Many methods have
been proposed to extend the context window and achieve length extrapolation of
LLMs, but there is still a lack of in-depth interpretation of these approaches.
In this study, we explore the positional information within and beyond the
context window for deciphering the underlying mechanism of LLMs. By using a
mean-based decomposition method, we disentangle positional vectors from hidden
states of LLMs and analyze their formation and effect on attention.
Furthermore, when texts exceed the context window, we analyze the change of
positional vectors in two settings, i.e., direct extrapolation and context
window extension. Based on our findings, we design two training-free context
window extension methods, positional vector replacement and attention window
extension. Experimental results show that our methods can effectively extend
the context window length.
comment: Accepted by NeurIPS 2024 as a spotlight
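One plausible reading of the mean-based decomposition mentioned in the abstract above is that the positional vector at position t is the average of the hidden states at position t across many inputs, and the remainder carries the content-specific information. The sketch below illustrates that reading only. The layer, sample count, and exact procedure used in the paper are not stated in the abstract, and the function names and toy values are hypothetical.

```python
def positional_vectors(hidden):
    """hidden[s][t] is the hidden-state vector of sample s at position t.

    Returns one vector per position: the mean over samples, which keeps what
    is shared across inputs at that position (assumed to be positional).
    """
    n_samples = len(hidden)
    seq_len = len(hidden[0])
    dim = len(hidden[0][0])
    return [
        [sum(hidden[s][t][d] for s in range(n_samples)) / n_samples
         for d in range(dim)]
        for t in range(seq_len)
    ]


def remove_position(hidden, pos_vecs):
    """Subtract the positional vector from every hidden state."""
    return [
        [[h - p for h, p in zip(sample[t], pos_vecs[t])]
         for t in range(len(pos_vecs))]
        for sample in hidden
    ]


# Toy example: 2 samples, 2 positions, 2-dimensional hidden states.
hidden = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 0.0], [5.0, 8.0]],
]
pos = positional_vectors(hidden)
residual = remove_position(hidden, pos)
print(pos)       # [[2.0, 1.0], [4.0, 6.0]]
print(residual)  # [[[-1.0, 1.0], [-1.0, -2.0]], [[1.0, -1.0], [1.0, 2.0]]]
```

By construction, the residuals average to zero at every position, so whatever varies with position alone ends up in `pos`.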
♻ ☆ Towards Evaluating Large Language Models for Graph Query Generation SC
Large Language Models (LLMs) are revolutionizing the landscape of Generative
Artificial Intelligence (GenAI), with innovative LLM-backed solutions emerging
rapidly. However, when applied to database technologies, specifically query
generation for graph databases and Knowledge Graphs (KGs), LLMs still face
significant challenges. While research on LLM-driven query generation for
Structured Query Language (SQL) exists, similar systems for graph databases
remain underdeveloped. This paper presents a comparative study addressing the
challenge of generating queries in Cypher, a powerful language for interacting with
graph databases, using open-access LLMs. We rigorously evaluate several LLM
agents (OpenAI ChatGPT 4o, Claude Sonnet 3.5, Google Gemini Pro 1.5, and a
locally deployed Llama 3.1 8B) using a designed few-shot learning prompt and
Retrieval Augmented Generation (RAG) backed by Chain-of-Thoughts (CoT)
reasoning. Our empirical analysis of query generation accuracy reveals that
Claude Sonnet 3.5 outperforms its counterparts in this specific domain.
Further, we highlight promising future research directions to address the
identified limitations and advance LLM-driven query generation for graph
databases.
comment: Paper accepted and will be presented at CSCI2024 in December 2024;
it will later be published in Springer LNCS
♻ ★ Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts EMNLP 2024
Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Libo Qin, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che
Program of Thoughts (PoT) is an approach characterized by its executable
intermediate steps, which ensure the accuracy of the logical calculations in
the reasoning process. Currently, PoT primarily uses Python. However, relying
solely on a single language may result in suboptimal solutions and overlook the
potential benefits of other programming languages. In this paper, we conduct
comprehensive experiments on the programming languages used in PoT and find
that no single language consistently delivers optimal performance across all
tasks and models. The effectiveness of each language varies depending on the
specific scenarios. Inspired by this, we propose a task- and model-agnostic
approach called MultiPoT, which harnesses strength and diversity from various
languages. Experimental results reveal that it significantly outperforms Python
Self-Consistency. Furthermore, it achieves comparable or superior performance
compared to the best monolingual PoT in almost all tasks across all models. In
particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT
(gpt-3.5-turbo-0701).
comment: Accepted by EMNLP 2024. Code and data are released at
https://github.com/Luowaterbi/MultiPoT
♻ ☆ LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Memorization in large language models (LLMs) is a growing concern. LLMs have
been shown to easily reproduce parts of their training data, including
copyrighted work. This is an important problem to solve, as it may violate
existing copyright laws as well as the European AI Act. In this work, we
propose a systematic analysis to quantify the extent of potential copyright
infringements in LLMs using European law as an example. Unlike previous work,
we evaluate instruction-finetuned models in a realistic end-user scenario. Our
analysis builds on a proposed threshold of 160 characters, which we borrow from
the German Copyright Service Provider Act, and a fuzzy text matching algorithm
to identify potentially copyright-infringing textual reproductions. The
specificity of countermeasures against copyright infringement is analyzed by
comparing model behavior on copyrighted and public domain data. We investigate
what behaviors models show instead of producing protected text (such as refusal
or hallucination) and provide a first legal assessment of these behaviors. We
find that there are huge differences in copyright compliance, specificity, and
appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous
perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing
a particularly low absolute number of potential copyright violations. Code can
be found at https://github.com/felixbmuller/llms-memorization-copyright.
comment: 10 pages, 3 figures, AIES 2024 conference
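A rough sketch of the 160-character check described above. It swaps in the longest exact shared substring (via Python's standard difflib) for the paper's fuzzy matching algorithm, so it is stricter than the authors' method; the function names are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD_CHARS = 160  # threshold named in the abstract


def longest_shared_run(output, reference):
    """Length of the longest exact substring shared by both texts."""
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(reference))
    return match.size


def is_potential_reproduction(output, reference, threshold=THRESHOLD_CHARS):
    """Flag an output that shares a run of at least `threshold` characters."""
    return longest_shared_run(output, reference) >= threshold
```

A fuzzy matcher would additionally tolerate small edits inside the shared run, which this exact-match stand-in does not.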
♻ ★ Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation EMNLP2024
Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, Jingbo Zhu
With contributions from the open-source community, a vast amount of
instruction tuning (IT) data has emerged. Given the significant resource
allocation required for training and evaluating models, it is advantageous to
have an efficient method for selecting high-quality IT data. However, existing
methods for instruction data selection have limitations such as relying on
fragile external APIs, being affected by biases in GPT models, or reducing the
diversity of the selected instruction dataset. In this paper, we propose an
industrial-friendly, expert-aligned and diversity-preserved instruction data
selection method: Clustering and Ranking (CaR). CaR employs a two-step process:
first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model
aligned with expert preferences; second, it preserves dataset diversity through
clustering. In our experiment, CaR efficiently selected a mere 1.96% of
Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's
performance by an average of 32.1% in GPT-4 evaluations. Moreover, we find that
data selection is a consistent paradigm whether the pre-trained model is more
capable or the model parameters are scaled up. Our approach employs compact models
with 550M parameters and incurs just 11.2% of the financial outlay of current
methods, enhancing its industrial deployability.
comment: Accepted by EMNLP2024
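The two-step idea (rank by quality, then preserve diversity through clustering) can be sketched as below. This is an illustration under assumptions: scores and cluster labels are taken as given inputs, and the way the two steps are combined (a global top-n plus a few best items per cluster) is a guess at one reasonable reading, not the authors' exact procedure.

```python
def select_instructions(scores, cluster_ids, top_n, per_cluster):
    """Return sorted indices: the top_n best overall plus the best of each cluster."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:top_n])
    taken = {}
    # Visit items best-first so each cluster keeps its highest-scoring items.
    for i in ranked:
        cluster = cluster_ids[i]
        if taken.get(cluster, 0) < per_cluster:
            selected.add(i)
            taken[cluster] = taken.get(cluster, 0) + 1
    return sorted(selected)


# Hypothetical quality scores and cluster labels for five instruction pairs.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
cluster_ids = [0, 0, 0, 1, 2]
chosen = select_instructions(scores, cluster_ids, top_n=2, per_cluster=1)
```

In this toy input the low-scoring items 3 and 4 are still kept because they are the only members of their clusters, which is the diversity-preserving effect.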
♻ ★ Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond
Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen, Yue Zhang, Ren Wang, Xiaoshuang Shi, Kaidi Xu
Uncertainty estimation is crucial for the reliability of safety-critical
human and artificial intelligence (AI) interaction systems, particularly in the
domain of healthcare engineering. However, a robust and general uncertainty
measure for free-form answers has not been well-established in open-ended
medical question-answering (QA) tasks, where generative inequality introduces a
large number of irrelevant words and sequences within the generated set for
uncertainty quantification (UQ), which can lead to biases. This paper
introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at
both the word and sequence levels, considering semantic relevance. WSE
quantifies uncertainty in a way that is more closely aligned with the
reliability of LLMs during uncertainty quantification (UQ). We compare WSE with
six baseline methods on five free-form medical QA datasets, utilizing seven
popular large language models (LLMs). Experimental results demonstrate that WSE
exhibits superior performance in UQ under two standard criteria for correctness
evaluation. Additionally, in terms of real-world medical QA applications, the
performance of LLMs is significantly enhanced (e.g., a 6.36% improvement in
model accuracy on the COVID-QA dataset) by employing responses with lower
uncertainty that are identified by WSE as final answers, without any additional
task-specific fine-tuning or architectural modifications.
comment: Accepted by Engineering Applications of Artificial Intelligence
♻ ★ ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees EMNLP 2024
Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Xiaoshuang Shi, Kaidi Xu, Hengtao Shen, Xiaofeng Zhu
Uncertainty quantification (UQ) in natural language generation (NLG) tasks
remains an open challenge, exacerbated by the closed-source nature of the
latest large language models (LLMs). This study investigates applying conformal
prediction (CP), which can transform any heuristic uncertainty notion into
rigorous prediction sets, to black-box LLMs in open-ended NLG tasks. We
introduce a novel uncertainty measure based on self-consistency theory, and
then develop a conformal uncertainty criterion by integrating the uncertainty
condition aligned with correctness into the CP algorithm. Empirical evaluations
indicate that our uncertainty measure outperforms prior state-of-the-art
methods. Furthermore, we achieve strict control over the correctness coverage
rate utilizing 7 popular LLMs on 4 free-form NLG datasets, spanning
general-purpose and medical scenarios. Additionally, the calibrated prediction
sets with small size further highlight the efficiency of our method in
providing trustworthy guarantees for practical open-ended NLG applications.
comment: Accepted by EMNLP 2024 Findings
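For readers unfamiliar with conformal prediction, the generic split-conformal recipe can be sketched as follows. This is the textbook procedure, not ConU's specific uncertainty criterion; the nonconformity scores are assumed to be given, and the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]
```

Under exchangeability of calibration and test points, sets built this way contain the correct answer with probability at least 1 - alpha.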
♻ ☆ Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data
The semantic capabilities of language models (LMs) have the potential to
enable rich analytics and reasoning over vast knowledge corpora. Unfortunately,
existing systems lack high-level abstractions to perform bulk semantic queries
across large corpora. We introduce semantic operators, a declarative
programming interface that extends the relational model with composable
AI-based operations for bulk semantic queries (e.g., filtering, sorting,
joining or aggregating records using natural language criteria). Each operator
can be implemented and optimized in multiple ways, opening a rich space for
execution plans similar to relational operators. We implement our operators in
LOTUS, an open source query engine with a DataFrame API. Furthermore, we
develop several novel optimizations that take advantage of the declarative
nature of semantic operators to accelerate semantic filtering, clustering and
join operators by up to $400\times$ while offering statistical accuracy
guarantees. We demonstrate LOTUS' effectiveness on real AI applications
including fact-checking, extreme multi-label classification, and search. We
show that the semantic operator model is expressive, capturing state-of-the-art
AI pipelines in a few operator calls, and making it easy to express new
pipelines that achieve up to $180\%$ higher quality. Overall, LOTUS queries
match or exceed the accuracy of state-of-the-art AI pipelines for each task
while running up to 28$\times$ faster. LOTUS is publicly available at
https://github.com/stanford-futuredata/lotus.
♻ ☆ The why, what, and how of AI-based coding in scientific research
Computer programming (coding) is indispensable for researchers across
disciplines, yet it remains challenging to learn and time-consuming to carry
out. Generative AI, particularly large language models (LLMs), has the
potential to transform coding into intuitive conversations, but best practices
and effective workflows are only emerging. We dissect AI-based coding through
three key lenses: the nature and role of LLMs in coding (why), six types of
coding assistance they provide (what), and a five-step workflow in action with
practical implementation strategies (how). Additionally, we address the
limitations and future outlook of AI in coding. By offering actionable
insights, this framework helps to guide researchers in effectively leveraging
AI to enhance coding practices and education, accelerating scientific progress.
comment: 23 pages, 7 figures, 3 boxes
♻ ☆ Targeted Efficient Fine-tuning: Optimizing Parameter Updates with Data-Driven Sample Selection
Fine-tuning all parameters of Large Language Models (LLMs) is computationally
expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by
selectively fine-tuning specific parameters. Most PEFT methods center on
selecting or introducing a set of parameters to be fine-tuned. However, few
methods consider the impact of data samples on parameter selection.
Representative data-driven methods include the FISH Mask-based method, which
randomly selects a portion of data samples as a basis when selecting
parameters. However, this random data sample selection cannot select optimal
parameters for unstable data distributions. In this work, we introduce a
data-centric approach and propose
the Iterative Range Decreasing (IRD) algorithm to optimize the sample-parameter
pair selection in FISH Mask. IRD iteratively refines the selection by
identifying subsets of samples and parameters exhibiting higher Fisher
information. We demonstrate the effectiveness and rationality of the proposed
strategy by conducting experiments on the GLUE benchmark. Experimental results
show that our strategy optimizes the parameter selection and achieves preferable
performance over some typical baseline methods.
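A small sketch of the Fisher-information-based parameter selection that this line of work builds on. It assumes the common empirical diagonal approximation (mean squared per-sample gradient) and a top-k mask; it does not implement the IRD refinement, and all names and numbers are hypothetical.

```python
def fisher_scores(per_sample_grads):
    """Empirical diagonal Fisher: mean squared gradient for each parameter."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[j] ** 2 for g in per_sample_grads) / n for j in range(dim)]


def top_k_mask(scores, k):
    """Return a 0/1 mask marking the k parameters with the highest scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = set(order[:k])
    return [1 if j in keep else 0 for j in range(len(scores))]


# Hypothetical gradients: two samples, three parameters.
grads = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
scores = fisher_scores(grads)
mask = top_k_mask(scores, k=2)
```

Which samples feed `per_sample_grads` changes the scores and hence the mask, which is the dependence on data selection that the abstract targets.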
♻ ☆ A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification KDD 2025
In the e-commerce domain, the accurate extraction of attribute-value pairs
(e.g., Brand: Apple) from product titles and user search queries is crucial for
enhancing search and recommendation systems. A major challenge with neural
models for this task is the lack of high-quality training data, as the
annotations for attribute-value pairs in the available datasets are often
incomplete. To address this, we introduce GenToC, a model designed for training
directly with partially-labeled data, eliminating the necessity for a fully
annotated dataset. GenToC employs a marker-augmented generative model to
identify potential attributes, followed by a token classification model that
determines the associated values for each attribute. GenToC outperforms
existing state-of-the-art models, exhibiting up to a 56.3% increase in the number
of accurate extractions. Furthermore, we utilize GenToC to regenerate the
training dataset to expand attribute-value annotations. This bootstrapping
substantially improves the data quality for training other standard NER models,
which are typically faster but less capable in handling partially-labeled data,
enabling them to achieve comparable performance to GenToC. Our results
demonstrate GenToC's unique ability to learn from a limited set of
partially-labeled data and improve the training of more efficient models,
advancing the automated extraction of attribute-value pairs. Finally, our model
has been successfully integrated into IndiaMART, India's largest B2B e-commerce
platform, achieving a significant increase of 20.2% in the number of correctly
identified attribute-value pairs over the existing deployed system while
achieving a high precision of 89.5%.
comment: Accepted to KDD 2025 ADS Track
♻ ☆ Enhancing High-order Interaction Awareness in LLM-based Recommender Model EMNLP 2024
Large language models (LLMs) have demonstrated prominent reasoning
capabilities in recommendation tasks by transforming them into text-generation
tasks. However, existing approaches either disregard or ineffectively model the
user-item high-order interactions. To this end, this paper presents an enhanced
LLM-based recommender (ELMRec). We enhance whole-word embeddings to
substantially improve LLMs' interpretation of graph-constructed interactions
for recommendations, without requiring graph pre-training. This finding may
inspire endeavors to incorporate rich knowledge graphs into LLM-based
recommenders via whole-word embedding. We also found that LLMs often recommend
items based on users' earlier interactions rather than recent ones, and present
a reranking solution. Our ELMRec outperforms state-of-the-art (SOTA) methods in
both direct and sequential recommendations.
comment: Long paper accepted to EMNLP 2024 Main. 16 pages
♻ ☆ Information Extraction from Clinical Notes: Are We Ready to Switch to Large Language Models?
Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K. Keloth, Vincent J. Zhang, Ruey-Ling Weng, Qingyu Chen, Xiaoqian Jiang, Kirk E. Roberts, Hua Xu
Background: Information extraction (IE) is critical in clinical natural
language processing (NLP). While large language models (LLMs) excel on
generative tasks, their performance on extractive tasks remains debated.
Methods: We investigated Named Entity Recognition (NER) and Relation Extraction
(RE) using 1,588 clinical notes from four sources (UT Physicians, MTSamples,
MIMIC-III, and i2b2). We developed an annotated corpus covering 4 clinical
entities and 16 modifiers, and compared instruction-tuned LLaMA-2 and LLaMA-3
against BiomedBERT in terms of performance, generalizability, computational
resources, and throughput. Results: LLaMA models outperformed
BiomedBERT across datasets. With sufficient training data, LLaMA showed modest
improvements (1% on NER, 1.5-3.7% on RE); improvements were larger with limited
training data. On unseen i2b2 data, LLaMA-3-70B outperformed BiomedBERT by 7%
(F1) on NER and 4% on RE. However, LLaMA models required more computing
resources and ran up to 28 times slower. We implemented "Kiwi," a clinical IE
package featuring both models, available at https://kiwi.clinicalnlp.org/.
Conclusion: This study is among the first to develop and evaluate a
comprehensive clinical IE system using open-source LLMs. Results indicate that
LLaMA models outperform BiomedBERT for clinical NER and RE but with higher
computational costs and lower throughputs. These findings highlight that
choosing between LLMs and traditional deep learning methods for clinical IE
applications should remain task-specific, taking into account both performance
metrics and practical considerations such as available computing resources and
the intended use case scenarios.
♻ ☆ ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search NeurIPS 2024
Recent methodologies in LLM self-training mostly rely on an LLM generating
responses and filtering those with correct output answers as training data.
This approach often yields a low-quality fine-tuning training set (e.g.,
incorrect plans or intermediate reasoning). In this paper, we develop a
reinforced self-training approach, called ReST-MCTS*, based on integrating
process reward guidance with tree search MCTS* for collecting higher-quality
reasoning traces as well as per-step value to train policy and reward models.
ReST-MCTS* circumvents the per-step manual annotation typically used to train
process rewards by tree-search-based reinforcement learning: Given oracle final
correct answers, ReST-MCTS* is able to infer the correct process rewards by
estimating the probability that a given step can help lead to the correct answer. These
inferred rewards serve dual purposes: they act as value targets for further
refining the process reward model and also facilitate the selection of
high-quality traces for policy model self-training. We first show that the
tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior
LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same
search budget. We then show that by using traces searched by this tree-search
policy as training data, we can continuously enhance the three language models
for multiple iterations, and outperform other self-training algorithms such as
ReST$^\text{EM}$ and Self-Rewarding LM. We release all code at
https://github.com/THUDM/ReST-MCTS.
comment: Accepted to NeurIPS 2024
♻ ☆ SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models NeurIPS
Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, Jie Tang
Large Language Models (LLMs) have shown promise in assisting scientific
discovery. However, such applications are currently limited by LLMs'
deficiencies in understanding intricate scientific concepts, deriving symbolic
equations, and solving advanced numerical calculations. To bridge these gaps,
we introduce SciInstruct, a suite of scientific instructions for training
scientific language models capable of college-level scientific reasoning.
Central to our approach is a novel self-reflective instruction annotation
framework to address the data scarcity challenge in the science domain. This
framework leverages existing LLMs to generate step-by-step reasoning for
unlabelled scientific questions, followed by a process of self-reflective
critic-and-revise. Applying this framework, we curated a diverse and
high-quality dataset encompassing physics, chemistry, math, and formal proofs.
We analyze the curated SciInstruct from multiple interesting perspectives
(e.g., domain, scale, source, question type, answer length, etc.). To verify
the effectiveness of SciInstruct, we fine-tuned different language models with
SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B:
MetaMath, enhancing their scientific and mathematical reasoning capabilities,
without sacrificing the language understanding capabilities of the base model.
We release all code and SciInstruct at https://github.com/THUDM/SciGLM.
comment: Accepted to NeurIPS D&B Track 2024
♻ ★ Open Domain Question Answering with Conflicting Contexts
Siyi Liu, Qiang Ning, Kishaloy Halder, Wei Xiao, Zheng Qi, Phu Mon Htut, Yi Zhang, Neha Anna John, Bonan Min, Yassine Benajiba, Dan Roth
Open domain question answering systems frequently rely on information
retrieved from large collections of text (such as the Web) to answer questions.
However, such collections of text often contain conflicting information, and
indiscriminately depending on this information may result in untruthful and
inaccurate answers. To understand the gravity of this problem, we collect a
human-annotated dataset, Question Answering with Conflicting Contexts (QACC),
and find that as much as 25% of unambiguous, open domain questions can lead to
conflicting contexts when retrieved using Google Search. We evaluate and
benchmark three powerful Large Language Models (LLMs) with our dataset QACC and
demonstrate their limitations in effectively addressing questions with
conflicting information. To explore how humans reason through conflicting
contexts, we request our annotators to provide explanations for their
selections of correct answers. We demonstrate that by finetuning LLMs to
explain their answers, we can introduce richer information into their training
that guides them through the process of reasoning with conflicting contexts.
♻ ☆ Matching Patients to Clinical Trials with Large Language Models
Qiao Jin, Zifeng Wang, Charalampos S. Floudas, Fangyuan Chen, Changlin Gong, Dara Bracken-Clarke, Elisabetta Xue, Yifan Yang, Jimeng Sun, Zhiyong Lu
Patient recruitment is challenging for clinical trials. We introduce
TrialGPT, an end-to-end framework for zero-shot patient-to-trial matching with
large language models. TrialGPT comprises three modules: it first performs
large-scale filtering to retrieve candidate trials (TrialGPT-Retrieval); then
predicts criterion-level patient eligibility (TrialGPT-Matching); and finally
generates trial-level scores (TrialGPT-Ranking). We evaluate TrialGPT on three
cohorts of 183 synthetic patients with over 75,000 trial annotations.
TrialGPT-Retrieval can recall over 90% of relevant trials using less than 6% of
the initial collection. Manual evaluations on 1,015 patient-criterion pairs
show that TrialGPT-Matching achieves an accuracy of 87.3% with faithful
explanations, close to the expert performance. The TrialGPT-Ranking scores are
highly correlated with human judgments and outperform the best-competing models
by 43.8% in ranking and excluding trials. Furthermore, our user study reveals
that TrialGPT can reduce the screening time by 42.6% in patient recruitment.
Overall, these results have demonstrated promising opportunities for
patient-to-trial matching with TrialGPT.
comment: Nature Communications
♻ ☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
Qian Niu, Junyu Liu, Ziqian Bi, Pohsun Feng, Benji Peng, Keyu Chen, Ming Li, Lawrence KQ Yan, Yichao Zhang, Caitlyn Heqi Yin, Cheng Fei, Tianyang Wang, Yunze Wang, Silin Chen
This comprehensive review explores the intersection of Large Language Models
(LLMs) and cognitive science, examining similarities and differences between
LLMs and human cognitive processes. We analyze methods for evaluating LLMs'
cognitive abilities and discuss their potential as cognitive models. The review
covers applications of LLMs in various cognitive fields, highlighting insights
gained for cognitive science research. We assess cognitive biases and
limitations of LLMs, along with proposed methods for improving their
performance. The integration of LLMs with cognitive architectures is examined,
revealing promising avenues for enhancing artificial intelligence (AI)
capabilities. Key challenges and future research directions are identified,
emphasizing the need for continued refinement of LLMs to better align with
human cognition. This review provides a balanced perspective on the current
state and future potential of LLMs in advancing our understanding of both
artificial and human intelligence.
comment: 10 pages, 1 figure
♻ ☆ A Theoretical Understanding of Self-Correction through In-context Alignment NeurIPS 2024
Going beyond mimicking limited human experiences, recent studies show initial
evidence that, like humans, large language models (LLMs) are capable of
improving their abilities purely by self-correction, i.e., correcting previous
responses through self-examination, in certain circumstances. Nevertheless,
little is known about how such capabilities arise. In this work, based on a
simplified setup akin to an alignment task, we theoretically analyze
self-correction from an in-context learning perspective, showing that when LLMs
give relatively accurate self-examinations as rewards, they are capable of
refining responses in an in-context way. Notably, going beyond previous
theories on over-simplified linear transformers, our theoretical construction
underpins the roles of several key designs of realistic transformers for
self-correction: softmax attention, multi-head attention, and the MLP block. We
validate these findings extensively on synthetic datasets. Inspired by these
findings, we also illustrate novel applications of self-correction, such as
defending against LLM jailbreaks, where a simple self-correction step does make
a large difference. We believe that these findings will inspire further
research on understanding, exploiting, and enhancing self-correction for
building better foundation models.
comment: Accepted at NeurIPS 2024
♻ ☆ ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
Despite advancements in Natural Language Processing (NLP) and the growing
availability of pretrained models, the English language remains the primary
focus of model development. Continued pretraining on language-specific corpora
provides a practical solution for adapting models to other languages. However,
the impact of different pretraining settings on downstream tasks remains
underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the
continued pretraining of T5 models for Portuguese. We first develop a baseline
set of settings and pretrain models with sizes up to 3B parameters. Finetuning
on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR)
yields SOTA results on the latter two. We then explore the effects of different
pretraining configurations, including pretraining data quality, optimization
strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact
remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$
pretrained checkpoints and their MonoT5-based finetuned $\texttt{MonoPTT5}$
rerankers on HuggingFace in their respective collections at
https://huggingface.co/unicamp-dl.
♻ ☆ Redefining Proactivity for Information Seeking Dialogue
Information-Seeking Dialogue (ISD) agents aim to provide accurate responses
to user queries. While proficient in directly addressing user queries, these
agents, as well as LLMs in general, predominantly exhibit reactive behavior,
lacking the ability to generate proactive responses that actively engage users
in sustained conversations. However, existing definitions of proactive dialogue
in this context do not focus on how each response actively engages the user and
sustains the conversation. Hence, we present a new definition of proactivity
that focuses on enhancing the 'proactiveness' of each generated response via
the introduction of new information related to the initial query. To this end,
we construct a proactive dialogue dataset comprising 2,000 single-turn
conversations, and introduce several automatic metrics to evaluate response
'proactiveness' which achieved high correlation with human annotation.
Additionally, we introduce two innovative Chain-of-Thought (CoT) prompts, the
3-step CoT and the 3-in-1 CoT prompts, which consistently outperform standard
prompts by up to 90% in the zero-shot setting.
♻ ☆ MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation
Segmentation of anatomical structures and pathological regions in medical
images is essential for modern clinical diagnosis, disease research, and
treatment planning. While significant advancements have been made in deep
learning-based segmentation techniques, many of these methods still suffer from
limitations in data efficiency, generalizability, and interactivity. As a
result, developing precise segmentation methods that require fewer labeled
datasets remains a critical challenge in medical image analysis. Recently, the
introduction of foundation models like CLIP and Segment-Anything-Model (SAM),
with robust cross-domain representations, has paved the way for interactive and
universal image segmentation. However, further exploration of these models for
data-efficient segmentation in medical imaging is still needed and highly
relevant. In this paper, we introduce MedCLIP-SAMv2, a novel framework that
integrates the CLIP and SAM models to perform segmentation on clinical scans
using text prompts, in both zero-shot and weakly supervised settings. Our
approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard
Negative Noise Contrastive Estimation (DHN-NCE) loss, and leveraging the
Multi-modal Information Bottleneck (M2IB) to create visual prompts for
generating segmentation masks from SAM in the zero-shot setting. We also
investigate using zero-shot segmentation labels within a weakly supervised
paradigm to enhance segmentation quality further. Extensive testing across four
diverse segmentation tasks and medical imaging modalities (breast tumor
ultrasound, brain tumor MRI, lung X-ray, and lung CT) demonstrates the high
accuracy of our proposed framework. Our code is available at
https://github.com/HealthX-Lab/MedCLIP-SAMv2.
comment: 10 pages, 2 figures, 6 tables