
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Published on Mar 11 · Submitted by parishadbehnam on Mar 12

Abstract

Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and in a retrieval-augmented generation (RAG) setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that, given malicious requests, most retrievers can select relevant harmful passages for more than 50% of queries. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.

Community


Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 💣

Retrievers need to be aligned too! 🚨

✨ AdvBench-IR
We create AdvBench-IR to evaluate whether retrievers such as LLM2Vec and NV-Embed can select relevant harmful text from large corpora for a diverse range of malicious requests.
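For context, here is a minimal sketch of the kind of dense-retrieval setup AdvBench-IR evaluates. It uses a small generic sentence encoder rather than the paper's LLM2Vec/NV-Embed retrievers, and the corpus and query are placeholders:

```python
# Minimal dense-retrieval sketch: embed a query and a small corpus, rank
# passages by cosine similarity, and keep the top-k. The encoder below is a
# small generic model chosen for illustration, not one of the retrievers
# evaluated in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "Placeholder passage A.",
    "Placeholder passage B.",
    "Placeholder passage C.",
]
query = "placeholder query"

# Encode corpus and query into normalized embeddings.
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Rank passages by cosine similarity and keep the top-k.
scores = util.cos_sim(query_emb, corpus_emb)[0]  # one score per passage
top_k = scores.topk(k=2)
for score, idx in zip(top_k.values, top_k.indices):
    print(f"{score:.3f}  {corpus[int(idx)]}")
```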

✨ Direct Malicious Retrieval
LLM-based retrievers correctly select malicious passages in their top-5 results for more than 78% of AdvBench-IR queries, a concerning level of capability. We also find that LLM safety alignment transfers poorly to retrieval. ⚠️
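A hypothetical helper (not from the paper's released code) showing how a top-5 number like the one above is typically computed: a query counts as a hit if its gold passage appears among the five highest-ranked passages:

```python
def top_k_accuracy(ranked_ids_per_query, gold_id_per_query, k=5):
    """Fraction of queries whose gold passage appears in the top-k ranking."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
    )
    return hits / len(gold_id_per_query)

# Toy usage with made-up rankings and gold labels.
ranked = [["p3", "p7", "p1", "p9", "p2"], ["p5", "p4", "p8", "p6", "p0"]]
gold = ["p1", "p0"]
print(top_k_accuracy(ranked, gold, k=5))  # -> 1.0
```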

✨ Exploiting Instruction-Following Ability
Using fine-grained queries, a malicious user can steer the retriever to select specific passages that precisely match their malicious intent (e.g., constructing an explosive device with specific materials). 😈
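As a rough illustration of the mechanism: instruction-following retrievers accept a task instruction alongside the query, so extra constraints can be folded into the retrieval request. The Instruct:/Query: template below follows E5-Mistral-style embedding models; the exact format differs per retriever, and the content here is deliberately benign:

```python
def build_query(task_instruction: str, query: str) -> str:
    # "Instruct:/Query:" template used by E5-Mistral-style embedding models;
    # other instruction-following retrievers expect their own formats.
    return f"Instruct: {task_instruction}\nQuery: {query}"

# Benign illustration: the instruction steers retrieval toward passages that
# satisfy an extra constraint, not just any passage on the broad topic.
text = build_query(
    "Retrieve a passage that answers the question and mentions PyTorch specifically",
    "How do I fine-tune a transformer model?",
)
print(text)
```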

✨ RAG-based exploitation
Using a RAG-based approach, even LLMs optimized for safety respond to malicious requests when harmful passages are provided in-context to ground their generation (e.g., Llama3 generates harmful responses to 67.12% of the queries with retrieval). 😬
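A minimal sketch of the RAG setup being exploited, with placeholder content: retrieved passages are injected into the prompt so the LLM grounds its answer on them. The chat-message format and system instruction are assumptions for illustration, not the paper's exact pipeline:

```python
# Placeholder passages and question; in the attack described above these would
# be the harmful passages returned by the retriever.
retrieved_passages = [
    "Passage 1: ...",
    "Passage 2: ...",
]
question = "placeholder question"

# Concatenate retrieved passages into the prompt as grounding context.
context = "\n\n".join(retrieved_passages)
messages = [
    {"role": "system",
     "content": "Answer the question using only the provided passages."},
    {"role": "user",
     "content": f"Passages:\n{context}\n\nQuestion: {question}"},
]

# `messages` can then be passed to any chat LLM (e.g., via a chat template in
# Hugging Face transformers); the paper's finding is that grounding on
# retrieved harmful passages can override the model's refusal behavior.
print(messages[1]["content"])
```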

Check out our paper for more details.
Paper: https://arxiv.org/abs/2503.08644
Data: https://huggingface.co/datasets/McGill-NLP/AdvBench-IR
Code: https://github.com/McGill-NLP/malicious-ir
Webpage: https://mcgill-nlp.github.io/malicious-ir/
