Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

Das, Nilanjana; Raff, Edward; Gaur, Manas

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

dc.contributor.author	Das, Nilanjana
dc.contributor.author	Raff, Edward
dc.contributor.author	Gaur, Manas
dc.date.accessioned	2024-08-20T13:45:09Z
dc.date.available	2024-08-20T13:45:09Z
dc.date.issued	2024-07-25
dc.description.abstract	Previous research on testing the vulnerabilities in Large Language Models (LLMs) using adversarial attacks has primarily focused on nonsensical prompt injections, which are easily detected upon manual or automated review (e.g., via byte entropy). However, the exploration of innocuous human-understandable malicious prompts augmented with adversarial injections remains limited. In this research, we explore converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. This allows us to show suffix conversion without any gradients, using only LLMs to perform the attacks, and thus better understand the scope of possible risks. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. The situations are extracted from the IMDB dataset, and prompts are defined following a few-shot chain-of-thought prompting. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs. We find that across many LLMs, as few as 1 attempt produces an attack and that these attacks transfer between LLMs. The link to our code is available at \url{https://anonymous.4open.science/r/Situation-Driven-Adversarial-Attacks-7BB1/README.md}.
dc.description.uri	http://arxiv.org/abs/2407.14644
dc.format.extent	12 pages
dc.genre	journal articles
dc.genre	preprints
dc.identifier	doi:10.13016/m2onjq-xyy7
dc.identifier.uri	https://doi.org/10.48550/arXiv.2407.14644
dc.identifier.uri	http://hdl.handle.net/11603/35697
dc.language.iso	en_US
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof	UMBC Data Science
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	UMBC Ebiquity Research Group
dc.subject	UMBC Discovery, Research, and Experimental Analysis of Malware Lab (DREAM Lab)
dc.subject	UMBC Interactive Robotics and Language Lab (IRAL Lab)
dc.subject	Computer Science - Computation and Language
dc.title	Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
dc.type	Text
dcterms.creator	https://orcid.org/0000-0002-9900-1972
dcterms.creator	https://orcid.org/0000-0002-5411-2230

Files

Original bundle

Now showing 1 - 1 of 1

Name:: 2407.14644v2.pdf
Size:: 282.2 KB
Format:: Adobe Portable Document Format

Download

Collections

UMBC Faculty Collection
UMBC Computer Science and Electrical Engineering Department
UMBC Data Science