Authors: Das, Nilanjana; Raff, Edward; Gaur, Manas
Date accessioned: 2025-01-31
Date available: 2025-01-31
Date issued: 2024-12-20
DOI: https://doi.org/10.48550/arXiv.2412.16359
Handle: http://hdl.handle.net/11603/37607
Abstract: Previous research on LLM vulnerabilities has often relied on nonsensical adversarial prompts, which are easily detectable by automated methods. We address this gap by focusing on human-readable adversarial prompts, a more realistic and potent threat. Our key contributions are: (1) situation-driven attacks that leverage movie scripts to create contextually relevant, human-readable prompts that successfully deceive LLMs; (2) adversarial suffix conversion, which transforms nonsensical adversarial suffixes into meaningful text; and (3) AdvPrompter with p-nucleus sampling, a method for generating diverse, human-readable adversarial suffixes that improves attack efficacy against models such as GPT-3.5 and Gemma 7B. Our findings demonstrate that sophisticated adversaries can trick LLMs into producing harmful responses using human-readable adversarial prompts, and that there remains scope for improving LLM robustness.
Extent: 18 pages
Language: en-US
Rights: Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/)
Subjects: Computer Science - Artificial Intelligence; UMBC Ebiquity Research Group; UMBC Discovery, Research, and Experimental Analysis of Malware Lab (DREAM Lab); UMBC Interactive Robotics and Language Lab (IRAL Lab); Computer Science - Computation and Language
Title: Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
Type: Text
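The abstract names p-nucleus (top-p) sampling as the mechanism used with AdvPrompter to diversify suffix generation. The sketch below illustrates only that generic sampling step, not the paper's implementation; the function name p_nucleus_sample, the toy vocabulary, and the logit values are illustrative assumptions.

```python
import numpy as np

def p_nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token index from the smallest set of highest-probability
    tokens whose cumulative probability reaches p (nucleus / top-p sampling)."""
    rng = rng or np.random.default_rng()
    # Softmax over the logits to get a probability distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort tokens by probability, descending
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches p
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    # Renormalize within the nucleus and sample
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy usage: pick the next suffix token from a small, hypothetical vocabulary
vocab = ["please", "ignore", "the", "instructions", "above"]
logits = np.array([2.0, 1.5, 0.3, 0.9, 0.1])
print(vocab[p_nucleus_sample(logits, p=0.9)])
```

Restricting sampling to the top-p nucleus, rather than taking the single most likely token, is what yields diverse yet still fluent candidate suffixes.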