Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation

dc.contributor.author: Hakim, Safayat Bin
dc.contributor.author: Gharami, Kanchon
dc.contributor.author: Ghalaty, Nahid Farhady
dc.contributor.author: Moni, Shafika Showkat
dc.contributor.author: Xu, Shouhuai
dc.contributor.author: Song, Houbing
dc.date.accessioned: 2026-02-03T18:14:30Z
dc.date.issued: 2026-01-06
dc.description.abstract: Large Language Models (LLMs) excel at natural language understanding and generation, but deployment introduces critical security risks through jailbreak attacks that circumvent safety alignment. This survey provides the first unified, rigorous systematization of the LLM security threat landscape (2022-2025), synthesizing research across premier security and AI venues. We introduce a comprehensive taxonomy of jailbreak vectors, from elementary prompt manipulation to sophisticated multimodal exploits. We find a persistent asymmetry between attack sophistication and defensive capability: advanced automated attacks routinely achieve 90-99% success on open-weight models, while black-box attacks reach 80-94% effectiveness on proprietary models. Agent-driven multi-turn attacks demonstrate 95% success by decomposing harmful queries across conversation turns. Embodied AI vulnerabilities enable jailbreaks to trigger harmful physical actions in robotic platforms, expanding threats beyond digital domains. Defenses show fundamental limits: feedback-based attacks often retain residual success above 15% even against layered protections. Evaluation remains unreliable, with automated judge agreement varying from 70% to 93% depending on implementation, weakening confidence in security assessments. By examining attack transferability, computational costs, and defense overhead, we identify gaps in multimodal safety protocols, agent-based security frameworks, and mechanistic interpretability. Robust LLM security requires shifting from reactive mitigation to proactive security-by-design architectures integrating constitutional AI principles with formal verification.
dc.description.uri: https://www.authorea.com/users/1011181/articles/1373070-jailbreaking-llms-a-survey-of-attacks-defenses-and-evaluation?commit=29f9cc521b895e7451101b71a6e557d48b6b2a9f
dc.format.extent: 39 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2ixjy-kne6
dc.identifier.uri: https://doi.org/10.36227/techrxiv.176773228.86819800/v1
dc.identifier.uri: http://hdl.handle.net/11603/41625
dc.language.iso: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Information Systems Department
dc.relation.ispartof: UMBC Student Collection
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: UMBC Security and Optimization for Networked Globe Laboratory (SONG Lab)
dc.title: Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-8119-7911
dcterms.creator: https://orcid.org/0000-0003-2631-9223

Files

Original bundle

Name: 13730701.pdf
Size: 2.47 MB
Format: Adobe Portable Document Format