Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation

dc.contributor.author: Hakim, Safayat Bin
dc.contributor.author: Gharami, Kanchon
dc.contributor.author: Ghalaty, Nahid Farhady
dc.contributor.author: Moni, Shafika Showkat
dc.contributor.author: Xu, Shouhuai
dc.contributor.author: Song, Houbing
dc.date.accessioned: 2026-02-03T18:14:30Z
dc.date.issued: 2026-01-06
dc.description.abstract: Large Language Models (LLMs) excel at natural language understanding and generation, but deployment introduces critical security risks through jailbreak attacks that circumvent safety alignment. This survey provides the first unified, rigorous systematization of the LLM security threat landscape (2022-2025), synthesizing research across premier security and AI venues. We introduce a comprehensive taxonomy of jailbreak vectors, from elementary prompt manipulation to sophisticated multimodal exploits. We find a persistent asymmetry between attack sophistication and defensive capability: advanced automated attacks routinely achieve 90-99% success on open-weight models, while black-box attacks reach 80-94% effectiveness on proprietary models. Agent-driven multi-turn attacks demonstrate 95% success by decomposing harmful queries across conversation turns. Embodied AI vulnerabilities enable jailbreaks to trigger harmful physical actions in robotic platforms, expanding threats beyond digital domains. Defenses show fundamental limits: feedback-based attacks often retain residual success above 15% even against layered protections. Evaluation remains unreliable, with automated judge agreement varying from 70% to 93% depending on implementation, weakening confidence in security assessments. By examining attack transferability, computational costs, and defense overhead, we identify gaps in multimodal safety protocols, agent-based security frameworks, and mechanistic interpretability. Robust LLM security requires shifting from reactive mitigation to proactive security-by-design architectures integrating constitutional AI principles with formal verification.
dc.description.uri: https://www.authorea.com/users/1011181/articles/1373070-jailbreaking-llms-a-survey-of-attacks-defenses-and-evaluation?commit=29f9cc521b895e7451101b71a6e557d48b6b2a9f
dc.format.extent: 39 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2ixjy-kne6
dc.identifier.uri: https://doi.org/10.36227/techrxiv.176773228.86819800/v1
dc.identifier.uri: http://hdl.handle.net/11603/41625
dc.language.iso: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Information Systems Department
dc.relation.ispartof: UMBC Student Collection
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: UMBC Security and Optimization for Networked Globe Laboratory (SONG Lab)
dc.title: Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-8119-7911
dcterms.creator: https://orcid.org/0000-0003-2631-9223

Files

Original bundle

Name: 13730701.pdf
Size: 2.47 MB
Format: Adobe Portable Document Format