Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation
| Field | Value |
| --- | --- |
| dc.contributor.author | Hakim, Safayat Bin |
| dc.contributor.author | Gharami, Kanchon |
| dc.contributor.author | Ghalaty, Nahid Farhady |
| dc.contributor.author | Moni, Shafika Showkat |
| dc.contributor.author | Xu, Shouhuai |
| dc.contributor.author | Song, Houbing |
| dc.date.accessioned | 2026-02-03T18:14:30Z |
| dc.date.issued | 2026-01-06 |
| dc.description.abstract | Large Language Models (LLMs) excel at natural language understanding and generation, but their deployment introduces critical security risks through jailbreak attacks that circumvent safety alignment. This survey provides the first unified, rigorous systematization of the LLM security threat landscape (2022-2025), synthesizing research across premier security and AI venues. We introduce a comprehensive taxonomy of jailbreak vectors, from elementary prompt manipulation to sophisticated multimodal exploits. We find a persistent asymmetry between attack sophistication and defensive capability: advanced automated attacks routinely achieve 90-99% success on open-weight models, while black-box attacks reach 80-94% effectiveness on proprietary models. Agent-driven multi-turn attacks demonstrate 95% success by decomposing harmful queries across conversation turns. Embodied AI vulnerabilities enable jailbreaks to trigger harmful physical actions on robotic platforms, expanding threats beyond digital domains. Defenses show fundamental limits: feedback-based attacks often retain residual success above 15% even against layered protections. Evaluation remains unreliable, with automated judge agreement varying from 70% to 93% depending on implementation, weakening confidence in security assessments. By examining attack transferability, computational costs, and defense overhead, we identify gaps in multimodal safety protocols, agent-based security frameworks, and mechanistic interpretability. Robust LLM security requires shifting from reactive mitigation to proactive security-by-design architectures that integrate constitutional AI principles with formal verification. |
| dc.description.uri | https://www.authorea.com/users/1011181/articles/1373070-jailbreaking-llms-a-survey-of-attacks-defenses-and-evaluation?commit=29f9cc521b895e7451101b71a6e557d48b6b2a9f |
| dc.format.extent | 39 pages |
| dc.genre | journal articles |
| dc.genre | preprints |
| dc.identifier | doi:10.13016/m2ixjy-kne6 |
| dc.identifier.uri | https://doi.org/10.36227/techrxiv.176773228.86819800/v1 |
| dc.identifier.uri | http://hdl.handle.net/11603/41625 |
| dc.language.iso | en |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) |
| dc.relation.ispartof | UMBC Faculty Collection |
| dc.relation.ispartof | UMBC Information Systems Department |
| dc.relation.ispartof | UMBC Student Collection |
| dc.rights | Attribution 4.0 International |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.subject | UMBC Security and Optimization for Networked Globe Laboratory (SONG Lab) |
| dc.title | Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation |
| dc.type | Text |
| dcterms.creator | https://orcid.org/0000-0002-8119-7911 |
| dcterms.creator | https://orcid.org/0000-0003-2631-9223 |