MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models

dc.contributor.author: Kallakuri, Uttej
dc.contributor.author: Humes, Edward
dc.contributor.author: Rashid, Hasib-Al
dc.contributor.author: Mohsenin, Tinoosh
dc.date.accessioned: 2025-10-22T19:58:03Z
dc.date.issued: 2025-09-05
dc.description.abstract: Large Language Models (LLMs) have become foundational tools in natural language processing, achieving state-of-the-art performance across a variety of tasks. However, their immense size and computational requirements make them impractical for deployment in resource-constrained environments, such as edge devices and embedded systems. In this work, we introduce Magnitude and Gradient-Informed Pruning (MaGrIP), a novel framework for task-agnostic pruning and compression of LLMs. MaGrIP employs a dual-threshold strategy combining magnitude- and gradient-based saliency measures to efficiently prune redundant neurons while retaining task performance. Our results demonstrate the effectiveness of MaGrIP in compressing state-of-the-art models. The compression reduced the total computational complexity of the FFN layers from O(d · h) to O((d − q) · h). In terms of model size, our pruning approach significantly reduces both model parameters and storage requirements while maintaining competitive perplexity scores evaluated on WikiText-2. For the Gemma 7B model, our method reduces the total size from 28 GB to 5 GB, while for Gemma 2B, MaGrIP achieves a size reduction from 8 GB to 1.5 GB. MaGrIP furthermore exhibits robust performance across multiple benchmarks, such as BoolQ, ARC-E, and CSQA. Specifically, the pruned Gemma 7B model at 50% pruning achieved 59.26% accuracy on ARC-E compared to 81.06% for the baseline, and 64.74% accuracy on BoolQ compared to 59.98% for the baseline. Similarly, the pruned Llama 3 8B at 50% pruning achieved 46.76% accuracy on ARC-E compared to 77.57% for the baseline, reflecting the trade-off between compression and accuracy. LLMs compressed using MaGrIP, when deployed on the Nvidia Jetson Orin Nano, achieved a 2.16× improvement in throughput and a 2.3× improvement in performance compared to baseline LLMs.
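The dual-threshold idea described in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical interpretation in PyTorch, not the paper's implementation: it assumes per-neuron saliency is computed from mean weight magnitude and a first-order weight-times-gradient term, and that an FFN hidden neuron is removed only when both measures fall below their thresholds. The function and parameter names (prune_ffn_neurons, tau_mag, tau_grad) are placeholders.

```python
# Hypothetical sketch of dual-threshold magnitude/gradient pruning for one FFN block.
# Not the paper's implementation: MaGrIP's exact saliency measures and thresholds are
# defined in the article; this only illustrates the idea described in the abstract.
import torch
import torch.nn as nn


def neuron_saliency(up_proj: nn.Linear):
    """Per-hidden-neuron saliency: mean |weight| and a first-order |weight * grad| term."""
    w = up_proj.weight                                   # shape (d, h): one row per FFN hidden neuron
    g = w.grad if w.grad is not None else torch.zeros_like(w)
    mag = w.abs().mean(dim=1)                            # magnitude saliency, length d
    grad = (w * g).abs().mean(dim=1)                     # gradient-informed saliency, length d
    return mag, grad


def prune_ffn_neurons(up_proj: nn.Linear, down_proj: nn.Linear,
                      tau_mag: float, tau_grad: float):
    """Remove hidden neurons whose magnitude AND gradient saliency both fall below threshold."""
    mag, grad = neuron_saliency(up_proj)
    keep = (mag >= tau_mag) | (grad >= tau_grad)         # prune only when both measures agree
    idx = torch.where(keep)[0]

    new_up = nn.Linear(up_proj.in_features, idx.numel(), bias=up_proj.bias is not None)
    new_down = nn.Linear(idx.numel(), down_proj.out_features, bias=down_proj.bias is not None)
    with torch.no_grad():
        new_up.weight.copy_(up_proj.weight[idx])         # keep surviving rows of the up projection
        if up_proj.bias is not None:
            new_up.bias.copy_(up_proj.bias[idx])
        new_down.weight.copy_(down_proj.weight[:, idx])  # keep matching columns of the down projection
        if down_proj.bias is not None:
            new_down.bias.copy_(down_proj.bias)
    return new_up, new_down


# Toy usage: h = model width, d = FFN hidden width; pruning q of the d hidden neurons
# shrinks the per-layer cost from O(d * h) to O((d - q) * h), as stated in the abstract.
h, d = 16, 64
up, down = nn.Linear(h, d), nn.Linear(d, h)
x = torch.randn(8, h)
down(torch.relu(up(x))).pow(2).mean().backward()         # dummy calibration pass to populate grads
up_p, down_p = prune_ffn_neurons(up, down, tau_mag=0.05, tau_grad=1e-4)
print(up_p.weight.shape, down_p.weight.shape)
```

In this reading, gradients come from a small task-agnostic calibration pass before pruning, and the down projection is sliced to match the surviving hidden neurons so the block's input/output dimensions are unchanged; the actual MaGrIP procedure is detailed in the article.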
dc.description.sponsorship: Research was partly sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-24-2-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
dc.description.uri: https://dl.acm.org/doi/10.1145/3766068
dc.format.extent: 32 pages
dc.genre: journal articles
dc.identifier: doi:10.13016/m2ptek-yfpt
dc.identifier.citation: Kallakuri, Uttej, Edward Humes, Hasib-Al Rashid, and Tinoosh Mohsenin. “MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models.” ACM Trans. Embed. Comput. Syst., September 5, 2025. https://doi.org/10.1145/3766068.
dc.identifier.uri: https://doi.org/10.1145/3766068
dc.identifier.uri: http://hdl.handle.net/11603/40538
dc.language.iso: en
dc.publisher: ACM
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-9983-6929

Files

Original bundle

Name: MaGrIP.pdf
Size: 4.57 MB
Format: Adobe Portable Document Format