MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models
| dc.contributor.author | Kallakuri, Uttej | |
| dc.contributor.author | Humes, Edward | |
| dc.contributor.author | Rashid, Hasib-Al | |
| dc.contributor.author | Mohsenin, Tinoosh | |
| dc.date.accessioned | 2025-10-22T19:58:03Z | |
| dc.date.issued | 2025-09-05 | |
| dc.description.abstract | Large Language Models (LLMs) have become foundational tools in natural language processing, achieving state-of-the-art performance across a variety of tasks. However, their immense size and computational requirements make them impractical for deployment in resource-constrained environments, such as edge devices and embedded systems. In this work, we introduce Magnitude and Gradient-Informed Pruning (MaGrIP), a novel framework for task-agnostic pruning and compression of LLMs. MaGrIP employs a dual-threshold strategy combining magnitude- and gradient-based saliency measures to efficiently prune redundant neurons while retaining task performance. Our results demonstrate the effectiveness of MaGrIP in compressing state-of-the-art models. The compression reduced the total computational complexity of the FFN layers from O(d · h) to O((d − q) · h). In terms of model size, our pruning approach significantly reduces both model parameters and storage requirements while maintaining competitive perplexity scores evaluated on WikiText-2. For the Gemma 7B model, our method reduces the total size from 28 GB to 5 GB, while for Gemma 2B, MaGrIP achieves a size reduction from 8 GB to 1.5 GB. MaGrIP furthermore exhibits robust performance across multiple benchmarks, such as BoolQ, ARC-E, and CSQA. Specifically, the pruned Gemma 7B model at 50% pruning achieved 59.26% accuracy on ARC-E compared to 81.06% for the baseline, and 64.74% accuracy on BoolQ compared to 59.98% for the baseline. Similarly, the pruned Llama 3 8B at 50% pruning achieved 46.76% accuracy on ARC-E compared to 77.57% for the baseline, reflecting the trade-off between compression and accuracy. LLMs compressed using MaGrIP, when deployed on the Nvidia Jetson Orin Nano, achieved a 2.16× improvement in throughput and a 2.3× improvement in performance compared to baseline LLMs. (An illustrative sketch of the magnitude- and gradient-based saliency computation appears after this record.) | |
| dc.description.sponsorship | Research was partly sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-24-2-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. | |
| dc.description.uri | https://dl.acm.org/doi/10.1145/3766068 | |
| dc.format.extent | 32 pages | |
| dc.genre | journal articles | |
| dc.identifier | doi:10.13016/m2ptek-yfpt | |
| dc.identifier.citation | Kallakuri, Uttej, Edward Humes, Hasib-Al Rashid, and Tinoosh Mohsenin. “MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models.” ACM Trans. Embed. Comput. Syst., September 5, 2025. https://doi.org/10.1145/3766068. | |
| dc.identifier.uri | https://doi.org/10.1145/3766068 | |
| dc.identifier.uri | http://hdl.handle.net/11603/40538 | |
| dc.language.iso | en | |
| dc.publisher | ACM | |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
| dc.relation.ispartof | UMBC Student Collection | |
| dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department | |
| dc.rights | Attribution 4.0 International | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.title | MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models | |
| dc.type | Text | |
| dcterms.creator | https://orcid.org/0000-0002-9983-6929 | |
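The abstract above describes a dual-threshold strategy that combines magnitude- and gradient-based saliency to decide which FFN neurons to prune. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation: the specific saliency definitions, the percentile thresholds `tau_mag` and `tau_grad`, the way the two criteria are combined, and the toy calibration objective are all assumptions made for illustration.

```python
# Illustrative sketch only: saliency definitions, thresholds, and the
# calibration objective are assumptions, not the MaGrIP implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h = 64, 256  # toy model dimension and FFN hidden dimension
ffn = nn.Sequential(nn.Linear(d, h), nn.GELU(), nn.Linear(h, d))

# 1) Backward pass on a small calibration batch (dummy objective here)
#    so that gradient information is available for each weight.
x = torch.randn(8, d)
ffn(x).pow(2).mean().backward()

w_in = ffn[0].weight  # shape (h, d): row i feeds hidden neuron i

# 2) Per-neuron saliency scores: weight magnitude and gradient magnitude.
mag_score = w_in.abs().mean(dim=1)        # magnitude-based saliency
grad_score = w_in.grad.abs().mean(dim=1)  # gradient-based saliency

# 3) Dual-threshold decision (median thresholds and the OR-combination
#    are assumptions for this example).
tau_mag = torch.quantile(mag_score, 0.5)
tau_grad = torch.quantile(grad_score, 0.5)
keep = (mag_score > tau_mag) | (grad_score > tau_grad)
idx = keep.nonzero(as_tuple=True)[0]

# 4) Rebuild a smaller FFN containing only the kept intermediate neurons,
#    shrinking both weight matrices and the per-token FFN compute.
pruned_in = nn.Linear(d, idx.numel())
pruned_out = nn.Linear(idx.numel(), d)
with torch.no_grad():
    pruned_in.weight.copy_(w_in[idx])
    pruned_in.bias.copy_(ffn[0].bias[idx])
    pruned_out.weight.copy_(ffn[2].weight[:, idx])
    pruned_out.bias.copy_(ffn[2].bias)

print(f"kept {idx.numel()} of {h} FFN neurons")
```

In an actual LLM setting, the same scoring would presumably be applied per layer using gradients from calibration text rather than a dummy objective; removing q neurons shrinks the corresponding projection matrices, which is consistent with the complexity reduction from O(d · h) to O((d − q) · h) stated in the abstract.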