MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models
Citation of Original Publication
Kallakuri, Uttej, Edward Humes, Hasib-Al Rashid, and Tinoosh Mohsenin. “MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models.” ACM Trans. Embed. Comput. Syst., September 5, 2025. https://doi.org/10.1145/3766068.
Rights
Attribution 4.0 International
Abstract
Large Language Models (LLMs) have become foundational tools in natural language processing, achieving state-of-the-art performance across a variety of tasks. However, their immense size and computational requirements make them impractical for deployment in resource-constrained environments, such as edge devices and embedded systems. In this work, we introduce Magnitude and Gradient-Informed Pruning (MaGrIP), a novel framework for task-agnostic pruning and compression of LLMs. MaGrIP employs a dual-threshold strategy combining magnitude- and gradient-based saliency measures to efficiently prune redundant neurons while retaining task performance. Our results demonstrate the effectiveness of MaGrIP in compressing state-of-the-art models: pruning reduces the total computational complexity of the FFN layers from O(d·h) to O((d−q)·h), where q is the number of pruned neurons. In terms of model size, our pruning approach significantly reduces both parameter count and storage requirements while maintaining competitive perplexity scores on WikiText-2. For the Gemma 7B model, our method reduces the total size from 28 GB to 5 GB, while for Gemma 2B, MaGrIP achieves a size reduction from 8 GB to 1.5 GB. MaGrIP furthermore exhibits robust performance across multiple benchmarks, such as BoolQ, ARC-E, and CSQA. Specifically, the pruned Gemma 7B model at 50% pruning achieved 59.26% accuracy on ARC-E compared to 81.06% for the baseline, and 64.74% accuracy on BoolQ compared to 59.98% for the baseline. Similarly, the pruned Llama 3 8B at 50% pruning achieved 46.76% accuracy on ARC-E compared to 77.57% for the baseline, reflecting the trade-off between compression and accuracy. LLMs compressed using MaGrIP, when deployed on the NVIDIA Jetson Orin Nano, achieved a 2.16× improvement in throughput and a 2.3× improvement in performance compared to baseline LLMs.
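The abstract names the dual-threshold rule but not the exact saliency formulas, so the following is a minimal PyTorch sketch of one plausible reading, not the paper's implementation. Here magnitude saliency is assumed to be the L2 norm of each neuron's weight row and gradient saliency the L2 norm of its accumulated calibration gradients; the function name `magrip_prune_mask`, the quantile thresholds, and the tensor shapes are all illustrative assumptions.

```python
import torch

def magrip_prune_mask(weight: torch.Tensor,
                      grad: torch.Tensor,
                      mag_quantile: float = 0.5,
                      grad_quantile: float = 0.5) -> torch.Tensor:
    """Return a boolean keep-mask over FFN neurons (rows of `weight`).

    Dual-threshold rule (assumed reading): a neuron is pruned only if
    BOTH its magnitude saliency and its gradient saliency fall below
    their respective thresholds.
    """
    # Per-neuron saliency scores (illustrative definitions).
    mag_score = torch.linalg.vector_norm(weight, ord=2, dim=1)
    grad_score = torch.linalg.vector_norm(grad, ord=2, dim=1)

    # Thresholds set as quantiles of the score distributions.
    mag_thresh = torch.quantile(mag_score, mag_quantile)
    grad_thresh = torch.quantile(grad_score, grad_quantile)

    # Prune only neurons that are weak under BOTH criteria.
    prune = (mag_score < mag_thresh) & (grad_score < grad_thresh)
    return ~prune  # True = keep


if __name__ == "__main__":
    # Hypothetical FFN projection shapes, chosen for illustration only.
    h, d = 11008, 4096
    W = torch.randn(h, d)
    G = torch.randn(h, d)  # stand-in for gradients from a calibration pass
    keep = magrip_prune_mask(W, G)
    W_pruned = W[keep]  # removing q rows yields the O((d−q)·h)-style reduction
    print(f"kept {int(keep.sum())} of {h} neurons")
```

Under this reading, removing q neurons shrinks the pruned projection's multiply-accumulate count proportionally, which is the source of the abstract's O(d·h) to O((d−q)·h) reduction.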
