QUANTIZED LARGE LANGUAGE MODELS FOR MENTAL HEALTH APPLICATIONS: A BENCHMARK STUDY ON EFFICIENCY, ACCURACY AND RESOURCE ALLOCATION

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu.
Distribution Rights granted to UMBC by the author.

Abstract

Quantization is a technique that compresses numerical representations to reduce space requirements, though it may sacrifice some precision. Albeit this lossy compression method can improve efficiency, it often comes at the cost of performance. Large Language Models (LLMs) are computationally intensive, posing challenges for users with limited hardware resources. However, advancements in fine-tuning strategies such as QLoRa, LLM.int8(), GGUF, GGML, llama.cpp, and various quantization techniques (8/4-bit, NF4, FP16/32/64, BF16, bitsandbytes) have democratized access to LLMs by reducing the resource burden. LLM weights are typically stored as floating-point numbers, and quantization reduces the precision of these weights to decrease the model’s resource requirements. While this can significantly reduce model size, it may also impact accuracy due to the compressed representation of weights. Lower levels of quantization result in smaller models but may lead to diminished performance. The findings from this research provide critical insights into the viability of using quantized LLMs in sensitive domains like mental health. They highlight the importance of balancing explanation quality with computational efficiency. This benchmarking effort lays the groundwork for deploying effective and resource-efficient LLMs in mental health applications, ultimately supporting professionals and patients with reliable AI-driven insights. As the study progresses, models will be trained sequentially in groups, categorized by familiessuch as LLAMA, Phi, Mixtral, Hermes, Falcon, Gemma, Qwen, and others. This research explores the trade-off between weight precision and model accuracy, aiming to better understand the challenges and potential of quantized LLMs in mental health applications. All models were trained and tested with the generous support of the University of Maryland, Baltimore County’s High-Performance Computing Facility, which provided GPU-accelerated resources.