Gradient Properties of Hard Thresholding Operator
Date
2022-09-17
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution 4.0 International (CC BY 4.0)
Abstract
Sparse optimization has received increasing attention in many applications such as
compressed sensing, variable selection in regression problems, and, more recently,
neural network compression in machine learning. For example, the problem of compressing
a neural network is a bi-level, stochastic, and nonconvex problem that can be
cast as a sparse optimization problem. Hence, developing efficient methods for
sparse optimization plays a critical role in applications. The goal of this paper is to
develop analytical techniques for general, large-scale sparse optimization problems
using the hard thresholding operator. To this end, we study the iterative hard
thresholding (IHT) algorithm, which has been investigated extensively in the literature
because it is scalable, fast, and easy to implement. Despite this extensive body of
research on the IHT scheme, we develop several new techniques that not only recover many
known results but also lead to new ones. Specifically, we first establish a new
and critical gradient descent property of the hard thresholding (HT) operator. This
result relates the gradient to the distance between sparse points: while the distance
between sparse points cannot provide any information about the gradient in the sparse
setting, to the best of our knowledge the converse direction (from the gradient to the
distance) has not previously been shown in the literature.
Moreover, our gradient descent property allows one to study the IHT algorithm when the
stepsize is less than or equal to 1/L, where L > 0 is the Lipschitz constant of the
gradient of the objective function; the existing techniques in the literature can handle
only the case when the stepsize is strictly less than 1/L. Exploiting this, we introduce
and study HT-stable and HT-unstable stationary points and show that, no matter how close
an initialization is to an HT-unstable stationary point (a saddle point in the sparse
sense), the IHT sequence leaves it. Finally, we show that no matter what sparse initial
point is selected, the IHT sequence converges if the function values at HT-stable
stationary points are distinct; the latter is a new assumption that has not appeared in
the literature. We provide a video of 4,000 independent runs in which the IHT algorithm
is initialized very close to an HT-unstable stationary point, showing that the sequences
escape it.
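For reference, below is a minimal NumPy sketch of the hard thresholding operator H_k
(which keeps the k largest-magnitude entries of a vector and zeroes out the rest) and of
the IHT iteration x_{t+1} = H_k(x_t - step * grad(x_t)) that the abstract refers to. The
function names, signatures, and default iteration count are illustrative assumptions, not
taken from the paper.

    import numpy as np

    def hard_threshold(x, k):
        """H_k(x): keep the k largest-magnitude entries of x, zero out the rest."""
        out = np.zeros_like(x)
        keep = np.argsort(np.abs(x))[-k:]   # indices of the k largest |x_i|
        out[keep] = x[keep]
        return out

    def iht(grad, x0, k, step, n_iters=500):
        """Iterative hard thresholding: x_{t+1} = H_k(x_t - step * grad(x_t))."""
        x = hard_threshold(x0, k)           # start from a k-sparse point
        for _ in range(n_iters):
            x = hard_threshold(x - step * grad(x), k)
        return x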
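Continuing the sketch above, a hypothetical usage on a k-sparse least-squares instance
shows where the Lipschitz constant L and the boundary stepsize 1/L come from; the problem
data, dimensions, and sparsity level are invented for illustration.

    # Hypothetical instance: f(x) = 0.5 * ||A @ x - b||^2, whose gradient is
    # A.T @ (A @ x - b) and is Lipschitz with constant L = ||A.T @ A||_2
    # (the largest eigenvalue of A.T @ A).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 100))
    x_true = np.zeros(100)
    x_true[:5] = rng.standard_normal(5)     # a 5-sparse ground truth
    b = A @ x_true

    L = np.linalg.norm(A.T @ A, 2)          # spectral norm of A.T @ A
    grad = lambda x: A.T @ (A @ x - b)

    # Stepsize exactly 1/L: the boundary case the abstract highlights, which
    # earlier analyses (requiring stepsize strictly less than 1/L) do not cover.
    x_hat = iht(grad, x0=np.zeros(100), k=5, step=1.0 / L)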