MPT: Multiple Parallel Tempering for High-Throughput MCMC Samplers
Loading...
Permanent Link
Author/Creator
Author/Creator ORCID
Date
Type of Work
Department
Program
Citation of Original Publication
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Abstract
This paper proposes “Multiple Parallel Tempering”
(MPT) as a class of Markov Chain Monte Carlo (MCMC) algorithm
for high-throughput hardware implementations. MCMC
algorithms are used to generate samples from target probability
densities and are commonly employed in stochastic processing
techniques such as Bayesian inference, and maximum likelihood
estimation, in which computing large amount of data in real-time
with high-throughput samplers is critical. For high-dimensional
and multi-modal probability densities, Parallel Tempering (PT)
MCMC has proven to have superior mixing and higher convergence
to the target distribution as compared to other popular
MCMC algorithms such as Metropolis-Hastings (MH). MPT
algorithm, proposed in this paper, imposes a new integer parameter,
D, to the original algorithm of PT. Such modification
changes one MCMC sampler into multiple independent kernels
that alternatively generate their set of samples one after another.
Our experimental results on Gaussian mixture models show
that for large values of D, the auto-correlation function of the
proposed MPT falls comparably to that of a PT sampler. A
fully configurable and pipelined hardware accelerator for the
proposed MPT, as well as PT are designed in Verilog HDL and
implemented on FPGA. The two algorithms are also written
in C language and evaluated on Multi-core CPU from the
TX2 SoC. Our implementation results indicate that by selecting
an appropriate value for D in our case study the sampling
throughput of the MPT can raise from 4.5 Msps in PT to
135 Msps on average, an amount near maximum achievable
frequency of the target FPGA, which is about 1470x higher
than when implementing on fully exploited Multi-core CPU.