A database-based distributed computation architecture with Accumulo and D4M: An application of eigensolver for large sparse matrix

Author/Creator ORCID

Date

2015-12-28

Department

Program

Citation of Original Publication

Yin Huang, Yelena Yesha, and Shujia Zhou, A database-based distributed computation architecture with Accumulo and D4M: An application of eigensolver for large sparse matrix, 2015 IEEE International Conference on Big Data (Big Data) , DOI: 10.1109/BigData.2015.7364045

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
© 2015 IEEE

Abstract

NoSQL distributed databases have been devised to tackle the challenges resulting from volume, velocity and variety of big data. Graph representation of datasets requires efficient distributed linear algebra operations for large sparse matrix constructed from big data. Storing the transformed matrix into the database not only speeds up the big data analysis process but also facilitates the computation because of indexing. The Hadoop based approach does not natively support iterative algorithms due to data shuffling during each iteration. This paper presents a novel database-based distributed computation architecture bridging the gap between Hadoop and HPC. The novelty results from exploring the indexing capability of D4M (Dynamic Distributed Dimensional Data Model) to support linear algebra operations in a distributed computation environment. The idea is to store input data and intermediate results in associative array format inside Accumulo table to facilitate the data sharing among working nodes. pMatlab is deployed as the parallel computation engine. Our proposed architecture is proved to be lighter, easier and faster than MapReduce based approach. One example application is calculating top k eigenvalues and eigenvectors for large sparse matrix. Experiments on Graph500 benchmark datasets demonstrate 2X speedup of our architecture as compared to HEIGEN (An eigensolver for billion-scale matrices using MapReduce).