yInMem: A Parallel Distributed Indexed In-Memory Computation System for Big Data Analytics

Date

2017-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Cluster computing is experiencing a surge of interest in in-memory computing systems, driven by advances in hardware such as memory. However, in a typical cluster computing environment the network has the lowest bandwidth compared to memory and disk. In addition, the sparse nature of graph applications, such as social networks, imposes new challenges on in-memory computing systems, including data locality, workload balance, and memory management. As a result, fine control over data partitioning and data sharing plays a crucial role in improving the speed of large-scale data-parallel processing systems by reducing cross-node communication. To maximize performance, an in-memory computing system should offer optimized data throughput for parallel computation in large-scale data analytics.

This dissertation presents yInMem: a parallel, distributed, indexed, in-memory computing system for big data analytics. With the goal of building an in-memory computing system that enables optimal data partitioning and improves the efficiency of iterative machine learning and graph algorithms, yInMem bridges the gap between HPC and Hadoop by parallelizing computation with MPI while retaining the advantages of distributed data storage, such as a NoSQL database built on top of Hadoop. The novelty of yInMem comes from introducing indexes, or associative arrays, into the in-memory computing system. This design offers fine control over data distribution for parallel computation. By analyzing the linear algebra characteristics of iterative machine learning and graph algorithms, such as spectral clustering and PageRank, we find that yInMem is capable of maximizing the usage of computing resources in the cluster.
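The idea of index-driven data placement can be illustrated with a minimal sketch. This is a hypothetical illustration, not yInMem's actual API: a plain dictionary serves as the associative-array index, mapping each record key to the node that owns it, which is what gives the runtime fine control over data distribution.

```python
# Hypothetical sketch of index-based data placement (not yInMem's real API).
# The dict acts as the associative-array index: key -> owning node.

def build_index(row_ids, num_nodes):
    """Assign each row id to a node; the dict is the index."""
    index = {}
    for i, row in enumerate(row_ids):
        index[row] = i % num_nodes  # simple round-robin placement for illustration
    return index

index = build_index(["u1", "u2", "u3", "u4"], num_nodes=2)
# A lookup is O(1): the index tells the runtime where each row lives,
# so computation can be routed to the node that holds the data.
assert index["u3"] == 0
```

In a real system the placement function would be chosen to minimize cross-node communication rather than round-robin; the point here is only that an explicit index decouples data layout from the computation.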
Leveraging insights from Sparse Matrix-Vector Multiplication (SpMV), we also provide an optimal data partitioning algorithm on top of yInMem for load balancing and data locality. To evaluate yInMem, we investigate iterative machine learning and graph algorithms using both synthetic benchmarks and real user applications. In these evaluations, yInMem matches or exceeds the performance of existing specialized systems.
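A common way to pursue SpMV load balance is to partition rows so that each partition receives roughly the same number of nonzeros rather than the same number of rows. The sketch below is an illustrative greedy heuristic under that assumption, not the dissertation's exact algorithm; `nnz_per_row` and `partition_rows` are names introduced here.

```python
def partition_rows(nnz_per_row, num_parts):
    """Greedy 1D row partitioning that balances nonzeros per partition,
    a standard SpMV load-balancing heuristic (illustrative only)."""
    total = sum(nnz_per_row)
    target = total / num_parts  # ideal nonzero count per partition
    parts, current, acc = [], [], 0
    for row, nnz in enumerate(nnz_per_row):
        current.append(row)
        acc += nnz
        # Close this partition once it reaches the target,
        # leaving the last partition to absorb the remainder.
        if acc >= target and len(parts) < num_parts - 1:
            parts.append(current)
            current, acc = [], 0
    parts.append(current)
    return parts

# Rows with skewed nonzero counts, as in power-law graphs:
parts = partition_rows([5, 1, 1, 5, 4], num_parts=2)
# → [[0, 1, 2, 3], [4]]
```

Because sparse graph data is typically skewed (a few high-degree vertices hold most edges), nonzero-aware partitioning like this avoids assigning all the heavy rows to one node, which equal-row splitting would do.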