Comparison of Distributed Data-Parallelization Patterns for Big Data Analysis: A Bioinformatics Case Study
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Abstract
As a distributed data-parallelization (DDP) pattern, MapReduce has been adopted by many new big data analysis tools to achieve good scalability and performance in cluster or cloud environments. This paper explores how two binary DDP patterns, i.e., CoGroup and Match, could also be used in these tools. We reimplemented an existing bioinformatics tool, called CloudBurst, with three different DDP pattern combinations. We identify two factors, namely input data balancing and value sparseness, which could greatly affect performance under different DDP patterns. Our experiments show that (i) a simple DDP pattern switch could nearly double performance, and (ii) the identified factors explain the differences well.
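To make the two binary patterns concrete, the following is a minimal single-machine sketch in Python. It assumes the commonly used semantics of these patterns: CoGroup calls the user function once per key with all values from both inputs, and Match calls it once per pair of values that share a key. The function names, the toy seed data, and the call counters are hypothetical and are not taken from the paper or from CloudBurst.

```python
from collections import defaultdict
from itertools import product


def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def cogroup(left, right, fn):
    """Assumed CoGroup semantics: one call per key found in either input."""
    lg, rg = group_by_key(left), group_by_key(right)
    out = []
    for key in sorted(set(lg) | set(rg)):
        out.extend(fn(key, lg.get(key, []), rg.get(key, [])))
    return out


def match(left, right, fn):
    """Assumed Match semantics: one call per value pair sharing a key."""
    lg, rg = group_by_key(left), group_by_key(right)
    out = []
    for key in sorted(set(lg) & set(rg)):
        for lv, rv in product(lg[key], rg[key]):
            out.extend(fn(key, lv, rv))
    return out


# Hypothetical toy data: (seed, read id) and (seed, reference offset).
reads = [("ACG", "r1"), ("TTT", "r2")]
reference = [("ACG", 0), ("CGT", 1), ("ACG", 7)]

calls = {"cogroup": 0, "match": 0}


def per_key(seed, read_ids, offsets):
    # Sees every value for the seed; either list may be empty.
    calls["cogroup"] += 1
    return [(seed, r, o) for r in read_ids for o in offsets]


def per_pair(seed, read_id, offset):
    # Sees exactly one value from each input.
    calls["match"] += 1
    return [(seed, read_id, offset)]


cogroup_hits = cogroup(reads, reference, per_key)
match_hits = match(reads, reference, per_pair)
```

In this toy run both patterns produce the same hits, but the user function is invoked differently: CoGroup is also called for seeds that occur in only one input, while Match is called only for seeds present in both. This is meant only to illustrate why the choice of pattern can change the work done for the same result.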
