Benchmarking Resource Usage of Underlying Datatypes of Apache Spark
Date
2020-12-08
Citation of Original Publication
Nicholls, Brittany; Adangwa, Mariama; Estes, Rachel; Iradukunda, Hugues Nelson; Zhang, Qingquan; Zhu, Ting; Benchmarking Resource Usage of Underlying Datatypes of Apache Spark; Systems and Control (2020); https://arxiv.org/abs/2012.04192
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Abstract
The purpose of this paper is to examine how the
resource usage of an analytic is affected by the different
underlying datatypes of Spark analytics: Resilient Distributed
Datasets (RDDs), Datasets, and DataFrames. The resource usage
of an analytic is explored as a viable and preferred alternative to
execution time, the metric currently most common for
benchmarking big data analytics. The run time of
an analytic is shown not to be a reliably reproducible
metric, since many factors external to the job can affect the
execution time. Instead, metrics readily available through Spark,
including peak execution memory, are used to benchmark the
resource usage of these different datatypes in common
applications of Spark analytics, such as counting, caching,
repartitioning, and KMeans.
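
The contrast the abstract draws between run time and resource usage can be illustrated with a minimal sketch. It is not the authors' benchmark and does not use Spark: it assumes a toy in-process counting job and uses Python's standard `tracemalloc` module as a stand-in for Spark's peak execution memory metric. The names `count_words`, `benchmark`, and `relative_spread` are hypothetical, introduced only for this illustration.

```python
import time
import tracemalloc


def count_words(n_records):
    """Toy 'counting' analytic (assumption: stands in for a Spark count job)."""
    records = [i % 100 for i in range(n_records)]
    counts = {}
    for key in records:
        counts[key] = counts.get(key, 0) + 1
    return counts


def benchmark(job, runs=5, **kwargs):
    """Run `job` several times, recording wall-clock time and peak traced memory."""
    results = []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        job(**kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append({"seconds": elapsed, "peak_bytes": peak})
    return results


def relative_spread(values):
    """(max - min) / mean: a crude measure of run-to-run variation."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean if mean else 0.0


results = benchmark(count_words, runs=5, n_records=50_000)
time_spread = relative_spread([r["seconds"] for r in results])
memory_spread = relative_spread([r["peak_bytes"] for r in results])
print(f"time spread:   {time_spread:.3f}")
print(f"memory spread: {memory_spread:.3f}")
```

On a shared or busy machine the time spread will typically be noticeably larger than the memory spread, which is the kind of behaviour that motivates benchmarking on resource metrics. The exact figures depend on the environment, so none are quoted here.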