Efficient Scientific Big Data Aggregation through Parallelization and Subsampling

Date

2019-01-01

Department

Information Systems

Program

Information Systems

Rights

Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Abstract

Title of Document: EFFICIENT SCIENTIFIC BIG DATA AGGREGATION THROUGH PARALLELIZATION AND SUBSAMPLING

Savio Sebastian Kay, Master of Science, Information Systems, 2019

Directed By: Jianwu Wang, Assistant Professor, Department of Information Systems, University of Maryland, Baltimore County

In scientific research involving atmospheric physics and satellite data, administering, processing, and manipulating the data can take a considerable amount of time and resources depending on the size of the project. Because satellite data files contain many variables, even a basic use case involves a tremendous volume of data, and computation is correspondingly slow. One common approach taken by scientific researchers and developers is to apply more computing resources and to parallelize the processing, for example at the file level or the day level, which drastically reduces processing time. Subsampling, however, is known to shorten the processing time even further while remaining suitable for many scientific studies and experiments. In this thesis, subsampling is tested and proposed as an approach to radically decrease processing time. Experimental results show that Xarray, a modern Python package, provides sufficient support for processing large volumes of data in a short time. We process one month of satellite data, about 1.154 TB (terabytes) in total: 8928 HDF files of the MYD03 MODIS dataset (357.23 GB) and 8928 HDF files of the MYD06_L2 MODIS dataset (797.71 GB).
We evaluate the cloud property variables by aggregating Level-2 data to the Level-3 format, which we achieve through two primary approaches: subsampling and parallel processing. Our experiments show that, combined with parallel computing on multiple compute nodes through Xarray and Dask, the subsampling technique can dramatically reduce execution time with little to no loss in the final computed data. The code for this research is available in the 'masters-theses' repository on the 'saviokay' GitHub account: https://github.com/saviokay/masters-theses .
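The full implementation lives in the repository linked above. As a rough, self-contained illustration of the subsampling idea only (hypothetical function names and synthetic data standing in for a MODIS granule; no HDF I/O, Xarray, or Dask involved), keeping every N-th pixel along each dimension cuts the work by roughly N² while closely approximating the full-resolution aggregate:

```python
import numpy as np

def aggregate_mean(values, stride=1):
    """Mean of a 2-D cloud-property field, optionally subsampled.

    stride=1 uses every pixel; stride=N keeps every N-th pixel
    along both dimensions, reducing the data volume ~N**2-fold.
    """
    sub = values[::stride, ::stride]
    return float(np.nanmean(sub))

# Synthetic field with the approximate shape of a MODIS granule.
rng = np.random.default_rng(0)
granule = rng.normal(loc=0.5, scale=0.1, size=(2030, 1354))

full = aggregate_mean(granule)            # all ~2.7M pixels
fast = aggregate_mean(granule, stride=3)  # ~1/9 of the pixels
```

For a smooth geophysical field, `fast` differs from `full` by a small sampling error, which is the trade-off the thesis quantifies at scale with real MYD03/MYD06_L2 data.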