Benchmarking of Parallel Climate Data Aggregation in a Distributed Environment

Author/Creator

Author/Creator ORCID

Date

2019-01-01

Department

Information Systems

Program

Information Systems

Citation of Original Publication

Rights

Access limited to the UMBC community. Item may be obtainable via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Subjects

Abstract

In atmospheric physics, cloud fraction is derived from the coverage of clouds, the frequency of their occurrence, and the evaluation of different cloud properties. Climate data obtained from the MODIS (Moderate Resolution Imaging Spectroradiometer) instrument aboard satellites are averaged to produce cloud fraction on daily and monthly scales in order to determine cloud properties. The volume of data is vast, so computing cloud fraction involves tedious calculations and long run times. By introducing Big Data platforms in this area, and with the help of features such as data aggregation and data parallelization, results can be obtained faster, with an effective reduction in computation time. This thesis is one such project: we use Python frameworks such as Pandas and Dask to perform the level-2 to level-3 data aggregation and compute the cloud property results. The computations are run on parallel nodes, gradually increasing the number of nodes used (1, 2, 3, etc.), while monitoring performance and comparing the time taken by the different frameworks to compute the results. Our day-level experiments use close to 576 MODIS data files, while all other experiments use 100 files, and we use Dask for parallel processing. Several of Dask's interfaces, including dask.dataframe, dask.delayed, and the Dask distributed cluster methods, are used to achieve the parallelization. Our results demonstrate effective approaches to, and the importance of, parallel computing across distributed clusters.
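The following minimal sketch illustrates the kind of level-2 to level-3 aggregation the abstract describes, using dask.delayed. It is not the thesis code. It assumes a 1-degree global grid, a binary cloudy/clear flag per pixel, and synthetic random pixels in place of real MODIS files. The function names (`make_synthetic_granule`, `granule_counts`, `cloud_fraction`) are hypothetical and introduced only for illustration.

```python
import numpy as np
import dask
from dask import delayed

# Assumption: a 1-degree global latitude/longitude grid.
N_LAT, N_LON = 180, 360


def make_synthetic_granule(seed, n_pixels=10_000):
    """Hypothetical stand-in for reading one level-2 file.

    Returns pixel latitudes, longitudes, and a 0/1 cloudy flag.
    """
    rng = np.random.default_rng(seed)
    lat = rng.uniform(-90.0, 90.0, n_pixels)
    lon = rng.uniform(-180.0, 180.0, n_pixels)
    cloudy = rng.integers(0, 2, n_pixels)
    return lat, lon, cloudy


def granule_counts(seed):
    """Bin one granule's pixels into grid cells; return per-cell counts."""
    lat, lon, cloudy = make_synthetic_granule(seed)
    row = np.clip((lat + 90.0).astype(int), 0, N_LAT - 1)
    col = np.clip((lon + 180.0).astype(int), 0, N_LON - 1)
    cell = row * N_LON + col
    total = np.bincount(cell, minlength=N_LAT * N_LON)
    cloud = np.bincount(cell, weights=cloudy, minlength=N_LAT * N_LON)
    return cloud, total


def cloud_fraction(seeds):
    """Aggregate all granules: cloudy pixel count / total pixel count per cell."""
    # Each granule becomes an independent task that Dask can run in parallel.
    tasks = [delayed(granule_counts)(s) for s in seeds]
    results = dask.compute(*tasks)
    cloud = sum(r[0] for r in results)
    total = sum(r[1] for r in results)
    with np.errstate(invalid="ignore", divide="ignore"):
        frac = np.where(total > 0, cloud / total, np.nan)  # NaN where no pixels fell
    return frac.reshape(N_LAT, N_LON)


if __name__ == "__main__":
    grid = cloud_fraction(range(8))
    print(grid.shape, float(np.nanmean(grid)))
```

Because per-cell counts are additive, each file can be reduced independently and the partial counts summed afterwards. With a dask.distributed client active, the same task list would be scheduled across cluster workers rather than local threads.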