Scalable and Flexible Two-Phase Ensemble Algorithms for Causality Discovery
Date
2021-11-15
Citation of Original Publication
Pei Guo, Yiyi Huang, Jianwu Wang. Scalable and Flexible Two-Phase Ensemble Algorithms for Causality Discovery. Big Data Research, vol. 26, no. 100252, November 2021. DOI:10.1016/j.bdr.2021.100252
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution 4.0 International (CC BY 4.0)
Abstract
Causality study investigates cause-effect relationships among different variables of a system and has been
widely used in many disciplines including climatology and neuroscience. To discover causal relationships,
many data-driven causality discovery methods, e.g., Granger causality, PCMCI and Dynamic Bayesian
Network, have been proposed. Many of these causality discovery approaches mine time-series data and
generate a directed causality graph where each graph edge denotes a cause-effect relationship between
the two connected graph nodes. Our benchmarking of different causality discovery approaches with
real-world climate data shows that these approaches often generate quite different causality results from the same
input dataset due to differences in their internal learning mechanisms. Meanwhile, the volume of
available data keeps growing in virtually every discipline, which makes it increasingly difficult to use existing
causality discovery algorithms to produce causality results within reasonable time. To address these two
challenges, this paper utilizes data partitioning and ensemble techniques, and proposes a flexible
two-phase causality ensemble framework. The framework first conducts phase 1 ensemble for partitioned
data and then conducts phase 2 ensemble from phase 1 ensemble results. Based on the framework, we
develop two ensemble approaches: i) data ensemble at phase 1 and algorithm ensemble at phase 2,
and ii) algorithm ensemble at phase 1 and data ensemble at phase 2. To achieve scalability, we further
parallelize the ensemble approaches via the Spark big data analytics engine. The proposed ensemble
approaches are evaluated on synthetic and real-world datasets. Our experiments show that the proposed
approaches achieve good accuracy through ensemble and high scalability through data-parallelization in
distributed computing environments.
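
The two orderings of the framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the ensemble rule, so majority voting over directed edges is assumed here, and every name in the block (majority_vote, partition, data_then_algorithm, algorithm_then_data, lag1_detector, report_everything) is hypothetical. The toy detectors only stand in for real methods such as Granger causality or PCMCI, and the sketch runs serially rather than on Spark.

```python
from collections import Counter
from typing import Callable, Iterable, List, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)
Row = Tuple[int, int]   # one time step: (x, y)
Algorithm = Callable[[Sequence[Row]], Set[Edge]]


def majority_vote(edge_sets: Iterable[Set[Edge]], threshold: float = 0.5) -> Set[Edge]:
    """Keep an edge if more than `threshold` of the voters report it (assumed rule)."""
    voters = list(edge_sets)
    if not voters:
        return set()
    counts = Counter(edge for edges in voters for edge in edges)
    return {edge for edge, n in counts.items() if n / len(voters) > threshold}


def partition(rows: Sequence[Row], n_parts: int) -> List[Sequence[Row]]:
    """Split the series into contiguous chunks so temporal order is preserved."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def data_then_algorithm(rows, algorithms, n_parts):
    """Approach i: data ensemble in phase 1, algorithm ensemble in phase 2."""
    parts = partition(rows, n_parts)
    # Phase 1: for each algorithm, vote across its per-partition graphs.
    phase1 = [majority_vote(algo(p) for p in parts) for algo in algorithms]
    # Phase 2: vote across the per-algorithm graphs.
    return majority_vote(phase1)


def algorithm_then_data(rows, algorithms, n_parts):
    """Approach ii: algorithm ensemble in phase 1, data ensemble in phase 2."""
    parts = partition(rows, n_parts)
    # Phase 1: for each partition, vote across the algorithms' graphs.
    phase1 = [majority_vote(algo(p) for algo in algorithms) for p in parts]
    # Phase 2: vote across the per-partition graphs.
    return majority_vote(phase1)


def lag1_detector(cause: int, effect: int, names=("x", "y")) -> Algorithm:
    """Toy stand-in for a real method: report cause -> effect when the effect
    column repeats the cause column from the previous step in over 80% of steps."""
    def detect(rows):
        pairs = list(zip(rows, rows[1:]))
        hits = sum(1 for prev, cur in pairs if cur[effect] == prev[cause])
        if pairs and hits / len(pairs) > 0.8:
            return {(names[cause], names[effect])}
        return set()
    return detect


def report_everything(rows):
    """Toy over-eager detector that always reports both directions."""
    return {("x", "y"), ("y", "x")}


# Toy series in which y copies x with a lag of one step.
x = [0, 0, 1, 1] * 3
y = [0] + x[:-1]
rows = list(zip(x, y))
algorithms = [lag1_detector(0, 1), lag1_detector(1, 0), report_everything]

print(data_then_algorithm(rows, algorithms, n_parts=3))
print(algorithm_then_data(rows, algorithms, n_parts=3))
```

On this toy series both orderings keep only the edge x -> y, because the spurious y -> x edge is reported by just one of the three detectors and is voted out. The per-partition and per-algorithm calls in phase 1 are independent of each other, which is the kind of step that could be distributed across workers in an engine such as Spark.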