Scalable and Flexible Two-Phase Ensemble Algorithms for Causality Discovery

Date

2021-11-15

Citation of Original Publication

Pei Guo, Yiyi Huang, Jianwu Wang. Scalable and Flexible Two-Phase Ensemble Algorithms for Causality Discovery. Big Data Research, vol. 26, no. 100252, November 2021. DOI:10.1016/j.bdr.2021.100252

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution 4.0 International (CC BY 4.0)

Abstract

Causality study investigates cause-effect relationships among different variables of a system and has been widely used in many disciplines including climatology and neuroscience. To discover causal relationships, many data-driven causality discovery methods, e.g., Granger causality, PCMCI and Dynamic Bayesian Network, have been proposed. Many of these causality discovery approaches mine time-series data and generate a directed causality graph where each graph edge denotes a cause-effect relationship between the two connected graph nodes. Our benchmarking of different causality discovery approaches with real-world climate data shows that these approaches often generate quite different causality results from the same input dataset because their internal learning mechanisms differ. Meanwhile, the volume of available data keeps growing in virtually every discipline, which makes it increasingly difficult for existing causality discovery algorithms to produce results within a reasonable time. To address these two challenges, this paper utilizes data partitioning and ensemble techniques, and proposes a flexible two-phase causality ensemble framework. The framework first conducts a phase 1 ensemble on partitioned data and then conducts a phase 2 ensemble on the phase 1 ensemble results. Based on the framework, we develop two ensemble approaches: i) data ensemble at phase 1 and algorithm ensemble at phase 2, and ii) algorithm ensemble at phase 1 and data ensemble at phase 2. To achieve scalability, we further parallelize the ensemble approaches via the Spark big data analytics engine. The proposed ensemble approaches are evaluated on synthetic and real-world datasets. Our experiments show that the proposed approaches achieve good accuracy through ensemble and high scalability through data parallelization in distributed computing environments.
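
The two orderings described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each causality discovery algorithm returns a set of directed (cause, effect) edges and that an ensemble is a simple majority vote over edge sets; the abstract does not specify the voting rule. The names `vote`, `data_then_algorithm`, `algorithm_then_data` and the three toy algorithms are hypothetical. The sketch runs sequentially in plain Python, whereas the paper parallelizes the work with Spark.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence, Set, Tuple

Edge = Tuple[str, str]                    # (cause, effect)
Algorithm = Callable[[object], Set[Edge]]  # maps one data partition to an edge set


def vote(edge_sets: Iterable[Set[Edge]], threshold: float = 0.5) -> Set[Edge]:
    """Assumed ensemble rule: keep edges found in more than `threshold` of the sets."""
    edge_sets = list(edge_sets)
    if not edge_sets:
        return set()
    counts = Counter(edge for edges in edge_sets for edge in edges)
    return {edge for edge, n in counts.items() if n / len(edge_sets) > threshold}


def data_then_algorithm(partitions: Sequence[object],
                        algorithms: Sequence[Algorithm]) -> Set[Edge]:
    """Approach i): data ensemble at phase 1, algorithm ensemble at phase 2."""
    # Phase 1: for each algorithm, combine its results across all partitions.
    phase1 = [vote(alg(part) for part in partitions) for alg in algorithms]
    # Phase 2: combine the per-algorithm results.
    return vote(phase1)


def algorithm_then_data(partitions: Sequence[object],
                        algorithms: Sequence[Algorithm]) -> Set[Edge]:
    """Approach ii): algorithm ensemble at phase 1, data ensemble at phase 2."""
    # Phase 1: for each partition, combine the results of all algorithms.
    phase1 = [vote(alg(part) for alg in algorithms) for part in partitions]
    # Phase 2: combine the per-partition results.
    return vote(phase1)


# Toy stand-ins for real algorithms such as Granger causality or PCMCI.
# Each one ignores the data and returns fixed edges keyed on the partition id.
def alg_a(part):
    return {("X", "Y"), ("Y", "Z")}


def alg_b(part):
    return {("X", "Y"), ("Y", "Z")} if part == 0 else {("X", "Y")}


def alg_c(part):
    return {("X", "Y"), ("Z", "X")} if part == 0 else {("X", "Y"), ("Y", "Z")}


partitions = [0, 1, 2]  # placeholder partition ids instead of real time-series chunks
algorithms = [alg_a, alg_b, alg_c]

result_i = data_then_algorithm(partitions, algorithms)
result_ii = algorithm_then_data(partitions, algorithms)
print(sorted(result_i))
print(sorted(result_ii))
```

In this toy input both orderings keep the same edges, but in general they can differ, because the order of the two votes determines which disagreements are filtered out first.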