LIBFLOW: A PLATFORM TO SCHEDULE AND MANAGE WORKFLOWS USING DAGS

Author/Creator

Author/Creator ORCID

Date

2019-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

With user bases growing year over year, Internet companies are collecting user data on a massive scale. This raw data is in turn used to generate insights that help those companies perform better. Because use cases vary, companies typically use different data stores for different kinds of data. For example, Apache Hive is often used for large-scale bulk data processing, while Amazon Redshift is used for fast, real-time analytical queries. Thus, owing to varied business needs and the increasing complexity of the underlying data, companies are moving away from the traditional one-size-fits-all data warehousing solution. The heterogeneous nature of these platforms' APIs makes it difficult for data engineers to write a series of transformations that process data from various sources. In this work, we propose a platform that helps data engineers easily write workflows to process large-scale data across multiple data warehouses without much rudimentary groundwork. To address data dependency issues, the platform uses directed acyclic graphs (DAGs) to define workflows and Johnson's algorithm to detect elementary cycles.
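
As a rough illustration of the dependency handling described in the abstract, the sketch below models a workflow as a mapping from each task to its downstream tasks, derives a run order with Kahn's algorithm, and lists any elementary cycles that would make the workflow invalid. It is not the platform's implementation: the task names and function names are hypothetical, and the cycle enumeration uses plain backtracking in place of Johnson's algorithm, which finds the same cycles more efficiently on large graphs.

```python
from collections import deque


def all_tasks(edges):
    """Collect every task named as a key or as a downstream target."""
    tasks = set(edges)
    for targets in edges.values():
        tasks.update(targets)
    return sorted(tasks)


def topological_order(edges):
    """Kahn's algorithm: a valid run order, or None if the graph has a cycle."""
    tasks = all_tasks(edges)
    indegree = {t: 0 for t in tasks}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in edges.get(task, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order if len(order) == len(tasks) else None


def elementary_cycles(edges):
    """Enumerate elementary cycles by backtracking (not Johnson's algorithm).

    Each cycle is reported once, rooted at its smallest task name.
    """
    cycles = []

    def extend(start, path, on_path):
        for nxt in edges.get(path[-1], []):
            if nxt == start:
                cycles.append(list(path))
            elif nxt > start and nxt not in on_path:
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in all_tasks(edges):
        extend(start, [start], {start})
    return cycles


# Hypothetical workflow: two extracts feed one transform, which feeds a load.
workflow = {
    "extract_hive": ["transform"],
    "extract_redshift": ["transform"],
    "transform": ["load"],
}
# Same workflow with an invalid back edge from the load to an extract.
broken = dict(workflow, load=["extract_hive"])
```

Calling `topological_order(workflow)` yields an order in which every task runs after its upstream tasks, while `topological_order(broken)` returns `None` and `elementary_cycles(broken)` names the offending tasks so they can be reported to the workflow author.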