LIBFLOW: A PLATFORM TO SCHEDULE AND MANAGE WORKFLOWS USING DAGS

Author: Panhalkar, Shreyas
Institution: My University
Year: 2019
Keywords: analytics, big data, graphs, workflows
Contributor: Nicholas, Charles
Record dates: 2021-01-29, 2019-01-01
Type: Text
Identifier: 12043
URI: http://hdl.handle.net/11603/20711
Format: application/pdf

Abstract: With user bases growing year over year, Internet companies collect user data on a massive scale. This raw data is then mined for insights that help those companies perform better. Because use cases vary, companies typically keep different kinds of data in different data stores. For example, Apache Hive is often used for large-scale bulk data processing, while Amazon Redshift is used for fast, real-time analytical queries. Thus, owing to varied business needs and the increasing complexity of the underlying data, companies are moving away from the traditional one-for-all data warehousing solution. The heterogeneous APIs of these platforms make it difficult for data engineers to write a series of transformations that process data from various sources. In this work, we propose a platform that helps data engineers easily write workflows to process large-scale data spanning multiple data warehouses, without much rudimentary work. To address data dependency issues, the platform uses Directed Acyclic Graphs to define workflows and Johnson's algorithm to detect elementary cycles.
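
The role of the graph checks described in the abstract can be illustrated with a minimal Python sketch. It is not LibFlow's implementation: the task names are hypothetical, the workflow is assumed to be a plain dictionary mapping each task to its downstream tasks, and cycles are enumerated with a simple depth-first search rather than Johnson's algorithm, which reaches the same result more efficiently by using strongly connected components and blocking sets.

```python
from collections import deque


def _all_nodes(graph):
    # Collect every task, including ones that appear only as a successor.
    return sorted(set(graph) | {v for succs in graph.values() for v in succs})


def elementary_cycles(graph):
    """List elementary cycles of a directed graph {task: [downstream tasks]}.

    Simple DFS enumeration (NOT Johnson's algorithm): each cycle is reported
    once, rooted at its smallest task in sorted order.
    """
    nodes = _all_nodes(graph)
    index = {n: i for i, n in enumerate(nodes)}
    cycles = []

    def dfs(start, node, path, on_path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                cycles.append(path[:])  # closed a cycle back to the root
            elif index[nxt] > index[start] and nxt not in on_path:
                path.append(nxt)
                on_path.add(nxt)
                dfs(start, nxt, path, on_path)
                on_path.discard(nxt)
                path.pop()

    for s in nodes:
        dfs(s, s, [s], {s})
    return cycles


def topological_order(graph):
    """Return a valid execution order (Kahn's algorithm) or raise on a cycle."""
    nodes = _all_nodes(graph)
    indegree = {n: 0 for n in nodes}
    for succs in graph.values():
        for v in succs:
            indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for v in graph.get(n, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical workflow: two extracts feed one transform, then a load.
workflow = {
    "extract_hive": ["transform"],
    "extract_redshift": ["transform"],
    "transform": ["load"],
    "load": [],
}

# Hypothetical misconfiguration: three tasks that depend on each other.
broken = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

Under these assumptions, `elementary_cycles(workflow)` returns an empty list, so the workflow is a valid DAG and `topological_order(workflow)` yields an order in which every task runs after its upstream tasks. For `broken`, the cycle list names the offending tasks, which is more useful to a data engineer than a bare "cycle detected" error.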