Kepler + Hadoop: a general architecture facilitating data-intensive applications in scientific workflow systems

Citation of Original Publication

Wang, Jianwu, Daniel Crawl, and Ilkay Altintas. “Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems.” In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, 1–8. WORKS ’09. New York, NY, USA: Association for Computing Machinery, 2009. https://doi.org/10.1145/1645164.1645176.

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless it is covered by a Creative Commons license, contact the copyright holder or the author for uses protected by copyright law.

Abstract

MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto open-source implementation, Hadoop, support parallel processing on large datasets with capabilities including automatic data partitioning and distribution, load balancing, and fault tolerance. Meanwhile, scientific workflow management systems, e.g., Kepler, Taverna, Triana, and Pegasus, have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. By integrating Hadoop with Kepler, we provide an easy-to-use architecture that enables users to compose and execute MapReduce applications in Kepler scientific workflows. Our implementation demonstrates that many characteristics of scientific workflow management systems, e.g., a graphical user interface and component reuse and sharing, are very complementary to those of MapReduce. Using the presented Hadoop components in Kepler, scientists can easily apply MapReduce to their domain-specific problems and connect these applications with other tasks in a workflow through the Kepler graphical user interface. We validate the feasibility of our approach via a word count use case.
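
For readers unfamiliar with the word count use case the abstract mentions, the following is a minimal sketch, written in Java against the standard org.apache.hadoop.mapreduce API (Hadoop 2.x+), of the classic word-count MapReduce job. It is not code from the paper or its Kepler components; class and variable names are illustrative. The map phase emits a (word, 1) pair per token, and the reduce phase sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the partial counts gathered for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In the architecture the abstract describes, such map and reduce logic would be composed and connected to other workflow tasks through the presented Hadoop components in the Kepler graphical user interface rather than written by hand; the sketch above only illustrates the underlying MapReduce pattern that Hadoop executes.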