A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment

Nguyen, Phuong; Simon, Tyler A.; Halem, Milton; Chapman, David; Le, Quang

dc.contributor.author	Nguyen, Phuong
dc.contributor.author	Simon, Tyler A.
dc.contributor.author	Halem, Milton
dc.contributor.author	Chapman, David
dc.contributor.author	Le, Quang
dc.date.accessioned	2025-06-05T14:03:53Z
dc.date.available	2025-06-05T14:03:53Z
dc.date.issued	2012-11
dc.description	2012 IEEE Fifth International Conference on Utility and Cloud Computing
dc.description.abstract	The choice of task scheduler for Hadoop MapReduce applications can have a dramatic effect on job workload latency. The Hadoop Fair Scheduler (FairS) assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. It thus addresses the problem with a FIFO scheduler, where short jobs have to wait for long-running jobs to complete. We show that even under FairS, jobs are still forced to wait significantly when the MapReduce system shares resources equally, because of dependencies between the Map, Shuffle, Sort, and Reduce phases. We propose a Hybrid Scheduler (HybS) algorithm based on dynamic priority in order to reduce the latency of variable-length concurrent jobs while maintaining data locality. The dynamic priorities can accommodate multiple task lengths, job sizes, and job waiting times by applying a greedy fractional knapsack algorithm to the assignment of job tasks to processors. The estimated runtimes of Map and Reduce tasks are provided to the HybS dynamic priorities from historical Hadoop log files. In addition to dynamic priority, we implement a reordering of task processor assignment that accounts for data availability, automatically maintaining the benefits of data locality in this environment. We evaluate our approach by developing a simulator and by running concurrent workloads consisting of the Word-count and Terasort benchmarks and a satellite scientific data processing workload. Our evaluation shows that HybS improves the average response time for the workloads by approximately 2.1x over the Hadoop FairS, with a standard deviation of 1.4x.
dc.description.sponsorship	This work is supported in part by the Center for Hybrid Multicore Productivity Research, UMBC CSEE, and an NSF CORBI grant between CHMPR MC2 and CHREC GWU. Thanks also to Navid Golpayegani for his work building our initial Eucalyptus cloud testbed on the CHMPR bluegrit cluster.
dc.description.uri	https://ieeexplore.ieee.org/document/6424941/
dc.format.extent	7 pages
dc.genre	conference papers and proceedings
dc.genre	preprints
dc.identifier	doi:10.13016/m2xy5m-q9xr
dc.identifier.citation	Nguyen, Phuong, Tyler Simon, Milton Halem, David Chapman, and Quang Le. “A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment.” In 2012 IEEE Fifth International Conference on Utility and Cloud Computing, 161–67, 2012. https://doi.org/10.1109/UCC.2012.32.
dc.identifier.uri	https://doi.org/10.1109/UCC.2012.32
dc.identifier.uri	http://hdl.handle.net/11603/38769
dc.language.iso	en_US
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof	UMBC Student Collection
dc.rights	© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.subject	Runtime
dc.subject	Heuristic algorithms
dc.subject	MapReduce
dc.subject	UMBC Accelerated Cognitive Cybersecurity Lab
dc.subject	UMBC Ebiquity Research Group
dc.subject	dynamic priority
dc.subject	Dynamic scheduling
dc.subject	Time factors
dc.subject	workflow
dc.subject	Hadoop
dc.subject	Scheduler
dc.subject	Benchmark testing
dc.subject	Scheduling algorithms
dc.subject	scheduling
dc.subject	UMBC College of Engineering and Information Technology Center for Accelerated Real Time Analytics
dc.title	A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment
dc.type	Text
dcterms.creator	https://orcid.org/0000-0002-4862-8396
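
The abstract describes assigning task slots to jobs with a greedy fractional knapsack driven by dynamic priorities. The sketch below illustrates that general idea only and is not the authors' implementation: the `Job` fields, the `priority` formula, and the `assign_slots` function are hypothetical names and weightings chosen for illustration, since the record does not give the paper's actual priority function. Each task is assumed to cost one slot, so a job can be served in part when slots run out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    wait_time: float      # seconds the job has been queued
    est_task_time: float  # estimated per-task runtime (e.g. from past logs)
    pending_tasks: int    # tasks still waiting for a slot


def priority(job: Job) -> float:
    # Hypothetical weighting (an assumption, not the paper's formula):
    # favor jobs that have waited longer and whose tasks are shorter.
    return (1.0 + job.wait_time) / job.est_task_time


def assign_slots(jobs: list[Job], free_slots: int) -> dict[str, int]:
    """Greedy fractional knapsack over free slots.

    Every task costs one slot, so a job's value density is its per-task
    priority. Jobs are served in decreasing priority order, and the last
    job served may receive only a fraction of the slots it wants.
    """
    allocation: dict[str, int] = {}
    remaining = free_slots
    for job in sorted(jobs, key=priority, reverse=True):
        if remaining == 0:
            break
        take = min(job.pending_tasks, remaining)
        allocation[job.name] = take
        remaining -= take
    return allocation


if __name__ == "__main__":
    queue = [
        Job("long", wait_time=10, est_task_time=100, pending_tasks=8),
        Job("short", wait_time=5, est_task_time=10, pending_tasks=3),
        Job("mid", wait_time=50, est_task_time=30, pending_tasks=4),
    ]
    print(assign_slots(queue, free_slots=6))
```

Because the priority is recomputed from the current waiting time on every call, a job that is passed over in one round ranks higher in the next, which is the sense in which the priority is dynamic here. Data locality reordering, which the abstract also mentions, is not modeled in this sketch.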

Files

Original bundle

Name: HybridSchedulingAlgorithm.pdf
Size: 867.43 KB
Format: Adobe Portable Document Format
