Improving Application Resilience through Probabilistic Task Replication

dc.contributor.author: Simon, Tyler A.
dc.contributor.author: Dorband, John
dc.date.accessioned: 2025-06-05T14:02:35Z
dc.date.available: 2025-06-05T14:02:35Z
dc.date.issued: 2013-06
dc.description: ACM International Conference on Supercomputing (ICS)
dc.description.abstract: Maintaining performance in a faulty distributed computing environment is a major challenge in the design of future peta- and exascale-class systems. Better defining application resilience as a function of scale is key to developing reliable software systems and programming methodologies. This paper defines the resilience of a task as the survivability of that task (i.e., how likely it is to survive until it completes). Resilience varies directly with mean time to failure (MTTF) and inversely with runtime. We develop an approach for defining a resilience index (RI) for applications running on a system with a fixed MTTF. Our approach, inspired by radioactive decay, defines an application as a collection of tasks, which we model as particles with an exponential decay rate and therefore a measurable half-life. We determine the probability of a given number of task failures for an application using a Poisson distribution over the interval of the task lifetime. Further, we have developed a distributed runtime system, ARRIA, that measures both system reliability and application performance at runtime and schedules and replicates tasks based on the probability of failure and expected runtime. We demonstrate that the resilience index can help to better define the tradeoffs for the designers of future systems and developers of parallel software. Thus, we propose a formulation of application resilience that results in a resilience index. We evaluate some initial and fundamental properties of the resilience index as they relate to application performance on high performance computing systems composed of many components, each with varying degrees of reliability.
dc.description.sponsorship: The work was supported by an NSF IUCRC grant and the Laboratory for Physical Sciences. The authors thank Milton Halem, David Mountain, and John Daly for their comments and insights, which led to this work.
dc.format.extent: 8 pages
dc.genre: conference papers and proceedings
dc.genre: preprints
dc.identifier: doi:10.13016/m27waa-9dw0
dc.identifier.uri: http://hdl.handle.net/11603/38549
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.title: Improving Application Resilience through Probabilistic Task Replication
dc.type: Text
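
The abstract describes tasks as particles with an exponential decay rate, a half-life, and a Poisson-distributed number of failures over the task lifetime. The Python sketch below illustrates those textbook relationships only. It is not the paper's resilience index formula or ARRIA's scheduling policy, neither of which is given in this record. All function names are hypothetical, and the replication rule assumes independent replica failures, which is an assumption made here for illustration.

```python
import math

def survival_probability(runtime, mttf):
    """P(task sees no failure during `runtime`), assuming an exponential
    failure model with rate 1/mttf (textbook model, not the paper's RI)."""
    return math.exp(-runtime / mttf)

def half_life(mttf):
    """Runtime at which the survival probability drops to 1/2."""
    return mttf * math.log(2)

def poisson_failures(k, runtime, mttf):
    """P(exactly k failures during `runtime`) for a Poisson process
    with mean runtime/mttf."""
    lam = runtime / mttf
    return math.exp(-lam) * lam ** k / math.factorial(k)

def replicas_needed(runtime, mttf, target):
    """Smallest number of replicas such that at least one survives with
    probability >= target. Assumes replicas fail independently
    (an illustrative assumption, not taken from the source)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    p_fail = 1.0 - survival_probability(runtime, mttf)
    if p_fail <= 0.0:
        return 1  # the task cannot fail; one copy is enough
    if p_fail >= 1.0:
        raise ValueError("task fails with certainty; replication cannot help")
    # Need p_fail ** n <= 1 - target; both logarithms are negative.
    n = math.ceil(math.log(1.0 - target) / math.log(p_fail))
    return max(1, n)

if __name__ == "__main__":
    # Hypothetical numbers: a 2-hour task on a system with a 10-hour MTTF.
    print(survival_probability(2.0, 10.0))
    print(half_life(10.0))
    print(poisson_failures(1, 2.0, 10.0))
    print(replicas_needed(2.0, 10.0, 0.999))
```

Under this model, a task whose runtime equals the MTTF survives with probability of about 0.37, which is why the number of replicas needed grows quickly as runtime approaches or exceeds the MTTF.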

Files

Original bundle

Name: SimonDorband.pdf
Size: 322.36 KB
Format: Adobe Portable Document Format