Improving Application Resilience through Probabilistic Task Replication
| dc.contributor.author | Simon, Tyler A. | |
| dc.contributor.author | Dorband, John | |
| dc.date.accessioned | 2025-06-05T14:02:35Z | |
| dc.date.available | 2025-06-05T14:02:35Z | |
| dc.date.issued | 2013-06 | |
| dc.description | ACM International Conference on Supercomputing (ICS) | |
| dc.description.abstract | Maintaining performance in a faulty distributed computing environment is a major challenge in the design of future peta- and exascale-class systems. Better defining application resilience as a function of scale is key to developing reliable software systems and programming methodologies. This paper defines the resilience of a task as the survivability of that task (i.e., how well it will survive until it completes). Resilience varies directly with mean time to failure (MTTF) and inversely with runtime. We develop an approach for defining a resilience index (RI) for applications running on a system with a fixed MTTF. Our approach, inspired by radioactive decay, defines an application as a collection of tasks, which we model as particles with an exponential decay rate and therefore a measurable half-life. We determine the probability of a given number of task failures for an application using a Poisson distribution over the interval of the task lifetime. Further, we have developed a distributed runtime system, ARRIA, that measures both system reliability and application performance at runtime and that schedules and replicates tasks based on the probability of failure and expected runtime. We demonstrate that the resilience index can help to better define the tradeoffs for the designers of future systems and developers of parallel software. Thus, we propose a formulation of application resilience that results in a resilience index. We evaluate some initial and fundamental properties of the resilience index as they relate to application performance on high performance computing systems composed of many components, each with varying degrees of reliability. | |
| dc.description.sponsorship | The work was supported by an NSF IUCRC grant and the Laboratory for Physical Sciences. The authors thank Milton Halem, David Mountain, and John Daly for their comments and insights, which led to this work. | |
| dc.format.extent | 8 pages | |
| dc.genre | conference papers and proceedings | |
| dc.genre | preprints | |
| dc.identifier | doi:10.13016/m27waa-9dw0 | |
| dc.identifier.uri | http://hdl.handle.net/11603/38549 | |
| dc.language.iso | en_US | |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
| dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department | |
| dc.relation.ispartof | UMBC Faculty Collection | |
| dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
| dc.title | Improving Application Resilience through Probabilistic Task Replication | |
| dc.type | Text |
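
The decay analogy in the abstract can be illustrated with a minimal sketch. It assumes a constant failure rate of 1/MTTF, so that a task's survival probability is exp(-runtime/MTTF) and the number of failures over the task lifetime is Poisson with mean runtime/MTTF. The abstract does not give the paper's exact formula for the resilience index, and the replication rule below is not ARRIA's actual policy; all function names are hypothetical and serve only to show how the quantities relate.

```python
import math


def survival_probability(runtime: float, mttf: float) -> float:
    """Probability that a task runs to completion without a failure.

    Assumption: failures arrive at a constant rate 1/mttf (exponential
    decay), so survival rises with MTTF and falls with runtime.
    """
    if runtime < 0 or mttf <= 0:
        raise ValueError("runtime must be >= 0 and mttf must be > 0")
    return math.exp(-runtime / mttf)


def half_life(mttf: float) -> float:
    """Runtime at which a task has a 50% chance of having failed."""
    if mttf <= 0:
        raise ValueError("mttf must be > 0")
    return mttf * math.log(2)


def failure_count_pmf(k: int, runtime: float, mttf: float) -> float:
    """Poisson probability of exactly k failures over the task lifetime."""
    if k < 0:
        raise ValueError("k must be >= 0")
    if runtime < 0 or mttf <= 0:
        raise ValueError("runtime must be >= 0 and mttf must be > 0")
    lam = runtime / mttf  # expected number of failures during the task
    return math.exp(-lam) * lam ** k / math.factorial(k)


def replicas_needed(runtime: float, mttf: float, target: float) -> int:
    """Smallest replica count so that at least one copy survives with
    probability >= target.

    Assumption (not from the source): replicas fail independently.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    p_fail = 1.0 - survival_probability(runtime, mttf)
    n = 1
    while 1.0 - p_fail ** n < target:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical numbers: a 1-hour task on a system with a 10-hour MTTF.
    print(survival_probability(1.0, 10.0))
    print(half_life(10.0))
    print(failure_count_pmf(1, 1.0, 10.0))
    print(replicas_needed(1.0, 10.0, 0.99))
```

Under these assumptions, the probability of zero Poisson failures equals the survival probability, which is the link between the decay model and the failure-count distribution that the abstract describes.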
