Journal article
Transparent recovery from intermittent faults in time-triggered distributed systems
IEEE transactions on computers, v 52(2)
01 Feb 2003
Featured in Collection: UN Sustainable Development Goals @ Drexel
Abstract
The time-triggered model, with tasks scheduled in static (offline) fashion, provides a high degree of timing predictability in safety-critical distributed systems. Such systems must also tolerate transient and intermittent failures which occur far more frequently than permanent ones. Software-based recovery methods using temporal redundancy, such as task reexecution and primary/backup, while incurring performance overhead, are cost-effective methods of handling these failures. We present a constructive approach to integrating runtime recovery policies in a time-triggered distributed system. Furthermore, the method provides transparent failure recovery in that a processor recovering from task failures does not disrupt the operation of other processors. Given a general task graph with precedence and timing constraints and a specific fault model, the proposed method constructs the corresponding fault-tolerant (FT) schedule with sufficient slack to accommodate recovery. We introduce the cluster-based failure recovery concept which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead. Contingency schedules, also generated offline, revise this FT schedule to mask task failures on individual processors while preserving precedence and timing constraints. We present simulation results which show that, for small-scale embedded systems having task graphs of moderate complexity, the proposed approach generates FT schedules which incur about 30-40 percent performance overhead when compared to corresponding non-fault-tolerant ones.
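As a rough illustration of why slack placement affects overhead, the sketch below compares two ways of reserving re-execution time on a single processor. It is not the paper's algorithm: the `Task` type, the function names, the example task set, and the assumption of at most `k` task failures per cluster are all hypothetical, and precedence constraints, deadlines, and inter-processor communication are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    wcet: int  # worst-case execution time, in abstract time units (assumed)


def per_task_slack_length(tasks, k):
    """Schedule length when every task reserves its own k re-execution slots."""
    return sum(t.wcet * (1 + k) for t in tasks)


def shared_slack_length(tasks, k):
    """Schedule length when a cluster of tasks shares one slack region.

    Assumption: at most k failures occur across the whole cluster, so the
    worst case is k re-executions of the longest task.
    """
    return sum(t.wcet for t in tasks) + k * max(t.wcet for t in tasks)


def overhead_percent(ft_length, base_length):
    """Extra schedule length of the FT schedule relative to the non-FT one."""
    return 100.0 * (ft_length - base_length) / base_length


# Hypothetical three-task cluster, tolerating a single failure (k = 1).
tasks = [Task("sense", 2), Task("filter", 3), Task("control", 5)]
base = sum(t.wcet for t in tasks)
per_task = per_task_slack_length(tasks, 1)
shared = shared_slack_length(tasks, 1)

print(f"non-FT length: {base}")
print(f"per-task slack: {per_task} ({overhead_percent(per_task, base):.0f}% overhead)")
print(f"shared slack:   {shared} ({overhead_percent(shared, base):.0f}% overhead)")
```

Under these assumptions, sharing slack across the cluster needs less reserved time than giving each task its own re-execution slot. The numbers come from the made-up task set and say nothing about the overhead figures reported in the abstract.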
Details
- Title
- Transparent recovery from intermittent faults in time-triggered distributed systems
- Creators
- N Kandasamy - University of Michigan; J P Hayes; B T Murray
- Publication Details
- IEEE transactions on computers, v 52(2)
- Publisher
- IEEE
- Number of pages
- 13
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Electrical and Computer Engineering
- Web of Science ID
- WOS:000180520500003
- Scopus ID
- 2-s2.0-0037331028
- Other Identifier
- 991020546591904721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Collaboration types
- Industry collaboration
- Domestic collaboration
- Web of Science research areas
- Computer Science, Hardware & Architecture
- Engineering, Electrical & Electronic