Software Fault Tolerance of Distributed Programs Using Computation Slicing

Neeraj Mittal; Vijay K. Garg

doi:10.1109/ICDCS.2003.1203457

Abstract

Writing correct distributed programs is hard. In spite of extensive testing and debugging, software faults persist even in commercial grade software. Many distributed systems, especially those employed in safety-critical environments, should be able to operate properly even in the presence of software faults. Monitoring the execution of a distributed system, and, on detecting a fault, initiating the appropriate corrective action is an important way to tolerate such faults. This gives rise to the predicate detection problem which involves finding a consistent cut of a distributed computation, if it exists, that satisfies the given global predicate. Detecting a predicate in a computation is, however, an NP-complete problem. To ameliorate the associated combinatorial explosion problem, we introduce the notion of computation slice in our earlier papers [5, 10]. Intuitively, slice is a concise representation of those consistent cuts that satisfy a certain condition. To detect a predicate, rather than searching the state-space of the computation, it is much more efficient to search the state-space of the slice. In this paper, we provide efficient algorithms to compute the slice for several classes of predicates. Our experimental results demonstrate that slicing can lead to an exponential improvement over existing techniques in terms of time and space.

Software Fault Tolerance of Distributed Programs Using Computation Slicing

Authors

Abstract

Related Articles