The performance of independent checkpointing in distributed systems

P. Sens

doi:10.1109/HICSS.1995.375504

28th Hawaii International Conference on System Sciences (HICSS'95)

Year: 1995, Volume: 2, Pages: 525

Authors

P. Sens, MASI Lab., Paris VI Univ., France

Abstract

The paper describes performance measurements of an implementation of independent checkpointing in a network of workstations. Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. Because processes do not coordinate during checkpointing, this technique has a low run-time overhead. To avoid the classical domino effect, our implementation relies on a message logging mechanism. We have measured fault management overhead for different kinds of parallel applications. The costs of checkpointing are very low. However, message logging introduces a sizeable overhead. We compare these results to other works implementing different checkpointing policies, and we show that independent checkpointing is an efficient way to provide fault tolerance for long-running distributed applications composed of processes exchanging small streams of data.
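The combination described in the abstract, uncoordinated checkpoints plus a message log, can be illustrated with a toy model. This is a minimal sketch and not the paper's implementation: the `Process` class, its method names, and the integer-sum state are all hypothetical, and the checkpoint and log are assumed to live in storage that survives a crash.

```python
class Process:
    """Toy process whose state is the running sum of received integers.

    Hypothetical illustration only; not the implementation measured in the paper.
    """

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.checkpoint = 0  # last saved state (stands in for stable storage)
        self.log = []        # messages received since the last checkpoint

    def receive(self, value):
        # Log the message before applying it so it can be replayed after a crash.
        self.log.append(value)
        self.state += value

    def take_checkpoint(self):
        # Independent checkpoint: save local state without contacting other processes.
        self.checkpoint = self.state
        # Messages already reflected in the checkpoint no longer need replaying.
        self.log.clear()

    def crash(self):
        # Volatile state is lost; checkpoint and log are assumed to survive.
        self.state = None

    def recover(self):
        # Restore the last checkpoint, then replay logged messages in order.
        self.state = self.checkpoint
        for value in self.log:
            self.state += value


p = Process("p1")
p.receive(3)
p.take_checkpoint()
p.receive(4)
p.receive(5)
before = p.state
p.crash()
p.recover()
print(p.state == before)
```

Because the process recovers from its own checkpoint and log alone, no other process has to roll back, which is how logging avoids the domino effect. The price is that every `receive` pays for a log write, consistent with the abstract's observation that logging, not checkpointing, dominates the overhead.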

