Abstract
Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. Because a cluster federation comprises a large number of nodes, the probability of a node failure is high. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (a large number of nodes, and high-latency, low-bandwidth networking technologies between clusters). A preliminary performance evaluation using a discrete event simulator shows that the protocol is suitable for code coupling applications.
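The two-level idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the inter-cluster rule, so the sketch assumes a generic index-based communication-induced scheme in which each cluster keeps a checkpoint sequence number, inter-cluster messages piggyback it, and a receiver that is behind takes a forced checkpoint before delivery. The `Cluster` class, its fields, and its method names are hypothetical, and the synchronized intra-cluster checkpoint is modeled as a single atomic step.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of the federation (hypothetical model for illustration)."""
    name: str
    sn: int = 0  # sequence number of the last cluster-wide checkpoint
    log: list = field(default_factory=list)

    def coordinated_checkpoint(self, sn=None, forced=False):
        # Intra-cluster level: all nodes of the cluster save their state
        # together. Assumption: modeled here as one atomic event.
        self.sn = self.sn + 1 if sn is None else sn
        self.log.append(("forced" if forced else "periodic", self.sn))

    def send(self, dest, payload):
        # Inter-cluster messages carry the sender's sequence number.
        dest.receive(self.sn, payload)

    def receive(self, sender_sn, payload):
        # Inter-cluster level (assumed index-based rule): if the sender is
        # ahead, checkpoint before delivering and adopt its sequence number.
        if sender_sn > self.sn:
            self.coordinated_checkpoint(sn=sender_sn, forced=True)
        self.log.append(("deliver", payload))


# Usage: cluster A checkpoints twice, then exchanges messages with B.
a, b = Cluster("A"), Cluster("B")
a.coordinated_checkpoint()
a.coordinated_checkpoint()
a.send(b, "x")  # B is behind, so it takes a forced checkpoint first
b.send(a, "y")  # A is not behind, so the message is delivered directly
print(b.log)
print(a.log)
```

In this sketch, clusters never exchange synchronization messages solely to checkpoint; a forced checkpoint is triggered only by application traffic, which is the property that makes communication-induced schemes attractive over high-latency inter-cluster links.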
Additional Information
Index Terms: Cluster Federation, Checkpointing and Recovery, Fault Tolerance, Parallel Application, Code Coupling
Citation: Sebastien Monnet, Christine Morin, Ramamurthy Badrinath, "A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations," 18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Workshop 11, p. 211a, 2004.