Abstract
Computational Grids have the potential to become the main execution platform for high performance and distributed applications. However, such systems are extremely complex and prone to failures. In this paper, we present a survey with the grid community on which several people shared their actual experience regarding fault treatment. The survey reveals that, nowadays, users have to be highly involved in diagnosing failures, that most failures are due to configuration problems (a hint of the area's immaturity), and that solutions for dealing with failures are mainly application-dependent. Going further, we identify two main reasons for this state of affairs. First, grid components that provide high-level abstractions when working, do expose all gory details when broken. Since there are no appropriate mechanisms to deal with the complexity exposed (configuration, middleware, hardware and software issues), users need to be deeply involved in the diagnosis and correction of failures. To address this problem, one needs a way to coordinate different support teams working at the grids different levels of abstraction. Second, fault tolerance schemes today implemented on grids tolerate only crash failures. Since grids are prone to more complex failures, such those caused by heisenbugs, one needs to tolerate tougher failures. Our hope is that the very heterogeneity, that makes a grid a complex environment, can help in the creation of diverse software replicas, a strategy that can tolerate more complex failures.