**4. Application execution fault-tolerance**

On opportunistic grids, application execution can fail for several reasons. System failures can result not only from an error in a single component but also from the often complex interactions among the many grid components, which span a range of different services. In addition, grid environments are extremely dynamic, with components joining and leaving the system at all times. The likelihood of an error occurring during execution is further increased by the fact that many grid applications perform long-running tasks that may require several days of computation.

To provide the necessary fault-tolerance functionality for grid environments, several services must be available, including: (a) **failure detection:** grid nodes and applications must be constantly monitored by a failure detection service; (b) **application failure handling:** various failure handling strategies can be employed in grid environments to ensure the continuity of application execution; and (c) **stable storage:** execution states that allow the pre-failure state of an application to be recovered must be saved in a data repository that can survive grid node failures. These basic services are discussed in the following sections.
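
As a rough illustration of how the three services fit together, the following minimal Python sketch pairs a heartbeat-based failure detector with checkpoint-and-restart recovery. It is not the mechanism of any particular grid middleware: the names `HeartbeatDetector`, `StableStorage`, and `run_task`, the timeout value, and the checkpoint interval are all hypothetical, and an in-memory dictionary stands in for a repository that would, in practice, have to survive node failures.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """(a) Failure detection: suspect a node whose last heartbeat is too old."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time at which the node last reported in.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A node is suspected once it has been silent for longer than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


class StableStorage:
    """(c) Stable storage: hypothetical stand-in, here just a dict in memory."""

    def __init__(self) -> None:
        self._checkpoints = {}

    def save(self, task: str, state: dict) -> None:
        self._checkpoints[task] = dict(state)  # store a copy of the state

    def load(self, task: str) -> dict:
        # Return the last checkpoint, or a fresh initial state if none exists.
        return dict(self._checkpoints.get(task, {"step": 0, "total": 0}))


def run_task(task, storage, steps, fail_at=None, checkpoint_every=2):
    """(b) Failure handling: resume from the last checkpoint, then keep going.

    Computes 1 + 2 + ... + steps, one step at a time. `fail_at` simulates a
    node failure just before the given step index is processed.
    """
    state = storage.load(task)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % checkpoint_every == 0:
            storage.save(task, state)
    return state["total"]


if __name__ == "__main__":
    detector = HeartbeatDetector(timeout=5.0)
    detector.heartbeat("node-a", now=0.0)
    detector.heartbeat("node-b", now=4.0)
    print(detector.suspected(now=6.0))  # node-a has been silent too long

    storage = StableStorage()
    try:
        run_task("job-1", storage, steps=5, fail_at=3)
    except RuntimeError:
        # Restart (conceptually on another node) from the saved checkpoint.
        print(run_task("job-1", storage, steps=5))
```

In this sketch the work done after the last checkpoint is lost and recomputed on restart, which is the usual trade-off: a shorter checkpoint interval reduces recomputation at the cost of more frequent writes to stable storage.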
