they must also be easy to set up and use; otherwise, they might be a source of design and configuration errors. It is well known that configuration errors are a common cause of grid failures (Medeiros et al., 2003). The aforementioned requirements are hard to meet as a whole and, to the best of our knowledge, no existing work in the literature addresses all of them. This is not surprising, as some goals, e.g., a reliable monitoring network and low network bandwidth consumption, are inherently conflicting. Nevertheless, they are all real issues that appear in large-scale opportunistic grids, and reliable grid applications are expected to deal with them in a realistic setting.

**4.1.1 InteGrade failure detection**

The InteGrade failure detection service (Filho et al., 2008) includes a number of features that, when combined and appropriately tuned, address all the above challenges while adopting reasonable compromises for the ones that conflict. The most noteworthy features of the proposed failure detector are the following: (i) a gossip- or infection-style approach (van Renesse et al., 1998), meaning that the network load imposed by the failure detector scales well with the number of processes in the network and that the monitoring network is highly reliable and decentralized; (ii) self-adaptation and self-organization in the face of changing network conditions; (iii) a crash-recovery failure model, instead of a simple crash model; (iv) ease of use and configuration; (v) low resource consumption (memory, CPU cycles, and network bandwidth).

InteGrade's failure detection service is completely decentralized and runs on every grid node. Each process in the monitoring network established by the failure detection service is monitored by *K* other processes, where *K* is an administrator-defined parameter. This means that for a process failure to go undetected, all *K* processes monitoring it would need to fail at the same time. A process *r* that is monitored by a process *s* keeps an open TCP connection with it, through which it sends heartbeat messages and other kinds of information. If *r* perceives that it is being monitored by more than *K* processes, it can cancel the monitoring relationship with one or more randomly chosen processes. If it is monitored by fewer than *K* processes, it can select one or more random processes to start monitoring it. This approach yields considerable gains in reliability (Filho et al., 2009) at a very low cost in terms of extra control messages.
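
The sketch below, in which the function and message names are hypothetical, illustrates this self-organizing step: a process keeps the set of peers monitoring it as close as possible to *K*.

```python
import random

def adjust_monitors(monitors, known_processes, k, send):
    """Illustrative sketch: keep the set of processes monitoring this node close to K.

    `monitors` and `known_processes` are sets of process identifiers; `send` stands in
    for the middleware call that delivers a control message to a peer.
    """
    monitors = set(monitors)

    # Monitored by more than K processes: cancel randomly chosen monitoring relationships.
    while len(monitors) > k:
        dropped = random.choice(sorted(monitors))
        monitors.discard(dropped)
        send(dropped, "STOP_MONITORING")          # hypothetical control message

    # Monitored by fewer than K processes: ask randomly chosen peers to start monitoring us.
    candidates = list(set(known_processes) - monitors)
    random.shuffle(candidates)
    while len(monitors) < k and candidates:
        chosen = candidates.pop()
        send(chosen, "START_MONITORING")          # the peer opens the heartbeat TCP connection
        monitors.add(chosen)

    return monitors
```

Run periodically at every node, this simple rule keeps each process monitored by exactly *K* peers whenever enough peers are known, which is what makes an undetected failure require *K* simultaneous crashes.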

InteGrade's failure detector automatically adapts to changing network conditions. Instead of using a fixed timeout to determine the failure of a process, it continuously outputs the probability that a process has failed based on the inter-arrival times of the last *W* heartbeats and the time elapsed since the last heartbeat was received, where *W* is an administrator-defined parameter. The failure detector can then be configured to take recovery actions whenever the failure probability reaches a certain threshold. Multiple thresholds can be set, each one triggering a different recovery action, depending on the application requirements.
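
The concrete estimator used by InteGrade is not reproduced here; the sketch below only illustrates the general idea of turning the last *W* heartbeat inter-arrival times and the current silence into a failure probability, under the simplifying assumption that inter-arrival times are normally distributed.

```python
import math
import statistics

def failure_probability(heartbeat_times, now, w):
    """Probability that the monitored process has failed (illustrative sketch).

    `heartbeat_times` are the arrival times of past heartbeats; only the last W
    inter-arrival intervals are used, plus the time elapsed since the last heartbeat.
    """
    recent = heartbeat_times[-(w + 1):]             # enough samples for W intervals
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    if len(intervals) < 2:
        return 0.0                                  # not enough history to estimate

    mean = statistics.mean(intervals)
    stdev = statistics.pstdev(intervals) or 1e-9    # avoid division by zero
    silence = now - recent[-1]                      # time since the last heartbeat

    # Under the normal model, this is the chance that a live process would already
    # have sent its next heartbeat; the longer the silence, the closer it gets to 1.
    z = (silence - mean) / stdev
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Thresholds on this value (the numbers themselves are application-dependent) can then be mapped to the different recovery actions mentioned above.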

InteGrade employs a reactive and explicit approach to disseminate information about failed processes. This means that once a process learns about a new failure, it automatically sends this information to *J* randomly chosen processes that it monitors or that monitor it. The administrator-defined parameter *J* dictates the speed of dissemination. According to Ganesh et al. (2003), for a system with *N* processes, if each process disseminates a piece of information to (log *N*) + *c* randomly chosen processes, the probability that the information does not reach every process falls exponentially with *c*.
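
A compact sketch of this reactive dissemination step is given below; the function and message names are illustrative rather than InteGrade's actual API.

```python
import random

def disseminate_failure(failed_id, neighbours, j, send, already_seen):
    """Forward news of a failed process to J randomly chosen neighbours (sketch).

    `neighbours` are the processes this node monitors or is monitored by; `send`
    stands in for the middleware call that delivers the notification.
    """
    if failed_id in already_seen:
        return                                       # do not re-forward known failures
    already_seen.add(failed_id)

    targets = random.sample(sorted(neighbours), min(j, len(neighbours)))
    for peer in targets:
        send(peer, ("PROCESS_FAILED", failed_id))    # each receiver repeats this procedure
```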

**4.2 Application failure handling**

The main techniques used to provide application execution fault tolerance can be divided into two levels: the task level and the workflow level (Hwang & Kesselman, 2003). At the task level, fault-tolerance techniques apply recovery mechanisms directly to the tasks in order to mask the effects of failures. At the workflow level, recovery procedures are built directly into the workflow execution control.
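
As a toy illustration of the distinction (not tied to any particular grid middleware), a task-level mechanism masks a failure by acting on the task itself, e.g., retrying it, whereas a workflow-level mechanism reacts inside the execution control, e.g., by switching to an alternative task for the same step.

```python
def run_with_retry(task, max_retries=3):
    """Task-level recovery: mask the failure by re-executing the task itself."""
    for _ in range(max_retries):
        try:
            return task()
        except RuntimeError:
            continue                       # retry the same task, hiding the failure
    raise RuntimeError("task failed after retries")

def run_step(alternative_tasks):
    """Workflow-level recovery: the execution control reacts to the failure,
    here by falling back to an alternative task for the same workflow step."""
    for task in alternative_tasks:
        try:
            return task()
        except RuntimeError:
            continue                       # the workflow engine chooses another way forward
    raise RuntimeError("no alternative task succeeded")
```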
