**4.1 Failure detection**

Failure detection is an essential service for large-scale opportunistic grids. The high rate of churn makes failures a frequent event, and the ability of the grid infrastructure to deal with them efficiently has a direct impact on its ability to make progress. Hence, failed nodes should be detected quickly, and the monitoring network should itself be reliable, so that a node failure does not go undetected. At the same time, due to the scale and geographic dispersion of grid nodes, failure detectors should disseminate information about failed nodes as fast and reliably as possible and work correctly even when no process has a globally consistent view of the system. Moreover, the non-dedicated nature of opportunistic grids requires that failure detection solutions be very lightweight in terms of network bandwidth consumption and usage of memory and CPU cycles on resource provider machines. Besides all of these requirements pertaining to the functioning of failure detectors, they must also be easy to set up and use; otherwise, they can become a source of design and configuration errors. It is well known that configuration errors are a common cause of grid failures (Medeiros et al., 2003). These requirements are hard to meet as a whole and, to the best of our knowledge, no existing work in the literature addresses all of them. This is not surprising, as some goals, e.g., a reliable monitoring network and low network bandwidth consumption, are inherently conflicting. Nevertheless, they are all real issues that appear in large-scale opportunistic grids, and reliable grid applications are expected to deal with them in a realistic setting.


**4.1.1 InteGrade failure detection**

The InteGrade failure detection service (Filho et al., 2008) includes a number of features that, when combined and appropriately tuned, address all the above challenges while adopting reasonable compromises for those that conflict. The most noteworthy features of the proposed failure detector are the following: (i) a gossip- or infection-style approach (van Renesse et al., 1998), meaning that the network load imposed by the failure detector scales well with the number of processes and that the monitoring network is highly reliable and decentralized; (ii) self-adaptation and self-organization in the face of changing network conditions; (iii) a crash-recovery failure model, instead of simple crash; (iv) ease of use and configuration; and (v) low resource consumption (memory, CPU cycles, and network bandwidth).

InteGrade's failure detection service is completely decentralized and runs on every grid node. Each process in the monitoring network established by the failure detection service is monitored by *K* other processes, where *K* is an administrator-defined parameter. This means that, for a process failure to go undetected, all *K* processes monitoring it would need to fail at the same time. A process *r* that is monitored by a process *s* keeps an open TCP connection with it, through which *r* sends heartbeat messages and other kinds of information. If *r* perceives that it is being monitored by more than *K* processes, it can cancel the monitoring relationship with one or more randomly chosen processes; if it is monitored by fewer than *K* processes, it can select one or more random processes to start monitoring it. This approach yields considerable gains in reliability (Filho et al., 2009) at a very low cost in terms of extra control messages.
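
Because the monitoring relationships are adjusted locally and at random, the rule is compact enough to sketch. InteGrade itself is implemented in Lua; the Python below is only an illustration of the adjustment just described, and the class and helper names are hypothetical rather than InteGrade's API.

```python
import random

K = 3  # administrator-defined number of monitors per process (assumed value)

class MonitoredProcess:
    """Illustrative view that one process keeps of who is monitoring it."""

    def __init__(self, known_peers):
        self.known_peers = set(known_peers)  # ids learned from heartbeats
        self.monitors = set()                # ids currently monitoring this process

    def adjust_monitors(self):
        # Monitored by more than K processes: cancel randomly chosen relationships.
        while len(self.monitors) > K:
            victim = random.choice(list(self.monitors))
            self.monitors.discard(victim)
            self.send_cancel(victim)
        # Monitored by fewer than K processes: ask random known peers to start monitoring.
        candidates = list(self.known_peers - self.monitors)
        random.shuffle(candidates)
        while len(self.monitors) < K and candidates:
            peer = candidates.pop()
            self.monitors.add(peer)
            self.send_monitor_request(peer)

    # The two calls below stand in for the messages sent over the TCP connections.
    def send_cancel(self, peer):
        print(f"cancel monitoring relationship with {peer}")

    def send_monitor_request(self, peer):
        print(f"ask {peer} to start monitoring this process")
```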

InteGrade's failure detector automatically adapts to changing network conditions. Instead of using a fixed timeout to determine the failure of a process, it continuously outputs the probability that a process has failed based on the inter-arrival times of the last *W* heartbeats and the time elapsed since the last heartbeat was received, where *W* is an administrator-defined parameter. The failure detector can then be configured to take recovery actions whenever the failure probability reaches a certain threshold. Multiple thresholds can be set, each one triggering a different recovery action, depending on the application requirements.
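
The chapter does not detail how the failure probability is derived from the last *W* inter-arrival times, so the sketch below uses one simple empirical estimate, in the spirit of accrual failure detectors: the fraction of recorded inter-arrivals that are shorter than the current silence. The class name, window size, and threshold table are illustrative assumptions, not InteGrade's actual interface.

```python
import time
from collections import deque

W = 100  # administrator-defined number of heartbeat inter-arrival times to keep

class AdaptiveFailureEstimator:
    """Window-based failure-probability estimate for one monitored process."""

    def __init__(self):
        self.intervals = deque(maxlen=W)  # last W heartbeat inter-arrival times
        self.last_heartbeat = None

    def record_heartbeat(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def failure_probability(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is None or not self.intervals:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Fraction of past inter-arrivals shorter than the current silence:
        # the longer the process stays quiet, the closer this gets to 1.
        return sum(1 for d in self.intervals if d < elapsed) / len(self.intervals)

# Multiple thresholds, each mapped to a different recovery action (illustrative).
THRESHOLDS = [(0.90, "schedule an extra replica"), (0.99, "declare the node failed")]

def recovery_actions(prob):
    return [action for threshold, action in THRESHOLDS if prob >= threshold]
```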

InteGrade employs a reactive and explicit approach to disseminate information about failed processes: once a process learns about a new failure, it automatically forwards this information to *J* randomly chosen processes that it monitors or that monitor it. The administrator-defined parameter *J* dictates the speed of dissemination. According to Ganesh et al. (2003), in a system with *n* processes, if each process disseminates a piece of information to (log *n*) + *c* randomly chosen processes, the probability that the information fails to reach every process in the system tends to 1 − exp(−exp(−*c*)) as *n* → ∞. For *J* = 7, this probability is less than 0.001. On the other hand, no explicit action is taken to disseminate information about new processes. Instead, processes get to know about new processes simply by receiving heartbeat messages: each heartbeat that a process *p* sends to a process *q* includes the ids of *K* randomly chosen processes that *p* knows about. In a grid, it is important to disseminate information about failed processes quickly, in order to initiate recovery as soon as possible and, when recovery is not possible and the application has to be re-initiated, to avoid wasting grid resources. Information about new members, on the other hand, is not as urgent, since not knowing about new members does not, in general, keep grid applications from making progress.
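
A gossip step of this kind is small enough to sketch directly. Again this is illustrative Python rather than InteGrade's Lua code; the message formats and the `send` transport hook are assumptions.

```python
import random

J = 7  # administrator-defined fan-out for failure notifications
K = 3  # number of known process ids piggybacked on each heartbeat

def on_failure_notice(failed_id, partners, already_seen, send):
    """Forward a failure notice to J randomly chosen monitoring partners.

    `partners` holds processes we monitor or that monitor us; `send(dest, msg)`
    is an assumed transport hook standing in for the open TCP connections."""
    if failed_id in already_seen:
        return  # gossip each failure only the first time we hear about it
    already_seen.add(failed_id)
    for target in random.sample(list(partners), min(J, len(partners))):
        send(target, ("FAILED", failed_id))

def make_heartbeat(sender_id, known_ids):
    """Build a heartbeat that piggybacks K randomly chosen known process ids."""
    return ("HEARTBEAT", sender_id, random.sample(list(known_ids), min(K, len(known_ids))))
```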

InteGrade's group membership service is implemented in Lua (Ierusalimschy et al., 1996), an extensible and lightweight programming language. Lua makes it easy to use the proposed service from programs written in other programming languages, such as Java, C, and C++. Moreover, it runs on several platforms: we have successfully run the failure detector on Windows XP, Mac OS X, and several flavors of Linux. The entire implementation of the group membership service comprises approximately 80 KB of Lua source code, including comments.

**4.2 Application failure handling**

**4.2.1 Task-level techniques**

The main techniques used to provide fault tolerance for application execution can be divided into two levels: the task level and the workflow level (Hwang & Kesselman, 2003). At the task level, fault-tolerance techniques apply recovery mechanisms directly to the tasks, to mask the effects of failures. At the workflow level, the recovery mechanisms build recovery procedures directly into the workflow execution control.

Three task-level techniques are frequently used in computational grids: retrying, replication, and checkpointing.

*Retrying* is the simplest technique and consists in restarting the execution of the task after a failure. The task can be scheduled on the same resource or on another one. Several scheduling policies can be used, such as FIFO (First-In First-Out), or algorithms that select resources with more computational power, more idleness, or more credibility (regarding security).

*Replication* consists in executing several replicas of the task on different resources, with the expectation that at least one of them finishes its execution successfully. The replicas can be scheduled on machines from the same domain or from different domains. Since network failures can make an entire domain inaccessible, executing replicas in different domains improves reliability. Replication is also useful to guarantee the integrity of task execution results, since defective or malicious nodes can produce erroneous results. To prevent these errors, it is possible to wait until all executing replicas finish and apply a Byzantine agreement algorithm to compare the results (a small voting sketch is given at the end of this subsection).

*Checkpointing* consists in periodically storing the application state so that, after a failure, the application can be restarted and continue its execution from the last saved state. Checkpointing mechanisms can be classified according to how the application state is obtained. There are two main approaches, called system-level and application-level checkpointing.

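
To make that distinction concrete, here is a minimal application-level checkpointing sketch in Python: the task itself decides which state to save and where to resume. The file name, state layout, and checkpoint interval are illustrative assumptions, not a specific grid API.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "task_state.ckpt"  # hypothetical location of the saved state

def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Atomically persist the task's own notion of its state."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a torn checkpoint

def load_checkpoint(path=CHECKPOINT_FILE):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0}  # fresh start

def run_task(items):
    state = load_checkpoint()                  # resume from the last saved state
    for i in range(state["next_item"], len(items)):
        state["partial_sum"] += items[i]       # the actual computation step
        state["next_item"] = i + 1
        if state["next_item"] % 100 == 0:      # periodic checkpoint
            save_checkpoint(state)
    save_checkpoint(state)
    return state["partial_sum"]
```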
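
And, returning to the replication technique described above, a simple majority vote over replica results; this is a lightweight stand-in for a full Byzantine agreement protocol, and the function name and result format are hypothetical.

```python
from collections import Counter

def vote_on_results(replica_results, total_replicas):
    """Accept a result only if a strict majority of all replicas agree on it.

    `replica_results` maps replica id -> returned result. Agreement by a strict
    majority tolerates a minority of crashed, defective, or malicious nodes."""
    if not replica_results:
        return None
    result, votes = Counter(replica_results.values()).most_common(1)[0]
    return result if votes > total_replicas // 2 else None

# Three replicas scheduled in different domains; one returns a corrupted value.
print(vote_on_results({"replica-a": 42, "replica-b": 42, "replica-c": 41}, 3))  # 42
```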