**4.2.1 Task-level techniques**

Three task-level techniques are frequently used in computational grids: retrying, replication, and checkpointing.

*Retrying* is the simplest technique and consists of restarting the execution of the task after a failure. The task can be scheduled on the same resource or on another one. Several scheduling policies can be used, such as FIFO (First-In First-Out), or algorithms that select resources with more computational power, more idle time, or higher credibility (regarding security).
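
As a minimal illustration, the sketch below restarts a failed task up to a fixed number of times. Here `run_task` is a hypothetical stand-in for the middleware's task launcher and simply fails at random; a real launcher would resubmit the task to a resource chosen by the scheduling policy.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_RETRIES 5

/* Hypothetical stand-in for the middleware's task launcher:
 * returns true on success, false on failure (random here). */
static bool run_task(int attempt)
{
    (void)attempt;
    return rand() % 3 == 0;
}

/* Retrying: restart the task from scratch after each failure,
 * up to a fixed retry limit. */
static bool execute_with_retry(void)
{
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        if (run_task(attempt)) {
            printf("attempt %d succeeded\n", attempt);
            return true;
        }
        fprintf(stderr, "attempt %d failed, retrying\n", attempt);
    }
    return false; /* give up after MAX_RETRIES failures */
}

int main(void)
{
    return execute_with_retry() ? 0 : 1;
}
```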

*Replication* consists of executing several replicas of the task on different resources, with the expectation that at least one of them finishes the execution successfully. The replicas can be scheduled on machines from the same domain or from different domains. Since network failures can make an entire domain inaccessible, executing replicas in different domains improves reliability. Replication is also useful to guarantee the integrity of the task execution results, since defective or malicious nodes can produce erroneous results. To detect such errors, it is possible to wait until all replicas finish and apply a Byzantine agreement algorithm to compare the results.
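
A full Byzantine agreement protocol is outside the scope of this section, but the sketch below illustrates the simpler majority-voting scheme often applied to replica results in desktop grids: a result is accepted only if more than half of the replicas agree on it. All names and values are illustrative.

```c
#include <stdio.h>

/* Majority voting over replica results: accept a result only if more
 * than half of the replicas produced the same value (Boyer-Moore
 * majority vote, followed by a verification pass). Returns 0 on success. */
static int majority_result(const long *results, int n, long *out)
{
    long candidate = 0;
    int count = 0;

    for (int i = 0; i < n; i++) {          /* find a candidate value */
        if (count == 0) { candidate = results[i]; count = 1; }
        else if (results[i] == candidate) count++;
        else count--;
    }

    count = 0;                              /* verify it is a true majority */
    for (int i = 0; i < n; i++)
        if (results[i] == candidate) count++;

    if (2 * count > n) { *out = candidate; return 0; }
    return -1;                              /* no majority: rerun the task */
}

int main(void)
{
    long replica_results[] = { 42, 42, 7, 42, 42 }; /* one erroneous node */
    long accepted;
    if (majority_result(replica_results, 5, &accepted) == 0)
        printf("accepted result: %ld\n", accepted);
    return 0;
}
```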

*Checkpointing* consists of periodically storing the application state so that, after a failure, the application can be restarted and continue its execution from the last saved state. Checkpointing mechanisms can be classified according to how the application state is obtained. There are two main approaches, called system-level and application-level checkpointing.

When using system-level checkpointing, the application state is obtained directly from the process memory space, together with some register values and state information from the operating system (Litzkow et al., 1997; Plank et al., 1998). This may require modifications to the operating system kernel, which may not be possible for security reasons, but it has the advantage that the checkpointing process can be transparent to the application. Some implementations permit applications to be written in several programming languages and used without recompilation. An important disadvantage of this approach for computational grids is that the checkpoints are not portable and are therefore useful only in homogeneous clusters.

With application-level checkpointing, the application itself provides the data that will be stored in the checkpoint. It is necessary to instrument the application source code so that the application saves its state periodically and, during recovery, reconstructs its original state from the checkpoint data (Bronevetsky et al., 2003; de Camargo et al., 2005; Karablieh et al., 2001; Strumpen & Ramkumar, 1996). Manually inserting code to save and recover the application state is a very error-prone process, but this problem can be solved by providing a precompiler that automatically inserts the required code. Other drawbacks of this approach are the need for access to the application source code and the fact that checkpoints can be generated only at specified points in the execution. On the other hand, this approach has the advantage that semantic information about memory contents is available and, consequently, only the data necessary to recover the application state needs to be saved. Moreover, the semantic information permits the generation of portable checkpoints (de Camargo et al., 2005; Karablieh et al., 2001; Strumpen & Ramkumar, 1996), which is an important advantage for heterogeneous grids.
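
To make this concrete, the following is a minimal sketch of the kind of save-and-restore code such a precompiler might insert into a C application; the checkpoint file name and interval are arbitrary choices for the example.

```c
#include <stdio.h>

#define CKPT_FILE  "app.ckpt"
#define CKPT_EVERY 1000            /* iterations between checkpoints */

struct state { long i; double sum; };   /* the data needed for recovery */

int main(void)
{
    struct state s = { 0, 0.0 };

    /* Recovery: if a checkpoint exists, resume from the saved state. */
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f) {
        if (fread(&s, sizeof s, 1, f) != 1)
            s = (struct state){ 0, 0.0 };
        fclose(f);
    }

    for (; s.i < 1000000; s.i++) {
        /* Checkpoint: save the state before executing iteration i,
         * so a restart re-executes it instead of skipping it. */
        if (s.i % CKPT_EVERY == 0) {
            f = fopen(CKPT_FILE, "wb");
            if (f) { fwrite(&s, sizeof s, 1, f); fclose(f); }
        }
        s.sum += 1.0 / (double)(s.i + 1);    /* the actual computation */
    }
    printf("sum = %f\n", s.sum);
    remove(CKPT_FILE);                       /* done: discard checkpoint */
    return 0;
}
```

Dumping the struct bytes directly keeps the sketch short, but such a checkpoint is exactly the non-portable kind discussed above; an architecture-neutral encoding is sketched further below.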

In the case of coupled parallel applications, the tasks may exchange messages, which can be in transit while the application processes generate their local checkpoints. The contents of these messages, including their sending and delivery order, must also be considered part of the application state. To guarantee that the global state (which includes the local state of all tasks) is consistent, it is necessary to use checkpointing protocols, which are classified as non-coordinated, coordinated, and communication-induced (Elnozahy et al., 2002).

These parallel checkpointing protocols differ in the level of coordination among the processes: in non-coordinated protocols, each process of the parallel application generates its local checkpoint independently of the others, whereas in fully coordinated protocols, all processes synchronize before generating their local checkpoints. Communication-induced protocols are similar to non-coordinated ones, with the difference that, to guarantee the generation of global checkpoints with consistent states, processes may be required to generate additional checkpoints after receiving or before sending messages.
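
For the coordinated case, the sketch below shows one simple realization in MPI: all processes synchronize at `MPI_Barrier` and only then write their local checkpoints, under the assumption that the application has completed its outstanding communication before reaching the barrier. The per-rank file names are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Coordinated checkpointing sketch: every process synchronizes at a
 * barrier and only then writes its local checkpoint, so the set of
 * local checkpoints forms a consistent global state (assuming all
 * application communication has completed before the barrier). */
static void coordinated_checkpoint(const double *data, int n, int rank)
{
    char path[64];

    MPI_Barrier(MPI_COMM_WORLD);            /* global synchronization */

    snprintf(path, sizeof path, "ckpt.%d", rank);
    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(data, sizeof *data, (size_t)n, f);
        fclose(f);
    }

    MPI_Barrier(MPI_COMM_WORLD);            /* all checkpoints written */
}

int main(int argc, char **argv)
{
    int rank;
    double local_state[4] = { 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... computation and message exchange would happen here ... */
    coordinated_checkpoint(local_state, 4, rank);

    MPI_Finalize();
    return 0;
}
```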

**4.2.2 Workflow-level techniques**

The *redundancy* technique is similar to the alternative task technique (in which a different implementation of the task is executed after the original one fails), with the difference that the distinct implementations of the task are executed simultaneously. When the first one finishes, the task is considered complete, and the remaining tasks are killed.

The *user-defined exception handling* technique allows the user to provide the handling procedure for specific failures of particular tasks. This approach will usually be more efficient, since the user normally has specific knowledge about the tasks.

Finally, the last approach consists of generating a new workflow, called a *rescue workflow*, to execute the tasks of the original workflow that failed, together with the ones that could not be executed due to task dependencies. If the rescue workflow fails, a new one can be generated (Condor\_Team, 2004).


**4.2.3 InteGrade application failure handling**

InteGrade provides a task-level fault tolerance mechanism. In order to overcome application execution failures, this mechanism supports the most commonly used failure handling strategies: (1) **retrying**: when an application execution fails, it is restarted from scratch; (2) **replication**: the same application is submitted for execution multiple times, generating several application replicas; all replicas are active and execute the same code with the same input parameters on different nodes; and (3) **checkpointing**: the process state is periodically saved to stable storage during failure-free execution; upon a failure, the process restarts from the latest available saved checkpoint, thereby reducing the amount of lost computation. As part of the application submission process, users can select the desired technique to be applied in case of failure. These techniques can also be combined, resulting in four possible failure handling configurations: *retrying* (without checkpointing or replication), *checkpointing* (without replication), *replication* (without checkpointing), and *replication with checkpointing*.

InteGrade includes a portable application-level checkpointing mechanism (de Camargo et al., 2005) for sequential, bag-of-tasks, and BSP parallel applications written in C. This portability allows an application's stored state to be recovered on a machine with an architecture different from that of the machine where the checkpoint was generated. A precompiler inserts into the application code the statements responsible for gathering and restoring the application state from the checkpoint. For BSP applications, checkpoints are generated immediately after the end of a BSP synchronization phase. For MPI parallel applications, we provide a system-level checkpointing mechanism based on a coordinated protocol (Cardozo & Costa, 2008). For storing the checkpointing data, InteGrade uses a distributed data storage system called OppStore (Section 4.3.1).
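
The portability idea can be illustrated as follows: instead of dumping raw memory, each value is written with a fixed byte order and a fixed-width type, so a machine with different endianness or word size can reconstruct the state. This is only a sketch of the principle, not InteGrade's actual checkpoint format.

```c
#include <stdint.h>
#include <stdio.h>

/* Write a 64-bit value in a fixed (big-endian) byte order, so the
 * checkpoint can be read back on a machine with different endianness. */
static void write_u64_be(FILE *f, uint64_t v)
{
    for (int shift = 56; shift >= 0; shift -= 8)
        fputc((int)((v >> shift) & 0xff), f);
}

static uint64_t read_u64_be(FILE *f)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | (uint64_t)(fgetc(f) & 0xff);
    return v;
}

int main(void)
{
    uint64_t iteration = 123456;            /* state to checkpoint */

    FILE *f = fopen("portable.ckpt", "wb"); /* save */
    if (!f) return 1;
    write_u64_be(f, iteration);
    fclose(f);

    f = fopen("portable.ckpt", "rb");       /* restore */
    if (!f) return 1;
    uint64_t restored = read_u64_be(f);
    fclose(f);

    printf("restored iteration = %llu\n", (unsigned long long)restored);
    return 0;
}
```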

InteGrade allows replication for sequential, bag-of-tasks, MPI, and BSP applications. The number of replicas is currently defined during the application execution request, issued through the Application Submission and Control Tool (ASCT). The request is forwarded to the Global Resource Manager (GRM), which runs a scheduling algorithm that guarantees that all replicas will be assigned to different nodes. Another InteGrade component, called the Application Replication Manager (ARM), concentrates most of the code responsible for managing replication. In case of a replica failure, the ARM starts its recovery process. When the first application replica concludes its job, the ARM kills the remaining ones, releasing the allocated grid resources.
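
The first-replica-wins behavior can be mimicked locally with POSIX processes standing in for grid nodes: a coordinator forks one process per replica, waits for the first one to finish, and kills the rest. This is only an analogy for what the ARM does across nodes, not InteGrade code.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define REPLICAS 3

int main(void)
{
    pid_t pids[REPLICAS];

    /* Launch one process per replica (stand-ins for grid nodes). */
    for (int i = 0; i < REPLICAS; i++) {
        pids[i] = fork();
        if (pids[i] == 0) {
            sleep((unsigned)(1 + i));  /* fake work of varying duration */
            _exit(0);                  /* replica finished successfully */
        }
    }

    /* Wait for the first replica to finish, then kill the others,
     * releasing the resources they were using. */
    pid_t winner = wait(NULL);
    printf("replica %d finished first\n", (int)winner);
    for (int i = 0; i < REPLICAS; i++)
        if (pids[i] != winner)
            kill(pids[i], SIGTERM);

    while (wait(NULL) > 0)             /* reap the killed replicas */
        ;
    return 0;
}
```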

