**4.3 Stable storage**


To guarantee the continuity of application execution and to prevent data loss in case of failures, it is necessary to store the periodic checkpoints and the application's temporary, input, and output data on a reliable, fault-tolerant storage device or service, called stable storage. Reliability is provided through data redundancy, and the system can determine the level of redundancy depending on the unavailability rate of the storage devices used by the service.
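
For a concrete sense of how the redundancy level follows from device unavailability, assume (purely for illustration) that each storage device is unavailable independently with probability *p*; then *r* full replicas are all unavailable with probability *p*^*r*, and the smallest *r* meeting a target can be computed directly. The figures below are hypothetical, not InteGrade parameters.

```python
import math

def replicas_needed(p_unavailable: float, target_unavailability: float) -> int:
    """Smallest number of independent full replicas r such that
    p_unavailable ** r <= target_unavailability."""
    # With independent failures, all r replicas are down with probability p^r.
    return math.ceil(math.log(target_unavailability) / math.log(p_unavailable))

# Example: devices unavailable 30% of the time, and we want the stored data
# to be unreachable at most 1% of the time.
print(replicas_needed(0.30, 0.01))  # -> 4 replicas (0.3**4 = 0.0081 <= 0.01)
```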

A common strategy for data storage in computational grids is to store several replicas of files on dedicated servers managed by replica management systems (Cai et al., 2004; Chervenak et al., 2004; Ripeanu & Foster, 2002). These systems usually target high-performance computing platforms, with applications that require very large amounts (petabytes) of data and run on supercomputers connected by specialized high-speed networks.

When dealing with the storage of checkpoints of parallel applications in opportunistic grids, a common strategy is to use the grid machines that execute applications to store the checkpoints. It is possible to distribute the data over the nodes executing the parallel application that generates the checkpoints, in addition to other grid machines. In this case, data is transferred to the machines in parallel. To ensure fault tolerance, data must be coded and stored redundantly, and the data coding strategy must be selected considering its scalability, computational cost, and fault-tolerance level. The main techniques used are *data replication*, *data parity*, and *erasure codes*.

Using data replication, the system stores full replicas of the generated checkpoints. If one of the replicas becomes inaccessible, the system can easily find another. The advantage is that no extra coding is necessary, but the disadvantage is that this approach requires the transfer and storage of large amounts of data. For instance, to guarantee safety against a single failure, it is necessary to save two copies of the checkpoint, which can generate too much local network traffic, possibly compromising the execution of the parallel application. A possible approach is to store one copy of the checkpoint locally and another remotely (de Camargo et al., 2006). Although a failure in a machine running the application makes one of the checkpoints inaccessible, it is still possible to retrieve the other copy. Moreover, the other application processes can use their local checkpoint copies. Consequently, this storage mode permits recovery as long as one of the two nodes containing a checkpoint replica is available.
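
A minimal sketch of this local-plus-remote replication policy follows; the directory arguments stand in for the local disk and for a path on another grid machine, and the function names are hypothetical rather than part of any InteGrade API.

```python
import shutil
from pathlib import Path

def store_checkpoint(ckpt: Path, local_dir: Path, remote_dir: Path) -> None:
    """Keep one replica locally and one on a remote node.
    remote_dir stands for any writable location on another grid machine
    (e.g., a mounted path or one reached through a transfer service)."""
    shutil.copy(ckpt, local_dir / ckpt.name)   # fast, survives remote failures
    shutil.copy(ckpt, remote_dir / ckpt.name)  # survives failure of the local machine

def recover_checkpoint(name: str, local_dir: Path, remote_dir: Path) -> Path:
    """Return whichever replica is still accessible; recovery succeeds
    as long as one of the two nodes holding a replica is available."""
    for candidate in (local_dir / name, remote_dir / name):
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no accessible replica of {name}")
```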

The two other coding techniques decompose a file into smaller data segments, called stripes, and distribute these stripes among the machines. To ensure fault tolerance, redundant stripes are also generated and stored, permitting the original file to be recovered even if a subset of the stripes is lost. There are several algorithms to code the file into redundant fragments. A commonly used one is data parity (Malluhi & Johnston, 1998; Plank et al., 1998; Sobe, 2003), where one or more extra stripes are generated based on the parity of the bits in the original fragments. It has the advantage that it is fast to evaluate and that the original stripes can be stored without modifications, but it has the disadvantage that data cannot be recovered if two or more fragments are lost and, consequently, cannot be used for storage on devices with higher rates of unavailability.
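
As a sketch of the parity idea (not the implementation of any particular grid middleware), a single parity stripe can be computed as the bitwise XOR of the data stripes; any one lost stripe is then rebuilt from the survivors and the parity, while two or more losses are unrecoverable.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(stripes: list[bytes]) -> bytes:
    """One extra parity stripe: the bitwise XOR of all equal-length data stripes."""
    parity = stripes[0]
    for s in stripes[1:]:
        parity = xor_bytes(parity, s)
    return parity

def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild a single lost stripe as the XOR of the parity with all survivors.
    With only one parity stripe, two or more lost stripes cannot be recovered."""
    missing = parity
    for s in surviving:
        missing = xor_bytes(missing, s)
    return missing

stripes = [b"checkpoi", b"nt-data-", b"fragment"]
parity = make_parity(stripes)
# Suppose the machine holding stripes[1] becomes unavailable:
assert rebuild_missing([stripes[0], stripes[2]], parity) == stripes[1]
```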

The other strategy is to use erasure coding techniques, which allow one to code a vector *U* of size *n* into *m* + *k* encoded vectors of size *n*/*m*, with the property that one can regenerate *U* from any *m* of the *m* + *k* encoded vectors. Consequently, up to *k* encoded fragments can be lost without compromising the data.
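
The following toy example illustrates that property using a simple Reed-Solomon-style code over the prime field GF(257): the *m* data values become the coefficients of a polynomial, the *m* + *k* fragments are its evaluations at distinct points, and any *m* fragments suffice to interpolate the data back. This is only a sketch of erasure coding in general, not the coding scheme used by InteGrade/OppStore.

```python
P = 257  # small prime field; each data byte (0..255) fits as a field element

def encode(data: list[int], m: int, k: int) -> list[tuple[int, int]]:
    """Encode m data values into m + k fragments (x, y), where y is the
    polynomial with coefficients `data` evaluated at x, modulo P."""
    assert len(data) == m
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(data)) % P)
            for x in range(1, m + k + 1)]

def decode(frags: list[tuple[int, int]], m: int) -> list[int]:
    """Recover the m data values from ANY m surviving fragments,
    by Lagrange interpolation of the polynomial's coefficients."""
    pts = frags[:m]
    coeffs = [0] * m
    for j, (xj, yj) in enumerate(pts):
        # basis polynomial L_j(x) = prod_{i != j} (x - xi) / (xj - xi)
        basis = [1]   # running numerator product, lowest degree first
        denom = 1
        for i, (xi, _) in enumerate(pts):
            if i == j:
                continue
            denom = denom * (xj - xi) % P
            new = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):       # multiply basis by (x - xi)
                new[d] = (new[d] - c * xi) % P
                new[d + 1] = (new[d + 1] + c) % P
            basis = new
        scale = yj * pow(denom, P - 2, P) % P   # modular inverse (Fermat)
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * c) % P
    return coeffs

data = [72, 101, 108, 108]                 # m = 4 data values
frags = encode(data, m=4, k=2)             # 6 fragments, tolerating 2 losses
assert decode([frags[0], frags[2], frags[4], frags[5]], m=4) == data
```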

**4.3.1 InteGrade stable storage**

InteGrade implements a distributed data repository called OppStore (de Camargo & Kon, 2007). It is used for storing the application's input and output files and checkpointing data. Access to this distributed repository is performed through a library called *access broker*, which interacts with OppStore.

OppStore is a middleware that provides reliable distributed data storage using the free disk space of shared grid machines. The goal is to use this free disk space in an opportunistic way, i.e., only during the idle periods of the machines. The system is structured as a federation of clusters, connected by a Pastry peer-to-peer network (Rowstron & Druschel, 2001) in a scalable and fault-tolerant way. This federation structure allows the system to disperse application data throughout the grid. During storage, the system slices the data into several redundant, encoded fragments and stores them in different grid clusters. This distribution improves data availability and fault tolerance, since fragments are located in geographically dispersed clusters. When performing data retrieval, applications can simultaneously download the file fragments stored in the clusters with the highest bandwidth, enabling efficient data retrieval. This is OppStore's standard storage mode, called *perennial*. Using OppStore, application input and output files can be obtained from any node in the system. Consequently, after a failure, the application execution can easily be restarted on another machine. Also, when an application execution finishes, the output files uploaded to the distributed repositories can be accessed by the user from any machine connected to the grid.
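
The retrieval idea can be sketched as below; the cluster names, bandwidth figures, and the `download` callback are hypothetical and serve only to illustrate fetching the needed fragments in parallel from the clusters with the highest measured bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_fragments(holders, m, download):
    """holders: list of (cluster_id, bandwidth_estimate) pairs, one per fragment.
    Download the m fragments needed for reconstruction from the clusters with
    the highest bandwidth, in parallel; `download` is a caller-supplied transfer
    function."""
    best = sorted(holders, key=lambda h: h[1], reverse=True)[:m]
    with ThreadPoolExecutor(max_workers=m) as pool:
        return list(pool.map(lambda h: download(h[0]), best))

# Hypothetical usage: 6 clusters hold fragments, any 4 suffice to rebuild the file.
holders = [("clusterA", 80), ("clusterB", 10), ("clusterC", 55),
           ("clusterD", 95), ("clusterE", 20), ("clusterF", 60)]
frags = fetch_fragments(holders, m=4, download=lambda c: f"fragment-from-{c}")
```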

OppStore also has an *ephemeral* mode, where data is stored in the machines of the same cluster where the request was issued. It is used for data that requires high bandwidth and only needs to be available for a few hours, such as checkpoint data (de Camargo et al., 2006; Elnozahy et al., 2002) and other temporary application data. In this storage mode, the system stores the data only in the local cluster and can use an information dispersal algorithm (IDA) or data replication to provide fault tolerance. For checkpoint data, the preferred strategy is to store two copies of each local checkpoint in the cluster: one on the machine where it was generated and the other on another cluster machine. During the recovery of a failed parallel application, most of the processes will be able to recover their local checkpoints from the local machine.

The system stores most of the generated checkpoints using the ephemeral mode, since the use of local networks generates a lower overhead. To prevent losing the entire computation due to a failure or disconnection of the cluster where the application was executing, the checkpoints are also periodically stored in other clusters using the perennial storage mode, for instance, after every *k* generated checkpoints. With this strategy, InteGrade obtains both low overhead and high availability for checkpoint storage.
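
A minimal sketch of this mixed policy, with hypothetical `store_ephemeral` and `store_perennial` stand-ins for the operations that the access broker and OppStore actually provide:

```python
def store_ephemeral(data: bytes) -> None:
    # Hypothetical stand-in: in InteGrade this keeps two copies inside the
    # local cluster (the generating machine plus another cluster machine).
    print(f"ephemeral store of {len(data)} bytes in the local cluster")

def store_perennial(data: bytes) -> None:
    # Hypothetical stand-in: in InteGrade this disperses coded fragments
    # across remote clusters through OppStore.
    print(f"perennial store of {len(data)} bytes across remote clusters")

def store_checkpoint(ckpt_number: int, data: bytes, k: int = 10) -> None:
    """Every checkpoint goes to the local cluster (low overhead); every k-th
    checkpoint is also stored across other clusters, so the computation
    survives failure or disconnection of the entire local cluster."""
    store_ephemeral(data)
    if ckpt_number % k == 0:
        store_perennial(data)

for i in range(1, 31):
    store_checkpoint(i, b"...checkpoint bytes...", k=10)
```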

**References**

Barlow, H. B. (1999). *Unsupervised learning: Foundations of Neural Computation*, MIT Press.

Bisseling, R. H. (2004). *Parallel Scientific Computation: A Structured Approach using BSP and MPI*, Oxford University Press.

Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fedak, G., Germain, C., Herault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Neri, V. & Selikhov, A. (2002). MPICH-V: toward a scalable fault tolerant MPI for volatile nodes, *Proceedings of the 2002 ACM/IEEE Conference on Supercomputing*, IEEE Computer Society Press, Baltimore, Maryland, USA, pp. 1–18.

Bronevetsky, G., Marques, D., Pingali, K. & Stodghill, P. (2003). Automated application-level checkpointing of MPI programs, *PPoPP '03: Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming*, pp. 84–89.

Cai, M., Chervenak, A. & Frank, M. (2004). A peer-to-peer replica location service based on a distributed hash table, *SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing*, IEEE Computer Society, Washington, DC, USA, p. 56.

Cardozo, M. C. & Costa, F. M. (2008). MPI support on opportunistic grids based on the InteGrade middleware, *2nd Latin American Grid International Workshop (LAGrid)*, Campo Grande, Brazil.

Chervenak, A. L., Palavalli, N., Bharathi, S., Kesselman, C. & Schwartzkopf, R. (2004). Performance and scalability of a replica location service, *HPDC '04: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing*, IEEE Computer Society, Washington, DC, USA, pp. 182–191.

Condor\_Team (2004). *Online Manual of Condor Version 7.4.4*, University of Wisconsin-Madison, http://www.cs.wisc.edu/condor/manual/v7.4.

da Silva e Silva, F. J., Kon, F., Goldman, A., Finger, M., de Camargo, R. Y., Filho, F. C. & Costa, F. M. (2010). Application execution management on the InteGrade opportunistic grid middleware, *Journal of Parallel and Distributed Computing* 70(5): 573–583.

de Camargo, R. Y., Cerqueira, R. & Kon, F. (2006). Strategies for checkpoint storage on opportunistic grids, *IEEE Distributed Systems Online* 18(6).

de Camargo, R. Y. & Kon, F. (2007). Design and implementation of a middleware for data storage in opportunistic grids, *CCGrid '07: Proceedings of the 7th IEEE/ACM International Symposium on Cluster Computing and the Grid*, IEEE Computer Society, Washington, DC, USA.

de Camargo, R. Y., Kon, F. & Goldman, A. (2005). Portable checkpointing and communication for BSP applications on dynamic heterogeneous Grid environments, *SBAC-PAD'05: The 17th International Symposium on Computer Architecture and High Performance Computing*, Rio de Janeiro, Brazil, pp. 226–233.

de Ribamar Braga Pinheiro Júnior, J. (2008). *Xenia: um sistema de segurança para grades computacionais baseado em cadeias de confiança*, PhD thesis, IME/USP.

de Ribamar Braga Pinheiro Júnior, J., Vidal, A. C. T., Kon, F. & Finger, M. (2006). Trust in large-scale computational grids: An SPKI/SDSI extension for representing opinion, *4th International Workshop on Middleware for Grid Computing - MGC 2006*, ACM/IFIP/USENIX, Melbourne, Australia.

Dong, F. & Akl, S. G. (2006). Scheduling algorithms for grid computing: State of the art and open problems, *Technical Report 2006-504*, School of Computing, Queen's University, Kingston, Ontario.

El-Rewini, H., Lewis, T. G. & Ali, H. H. (1994). *Task scheduling in parallel and distributed systems*, Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
