**3.3 Application management: a mobile agents approach**

In distributed systems such as opportunistic grids, failures can occur due to several factors, most of them related to resource heterogeneity and distribution. These failures, together with the use of the resources by their owners, change grid resource availability (i.e. resources can be active, busy, off-line, crashed, etc.). An opportunistic grid middleware should be able to monitor and detect such changes, reschedule applications across the available resources, and dynamically tune the fault tolerance mechanisms to better adapt to the execution environment.

When dealing with bag-of-tasks-like applications, one interesting approach may be the use of mobile agents (Pham & Karmouch, 1998), which allow the implementation of dynamic fault tolerance mechanisms based on task replication and checkpointing. A task replica is a copy of the application binary that runs independently of the other copies. Through these mechanisms a middleware may be capable of migrating tasks when nodes fail and of coordinating task replicas and their checkpoints in a rational manner, keeping only the most advanced checkpoint and migrating slow replicas. This dynamically improves application execution, compensates for the resources wasted by task replication, and addresses scalability issues.
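The coordination rule described above (keep only the most advanced checkpoint, migrate slow replicas) can be illustrated with a minimal sketch. It is not the MobiGrid or MAG implementation: the `Replica` record, the `coordinate` function, the progress measure in [0, 1] and the `lag_threshold` value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One running copy of a task (hypothetical record, not a MAG/MobiGrid type)."""
    node: str
    progress: float  # fraction of the task completed, in [0, 1]


def coordinate(replicas, idle_nodes, lag_threshold=0.2):
    """Return (progress of the checkpoint to keep, list of (old_node, new_node)).

    Only the most advanced checkpoint is kept. A replica that trails it by
    more than `lag_threshold` (an assumed tuning knob) is moved to an idle
    node, if one is left, and resumes from that checkpoint.
    """
    best = max(r.progress for r in replicas)
    free = list(idle_nodes)
    migrations = []
    for r in sorted(replicas, key=lambda r: r.progress):  # slowest first
        if best - r.progress > lag_threshold and free:
            target = free.pop(0)
            migrations.append((r.node, target))
            r.node = target
            r.progress = best  # restart from the most advanced checkpoint
    return best, migrations
```

Calling `coordinate` periodically with fresh progress reports gives the feedback behaviour in a very reduced form: slow replicas are relocated only while idle nodes exist, so replication does not keep consuming resources on machines that contribute little.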

These mechanisms compose a feedback control system (Goel et al., 1999; Steere et al., 1999), gathering and analyzing information about the execution progress and adjusting its behavior accordingly. Agents are suitable for opportunistic environments due to intrinsic characteristics such as:


The Use Pattern Analysis performs unsupervised machine learning (Barlow, 1999; Theodoridis & Koutroumba, 2003) to obtain a fixed number of *use classes*, where each class is represented by its *prototypical* object. The idea is that each class represents a frequent use pattern, such as a busy work day, a light work day or a holiday. As in most machine learning processes, there are two phases involved, which in the InteGrade architecture are implemented by a module called Local Use Pattern Analyzer (LUPA), as follows.

**The Learning Phase.** Learning is performed off-line, using 60 objects collected by LUPA during the machine's regular use. A clustering algorithm (Everitt et al., 2001) is applied to the training data, such that each cluster corresponds to a use class, represented by a prototypical object, which is obtained by averaging over the elements of the class. Learning can occur only when there is a considerable mass of data; the InteGrade approach requires at least two months of data. As data collection proceeds, more data and more representative classes are obtained.

**The Decision Phase.** There is one LUPA module per machine on the grid. The scheduler sends requests specifying the amount of resources (CPU, disk space, RAM, etc.) and the expected duration needed by an application to be executed on that machine. The LUPA module decides whether the machine will be available for the expected duration, as explained below. LUPA constantly keeps track of the current use of resources. For each resource, it focuses on the recent history, usually the last 24 hours, and computes a distance between the recent history and each of the use classes learned during the learning phase. This distance takes into account the time of day at which the request was made, so that the recent history is compared to the corresponding times in the use classes. The class with the smallest distance is the *current use class*, which is used to predict the availability in the near future. If all resources are predicted to be available, then the application is scheduled to be executed; otherwise, it is rejected.
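The two phases can be sketched as follows. This is an illustration under stated assumptions, not LUPA's actual code. Each collected object is assumed to be a list of 24 hourly CPU loads in [0, 1]. The text does not name the clustering algorithm, so plain k-means stands in for it. Only CPU is modelled. All function names (`learn_use_classes`, `current_use_class`, `accept_request`) are hypothetical.

```python
import math


def mean_profile(days):
    """Average a list of 24-value daily profiles hour by hour."""
    return [sum(day[h] for day in days) / len(days) for h in range(24)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def learn_use_classes(days, k, rounds=10):
    """Learning phase: k-means (assumed); a prototype is the mean of its cluster."""
    prototypes = [list(d) for d in days[:k]]  # deterministic start (assumption)
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for day in days:
            nearest = min(range(k), key=lambda c: euclidean(day, prototypes[c]))
            clusters[nearest].append(day)
        # An empty cluster keeps its previous prototype.
        prototypes = [mean_profile(c) if c else prototypes[i]
                      for i, c in enumerate(clusters)]
    return prototypes


def current_use_class(recent, now_hour, prototypes):
    """Pick the class closest to the recent history, aligned by hour of day.

    `recent` holds the last len(recent) hourly samples, ending at `now_hour`.
    """
    start = now_hour - len(recent) + 1

    def distance(proto):
        return math.sqrt(sum((load - proto[(start + i) % 24]) ** 2
                             for i, load in enumerate(recent)))

    return min(prototypes, key=distance)


def accept_request(recent, now_hour, prototypes, cpu_needed, duration_hours):
    """Decision phase: accept only if enough CPU is predicted free throughout."""
    proto = current_use_class(recent, now_hour, prototypes)
    upcoming = [proto[(now_hour + i) % 24] for i in range(1, duration_hours + 1)]
    return all(1.0 - load >= cpu_needed for load in upcoming)
```

For example, with one class learned from busy work days and another from holidays, a request made at 8:00 for half of the CPU over the next four hours would be rejected if the last 24 hours resemble a busy day and accepted if they resemble a holiday. Extending the sketch to disk space or RAM would repeat the same comparison per resource and accept only when every resource is predicted to be available.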



The InteGrade research group has been investigating the use of the agent paradigm for developing a grid software infrastructure since 2004, leading to the MobiGrid (Barbosa & Goldman, 2004; Pinheiro et al., 2011) and MAG (Mobile Agents for Grids) (Lopes et al., 2005) projects that are based on the InteGrade middleware.
