2. Description of workflow scheduling problems

In the grid environment, distributed computing is carried out with resources that are scattered around the world and connected through networks. The composition of the grid environment is shown in Figure 2(a): the scattered computational resources are linked through the Internet. Each resource u has its own computing ability and external bandwidth, represented by ABu and BWu, respectively. Moreover, the resources in the grid environment are heterogeneous, meaning that the computing ability and external bandwidth differ from one resource to another. Generally, the grid environment can be represented by a graph G(R, C) composed of nodes and edges. Each node stands for a resource: R = {Ri}, i = 1, ..., N, is the set of all resources, and N is the total number of resources in the grid environment. Each edge stands for a connection between resources: C = {Cuv} is the set of resource-resource connections, where Cuv is the connection between resource u and resource v. The schematic of the grid computing environment is shown in Figure 2(b).

The workflow application can be represented by a directed acyclic graph (DAG) G(V, T), as shown in Figure 3, in which V = {Vi}, i = 1, ..., M, is the set of all tasks and M is the total number of tasks. Precedence exists between certain tasks, and each task has a workload; wi represents the workload of task i. Figure 3 also indicates the precedence relationships between tasks, for example: task 1 is the predecessor of tasks 3 and 4, and tasks 3 and 4 are the successors of task 1. Meanwhile, task 5 cannot be executed until task 2 is done, and task 6 has to wait until tasks 3, 4, and 5 are completed.

When a predecessor is completed, the data required by its successors have to be transmitted to them, i.e., transmission costs exist. TCij represents the amount of data transmitted between tasks i and j. If the predecessor and a successor are executed on different resources, a transmission cost is incurred; if they are executed on the same resource, there is no transmission cost. Meanwhile, task 0 and task 8 in the figure are virtual tasks representing the start and the end, respectively. They have no workload and involve no data transmission costs.

Stochastic Greedy-Based Particle Swarm Optimization for Workflow Application in Grid

http://dx.doi.org/10.5772/intechopen.73587

As shown in Figure 3, if task 2 and task 6 are allocated to different resources, a transmission time is incurred and the makespan is extended. Conversely, if they are executed by the same resource, there is no transmission time and the makespan is shortened. Furthermore, if task 1 and task 2 are arranged to be executed by the same resource, executing task 1 before task 2 or vice versa will have an effect on the makespan.

3. The proposed method

Many nature-inspired metaheuristic algorithms, which imitate the behaviors of biological creatures, have been proposed to search for optimal solutions to workflow scheduling problems. Those applied extensively include ACO, GA, bee colony optimization (BCO), and PSO, which is adopted in this study. Among them, PSO requires fewer parameters and is easier to implement. It has therefore been applied to diverse NP-complete problems with remarkable results, and it has also been employed to solve workflow scheduling problems effectively.


Figure 2. Grid computing environment. (a) The composition of the grid environment. (b) The grid computing environmental schematic.

Figure 3. Directed acyclic graph (DAG) of a workflow application on the grid.

The goals of workflow scheduling optimization are to match tasks to resources appropriately and to assign suitable execution priorities to tasks without precedence restrictions on the same computing resource, so as to reduce the makespan of the application execution. The cost includes the resource processing time of the path and the data transmission time. In Figure 3, for example, there are four execution sequence paths: (p1) task 0 → task 1 → task 4 → task 7 → task 8, (p2) task 0 → task 1 → task 5 → task 7 → task 8, (p3) task 0 → task 2 → task 6 → task 7 → task 8, and (p4) task 0 → task 3 → task 6 → task 7 → task 8. All the tasks in the four execution sequences must be executed to complete the job on the grid. The makespan is therefore determined by the longest execution sequence path, that is, max(cost(pi)), where cost(pi) is the cost of execution sequence path pi in the DAG in the grid environment.
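The four execution sequence paths can be recovered mechanically by walking the DAG from the virtual start task to the virtual end task. The following is a minimal sketch, assuming the successor lists are exactly those implied by the four paths listed in the text; `SUCCESSORS` and `enumerate_paths` are illustrative names, and Figure 3 itself may contain details that cannot be recovered from the prose.

```python
from typing import Dict, List

# Successor lists inferred from the four paths named in the text (an assumption).
SUCCESSORS: Dict[int, List[int]] = {
    0: [1, 2, 3],  # virtual start task
    1: [4, 5],
    2: [6],
    3: [6],
    4: [7],
    5: [7],
    6: [7],
    7: [8],
    8: [],         # virtual end task
}


def enumerate_paths(successors: Dict[int, List[int]], start: int, end: int) -> List[List[int]]:
    """Return every path from start to end using a recursive depth-first walk."""
    if start == end:
        return [[end]]
    found: List[List[int]] = []
    for nxt in successors[start]:
        for tail in enumerate_paths(successors, nxt, end):
            found.append([start] + tail)
    return found


paths = enumerate_paths(SUCCESSORS, 0, 8)
for p in paths:
    print(" -> ".join(f"task {t}" for t in p))
```

The recursion is adequate for small workflows; for large DAGs the number of paths can grow exponentially, so a longest-path computation in topological order would likely be preferable.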

The calculation of cost(pi) is shown in Eq. (1): cost(pi) is the sum of the total resource processing time on execution sequence path pi and the total data transmission time on that path. In Eq. (1), u(wt) represents the workload of the tasks allocated to resource u, ABu is the computing ability of resource u, and cost(tf) is the data transmission time (cost) on that execution sequence path, as given in Eq. (2):

$$\text{cost}(p_i) = \sum_{u \in p_i} \frac{u(w_t)}{AB_u} + \sum_{tf \in T} \text{cost}(tf) \tag{1}$$

$$\text{cost}(tf) = \frac{TC_{ij}}{\min\{BW_u, BW_v\}} \tag{2}$$

If task i and task j are allocated to resources u and v, respectively, the amount of data TCij has to be transmitted between the two tasks. Since resource u and resource v have different bandwidths, the data transmission time is limited by the smaller bandwidth (BWuv = min{BWu, BWv}). Hence, the data transmission time or cost is the amount of data transmitted divided by the smaller resource bandwidth.
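As an illustration of Eqs. (1) and (2), the sketch below computes the cost of a single path. It rests on several assumptions that are not stated in the text: the transmission sum is taken over consecutive task pairs on the path, virtual tasks are assigned to a resource but contribute zero workload, and all numbers, resource names, and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def transmission_cost(tc: float, bw_u: float, bw_v: float, same_resource: bool) -> float:
    """Eq. (2): data amount divided by the smaller bandwidth; zero on the same resource."""
    if same_resource:
        return 0.0
    return tc / min(bw_u, bw_v)


def path_cost(
    path: List[int],
    workload: Dict[int, float],
    alloc: Dict[int, str],
    ability: Dict[str, float],
    bandwidth: Dict[str, float],
    tc: Dict[Tuple[int, int], float],
) -> float:
    """Eq. (1) as read here: processing time plus transmission time along one path."""
    processing = sum(workload[t] / ability[alloc[t]] for t in path)
    transfer = 0.0
    for i, j in zip(path, path[1:]):
        u, v = alloc[i], alloc[j]
        transfer += transmission_cost(tc.get((i, j), 0.0), bandwidth[u], bandwidth[v], u == v)
    return processing + transfer


# Hypothetical data for path p1; none of these values come from the chapter.
p1 = [0, 1, 4, 7, 8]
workload = {0: 0.0, 1: 20.0, 4: 30.0, 7: 10.0, 8: 0.0}
alloc = {0: "A", 1: "A", 4: "B", 7: "B", 8: "B"}
ability = {"A": 10.0, "B": 5.0}
bandwidth = {"A": 4.0, "B": 2.0}
tc = {(1, 4): 8.0, (4, 7): 6.0}

cost = path_cost(p1, workload, alloc, ability, bandwidth, tc)
print(cost)
```

With these numbers the processing time is 20/10 + 30/5 + 10/5 = 10, and only the transfer from task 1 to task 4 crosses resources, adding 8/min(4, 2) = 4, so the path cost is 14.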

Therefore, the makespan is defined as the fitness function to denote the quality of workflow scheduling. The definition of fitness function (FIT) is shown in Eq. (3). The objective of workflow scheduling is then to find the shortest makespan (min(FIT)):

$$FIT = \max\{\text{cost}(p_i) \mid p_i \in DAG\} \tag{3}$$
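To make Eq. (3) concrete, the following sketch evaluates FIT for two candidate task-to-resource allocations and keeps the one with the smaller makespan. It is a toy instance under stated assumptions: the paths are those listed for Figure 3, every workload, bandwidth, and data amount is invented for illustration, and the model follows Eqs. (1) to (3) literally, so it does not account for tasks queuing on a shared resource.

```python
from typing import Dict, List, Tuple

# Hypothetical toy instance: every number below is illustrative only.
PATHS: List[List[int]] = [
    [0, 1, 4, 7, 8],
    [0, 1, 5, 7, 8],
    [0, 2, 6, 7, 8],
    [0, 3, 6, 7, 8],
]
WORKLOAD: Dict[int, float] = {0: 0, 1: 20, 2: 10, 3: 10, 4: 30, 5: 10, 6: 20, 7: 10, 8: 0}
ABILITY: Dict[str, float] = {"A": 10.0, "B": 5.0}
BANDWIDTH: Dict[str, float] = {"A": 4.0, "B": 2.0}
TC: Dict[Tuple[int, int], float] = {
    (1, 4): 8, (1, 5): 4, (2, 6): 4, (3, 6): 2, (4, 7): 6, (5, 7): 2, (6, 7): 4,
}


def path_cost(path: List[int], alloc: Dict[int, str]) -> float:
    """Eq. (1): processing time plus cross-resource transmission time on one path."""
    total = sum(WORKLOAD[t] / ABILITY[alloc[t]] for t in path)
    for i, j in zip(path, path[1:]):
        u, v = alloc[i], alloc[j]
        if u != v:  # Eq. (2) applies only when the two tasks use different resources
            total += TC.get((i, j), 0.0) / min(BANDWIDTH[u], BANDWIDTH[v])
    return total


def fitness(alloc: Dict[int, str]) -> float:
    """Eq. (3): the makespan is the cost of the longest execution sequence path."""
    return max(path_cost(p, alloc) for p in PATHS)


all_on_a = {t: "A" for t in WORKLOAD}   # every task on resource A
split = {**all_on_a, 4: "B"}            # same, but task 4 moved to resource B
candidates = {"all_on_a": all_on_a, "split": split}

best_name = min(candidates, key=lambda name: fitness(candidates[name]))
print(best_name, fitness(candidates[best_name]))
```

In this toy instance, moving task 4 to the slower resource raises FIT from 6.0 to 16.0, because path p1 then pays both a longer processing time and two cross-resource transfers. A search method such as PSO would explore many such allocations instead of two hand-picked ones.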
