**6. Experiments**

We conducted an experimental simulation to confirm the advantages of our proposal. We compared it with other conventional methods from the following points of view.

1. Whether minimizing $sl_w(G^R_{cls}, \phi_R)$ leads to minimizing the schedule length.
2. The range of applicability of imposing $\delta_{opt}(\phi_0)$ as the lower bound of every cluster size.

We showed by Theorems 4.1 and 4.2 that both the lower bound and the upper bound of the schedule length can be expressed by $sl_w(G^s_{cls}, \phi_s)$. Thus, in this experiment we confirm the actual relationship between the schedule length and $sl_w(G^s_{cls}, \phi_s)$.

Fig. 4. Example of the Task Clustering Algorithm.

### **6.1 Experimental environment**

In the simulation, random DAGs are generated, and each task size and each data size is chosen randomly. The CCR (Communication to Computation Ratio) (Sinnen, 2005; 2007) is varied from 0.1 to 10. The max-to-min ratio of both the data size and the task size is set to 100.
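The workload settings above can be illustrated with a short sketch. It is not the simulator used in this chapter: it assumes that CCR is the sum of data sizes divided by the sum of task sizes (the common definition for homogeneous systems), that sizes are drawn uniformly, and the function names and the edge count of 3000 are hypothetical.

```python
import random


def random_sizes(count, max_min_ratio=100.0, rng=None):
    """Draw sizes uniformly from [1, max_min_ratio].

    The max-to-min ratio of the sample is therefore at most max_min_ratio
    (uniform sampling is an assumption, not stated in the chapter).
    """
    rng = rng or random.Random()
    return [rng.uniform(1.0, max_min_ratio) for _ in range(count)]


def ccr(task_sizes, data_sizes):
    """Assumed CCR: total communication volume over total computation volume."""
    return sum(data_sizes) / sum(task_sizes)


def scale_to_ccr(task_sizes, data_sizes, target_ccr):
    """Rescale all data sizes by one factor so that ccr() equals target_ccr.

    A common factor leaves the max-to-min ratio of the data sizes unchanged.
    """
    factor = target_ccr * sum(task_sizes) / sum(data_sizes)
    return [d * factor for d in data_sizes]


rng = random.Random(0)
tasks = random_sizes(1000, rng=rng)   # |V_0| = 1000 tasks
edges = random_sizes(3000, rng=rng)   # hypothetical number of edges

for target in (0.1, 1.0, 5.0, 10.0):
    scaled = scale_to_ccr(tasks, edges, target)
    print(f"target CCR {target}: measured {ccr(tasks, scaled):.3f}")
```

Scaling every data size by the same factor changes the CCR without disturbing the max-to-min ratio, so the two settings can be controlled independently.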

The Parallelism Factor (PF), denoted $\rho$, takes the values 0.5, 1.0, and 2.0 (H. Topcuoglu, 2002). Using PF, the depth of the DAG is defined as $\sqrt{|V_0|}/\rho$.
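As a quick worked example of the depth formula, the following sketch evaluates $\sqrt{|V_0|}/\rho$ for the three PF values with $|V_0| = 1000$. The function name is hypothetical, and the result is left real-valued because the chapter does not state how the depth is rounded to a number of levels.

```python
import math


def dag_depth(num_tasks, pf):
    """Depth of the DAG: sqrt(|V_0|) / rho (real-valued, no rounding)."""
    return math.sqrt(num_tasks) / pf


for pf in (0.5, 1.0, 2.0):
    print(f"rho = {pf}: depth = {dag_depth(1000, pf):.1f}")
# rho = 0.5: depth = 63.2
# rho = 1.0: depth = 31.6
# rho = 2.0: depth = 15.8
```

A smaller $\rho$ gives a deeper and narrower DAG with less parallelism, and a larger $\rho$ gives a shallower and wider one.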

The simulation environment was developed on JRE 1.6.0\_0. The operating system is Windows XP SP3, the CPU is an Intel Core 2 Duo at 2.66 GHz, and the memory size is 2.0 GB.

### **6.2 Comparison of** $sl_w(G^R_{cls}, \phi_R)$ **and the schedule length**

In this experiment, we compared $sl_w(G^R_{cls}, \phi_R)$ with the schedule length to confirm the validity of Theorems 4.1 and 4.2. The comparison targets are labeled A, B, and C.


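Table 3. Comparison of $sl_w(G^R_{cls}, \phi_R)$ and the Schedule Length in Random DAGs ($|V_0| = 1000$).

| $\alpha$ | $\beta$ | CCR | $\lvert V^R_{cls} \rvert$ | $sl_w$ Ratio A | $sl_w$ Ratio B | $sl_w$ Ratio C | $sl$ Ratio A | $sl$ Ratio B | $sl$ Ratio C |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 0.1 | 168 | 1.000 | 1.054 | 1.097 | 1.000 | 1.018 | 1.123 |
| 5 | 5 | 1.0 | 56 | 1.000 | 1.241 | 1.391 | 1.000 | 1.131 | 1.209 |
| 5 | 5 | 5.0 | 34 | 1.000 | 1.320 | 1.514 | 1.000 | 1.140 | 1.288 |
| 5 | 5 | 10.0 | 19 | 1.000 | 1.378 | 1.611 | 1.000 | 1.219 | 1.341 |
| 5 | 10 | 0.1 | 160 | 1.000 | 1.072 | 1.105 | 1.000 | 1.012 | 1.073 |
| 5 | 10 | 1.0 | 49 | 1.000 | 1.203 | 1.419 | 1.000 | 1.144 | 1.248 |
| 5 | 10 | 5.0 | 30 | 1.000 | 1.355 | 1.428 | 1.000 | 1.172 | 1.301 |
| 5 | 10 | 10.0 | 16 | 1.000 | 1.329 | 1.503 | 1.000 | 1.237 | 1.344 |
| 10 | 5 | 0.1 | 150 | 1.000 | 1.032 | 1.066 | 1.000 | 1.016 | 1.071 |
| 10 | 5 | 1.0 | 47 | 1.000 | 1.177 | 1.284 | 1.000 | 1.097 | 1.198 |
| 10 | 5 | 5.0 | 41 | 1.000 | 1.209 | 1.615 | 1.000 | 1.173 | 1.291 |
| 10 | 5 | 10.0 | 26 | 1.000 | 1.482 | 1.598 | 1.000 | 1.227 | 1.302 |
| 10 | 10 | 0.1 | 187 | 1.000 | 1.069 | 1.044 | 1.000 | 1.044 | 1.050 |
| 10 | 10 | 1.0 | 67 | 1.000 | 1.292 | 1.157 | 1.000 | 1.179 | 1.132 |
| 10 | 10 | 5.0 | 44 | 1.000 | 1.344 | 1.419 | 1.000 | 1.203 | 1.297 |
| 10 | 10 | 10.0 | 28 | 1.000 | 1.461 | 1.433 | 1.000 | 1.272 | 1.301 |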


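Table 4. Comparison of $sl_w(G^R_{cls}, \phi_R)$ and the Schedule Length in FFT DAGs ($|V_0| = 2048$).

| $\alpha$ | $\beta$ | CCR | $\lvert V^R_{cls} \rvert$ | $sl_w$ Ratio A | $sl_w$ Ratio B | $sl_w$ Ratio C | $sl$ Ratio A | $sl$ Ratio B | $sl$ Ratio C |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 0.1 | 351 | 1.000 | 1.032 | 1.041 | 1.000 | 1.021 | 1.049 |
| 5 | 5 | 1.0 | 144 | 1.000 | 1.193 | 1.241 | 1.000 | 1.098 | 1.142 |
| 5 | 5 | 5.0 | 78 | 1.000 | 1.216 | 1.266 | 1.000 | 1.126 | 1.172 |
| 5 | 5 | 10.0 | 44 | 1.000 | 1.272 | 1.371 | 1.000 | 1.190 | 1.193 |
| 5 | 10 | 0.1 | 346 | 1.000 | 1.048 | 1.044 | 1.000 | 1.013 | 1.013 |
| 5 | 10 | 1.0 | 136 | 1.000 | 1.177 | 1.233 | 1.000 | 1.133 | 1.152 |
| 5 | 10 | 5.0 | 75 | 1.000 | 1.242 | 1.385 | 1.000 | 1.152 | 1.189 |
| 5 | 10 | 10.0 | 41 | 1.000 | 1.238 | 1.411 | 1.000 | 1.273 | 1.206 |
| 10 | 5 | 0.1 | 344 | 1.000 | 1.022 | 1.037 | 1.000 | 1.044 | 1.031 |
| 10 | 5 | 1.0 | 135 | 1.000 | 1.093 | 1.133 | 1.000 | 1.086 | 1.099 |
| 10 | 5 | 5.0 | 72 | 1.000 | 1.203 | 1.192 | 1.000 | 1.173 | 1.162 |
| 10 | 5 | 10.0 | 39 | 1.000 | 1.288 | 1.370 | 1.000 | 1.234 | 1.221 |
| 10 | 10 | 0.1 | 367 | 1.000 | 1.017 | 1.041 | 1.000 | 1.013 | 1.011 |
| 10 | 10 | 1.0 | 149 | 1.000 | 1.188 | 1.241 | 1.000 | 1.076 | 1.081 |
| 10 | 10 | 5.0 | 80 | 1.000 | 1.279 | 1.339 | 1.000 | 1.147 | 1.175 |
| 10 | 10 | 10.0 | 46 | 1.000 | 1.341 | 1.367 | 1.000 | 1.198 | 1.201 |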
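The ratio columns of Tables 3 and 4 are normalized by the value of A. The sketch below shows that normalization and one simple way to check whether the ordering of the methods by $sl_w$ matches their ordering by schedule length. The helper names are hypothetical, and the data are the four reported rows for random DAGs with $(\alpha, \beta) = (5, 5)$. This check is only an illustration, not the analysis performed in the chapter.

```python
def ratios_to_a(values):
    """Normalize per-method values by the value obtained with method A."""
    base = values["A"]
    return {name: value / base for name, value in values.items()}


def ranking(ratios):
    """Method names ordered from the smallest ratio to the largest."""
    return sorted(ratios, key=ratios.get)


# CCR -> (sl_w ratios, schedule length ratios) for (alpha, beta) = (5, 5),
# copied from the random-DAG results.
ROWS_5_5 = {
    0.1: ({"A": 1.000, "B": 1.054, "C": 1.097},
          {"A": 1.000, "B": 1.018, "C": 1.123}),
    1.0: ({"A": 1.000, "B": 1.241, "C": 1.391},
          {"A": 1.000, "B": 1.131, "C": 1.209}),
    5.0: ({"A": 1.000, "B": 1.320, "C": 1.514},
          {"A": 1.000, "B": 1.140, "C": 1.288}),
    10.0: ({"A": 1.000, "B": 1.378, "C": 1.611},
           {"A": 1.000, "B": 1.219, "C": 1.341}),
}

# Count the rows in which both indicators rank the three methods identically.
agree = sum(ranking(slw) == ranking(sl) for slw, sl in ROWS_5_5.values())
print(f"{agree} of {len(ROWS_5_5)} rows rank A, B, C identically")
# 4 of 4 rows rank A, B, C identically
```

For these rows the method with the smaller $sl_w$ also has the smaller schedule length, which is the relationship this experiment sets out to confirm.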

A, B, and C differ in how clusters are merged, while they share the same lower bound for the cluster size and the same processor assignment policy. We compared $sl_w(G^R_{cls}, \phi_R)$ and the schedule length by averaging them over 100 random DAGs.

Tables 3 and 4 show the comparison results for $sl_w(G^R_{cls}, \phi_R)$ and the schedule length. Table 3 gives the results for random DAGs, and Table 4 gives the results for FFT DAGs. In both tables, $\alpha$ is the max-min ratio of the processing speed in $P$, and $\beta$ is the max-min ratio of the communication bandwidth in $P$. The "$sl_w(G^R_{cls}, \phi_R)$ Ratio" and "$sl(G^R_{cls}, \phi_R)$ Ratio" columns are ratios to "A", i.e., a value larger than 1 means that $sl_w(G^R_{cls}, \phi_R)$ or $sl(G^R_{cls}, \phi_R)$ is larger than that of "A". In Table 3, both $sl_w(G^R_{cls}, \phi_R)$ and $sl(G^R_{cls}, \phi_R)$ in "A" are better than in "B" and "C" as a whole. In particular, the larger the CCR becomes, the better both values in "A" become relative to "B" and "C". No noteworthy trend in $sl_w(G^R_{cls}, \phi_R)$ or $sl(G^R_{cls}, \phi_R)$ can be seen as the degree of heterogeneity (i.e., $\alpha$ and $\beta$) varies. The same results hold for Table 4. From these results, it can be concluded that minimizing $sl_w(G^R_{cls}, \phi_R)$ leads to minimizing the schedule length, as theoretically proved by Theorems 4.1 and 4.2.

### **6.3 Applicability of** $\delta_{opt}(\phi_0)$

In this experiment, we examined how close to optimal the lower bound of the cluster size, $\delta_{opt}(\phi_0)$ derived by eq. (25), is. The comparison targets are based on "A" in Sec. 6.2, with only the lower bound of the cluster size changed, i.e., $\delta_{opt}(\phi_0)$, $0.2\delta_{opt}(\phi_0)$, $0.5\delta_{opt}(\phi_0)$, $1.5\delta_{opt}(\phi_0)$, and $2.0\delta_{opt}(\phi_0)$. The objective of this experiment is to confirm the range of applicability of $\delta_{opt}(\phi_0)$, because $\delta_{opt}(\phi_0)$ is not a value at which $sl_w(G^s_{cls}, \phi_s)$ can be minimized for $1 \le s$.

Fig. 5 shows the comparison results in terms of the optimality of $\delta_{opt}(\phi_0)$. Panel (a) corresponds to the degree of heterogeneity $(\alpha, \beta) = (5, 5)$, and panel (b) corresponds to $(10, 10)$. From (a), it can be seen that $\delta_{opt}(\phi_0)$ yields a better schedule length than the other cases while CCR ranges from 0.1 to 5.0. However, when CCR is 7 or more, $1.5\delta_{opt}(\phi_0)$ yields the best schedule length. This is because $\delta_{opt}(\phi_0)$ may be too small for a data-intensive DAG. Thus, $1.5\delta_{opt}(\phi_0)$ is a more appropriate size than $\delta_{opt}(\phi_0)$ when CCR exceeds a certain value. In (b), on the other hand, the larger CCR becomes, the better the schedule length with $1.5\delta_{opt}(\phi_0)$ becomes. However, while CCR is less than 3.0, $\delta_{opt}(\phi_0)$ can be the best lower bound of the cluster size. As for the other lower bounds, $2.0\delta_{opt}(\phi_0)$ has a local maximum of the schedule length ratio when CCR ranges from 0.1 to 2.0 in both figures. For larger CCR, the schedule length ratio decreases because that size becomes more appropriate for a data-intensive DAG. In the case of $0.25\delta_{opt}(\phi_0)$, by contrast, the schedule length ratio increases with CCR. This means that $0.25\delta_{opt}(\phi_0)$ becomes increasingly undersized for a data-intensive DAG as CCR increases.

Fig. 5. Optimality for the Lower Bound of the Cluster Size.

From these results, it can be said that the lower bound for the cluster size should be derived according to the mapping state. For example, if the lower bound can be adjusted as a function of each assigned processor's ability (e.g., the processing speed and the communication bandwidth), a better schedule length may be obtained. In this chapter, the lower bound is derived from the mapping state $\phi_0$. However, by using another mapping state, we may obtain a better schedule length. To do this, we must consider which mapping state has a good effect on the schedule length. This is an issue for future work.

**7. Conclusion and future works**

In this chapter, we presented a policy for deciding the assignment unit size for a processor and a task clustering for processor utilization in heterogeneous distributed systems. We defined an indicative value for the schedule length in heterogeneous distributed systems. We then theoretically proved that minimizing the indicative value leads to minimization of the schedule length. Furthermore, we defined the lower bound of the cluster size by assuming the initial mapping state. From the experimental results, we conclude that minimizing the indicative value has a good effect on the schedule length. However, we found that the lower bound of the cluster size should be adjusted by taking the assigned processor's ability into account.

As future work, we will study how to adjust the lower bound of the cluster size to obtain a better schedule length and more effective processor utilization.

**8. References**

A. Gerasoulis and T. Yang, A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors, *Journal of Parallel and Distributed Computing*, Vol. 16, pp. 276-291, 1992.
