**3.3 System model**

We assume that each PE is completely connected to the other PEs, with non-identical processing speeds and communication bandwidths. The set of PEs is expressed as $P = \{P_1, P_2, \dots, P_m\}$, the set of processing speeds as $\alpha$, i.e., $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_m\}$, and the set of communication bandwidths as $\beta$, i.e.,

$$
\beta = \begin{pmatrix}
\infty & \beta\_{1,2} & \beta\_{1,3} & \dots & \beta\_{1,m} \\
\beta\_{2,1} & \infty & \beta\_{2,3} & \dots & \beta\_{2,m} \\
\beta\_{3,1} & \beta\_{3,2} & \infty & \dots & \beta\_{3,m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\beta\_{m,1} & \beta\_{m,2} & \beta\_{m,3} & \dots & \infty
\end{pmatrix}.
\tag{1}
$$

$\beta_{i,j}$ means the communication bandwidth from $P_i$ to $P_j$. The processing time in the case that $n^s_k$ is processed on $P_i$ is expressed as $t_p(n^s_k, \alpha_i) = w(n^s_k)/\alpha_i$. The data transfer time of $e^s_{k,l}$ over $\beta_{i,j}$ is $t_c(e^s_{k,l}, \beta_{i,j}) = c(e^s_{k,l})/\beta_{i,j}$. This means that neither the processing time nor the data transfer time changes over time, and the data transfer time within one PE is assumed to be negligible.
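To make the model concrete, here is a minimal executable Python sketch of these definitions (the numeric values and helper names are illustrative, not taken from the chapter):

```
# Minimal sketch of the system model: m PEs with speeds alpha and a
# bandwidth matrix beta (illustrative values, not from the chapter).
import math

alpha = [4.0, 2.0, 1.0]                 # alpha_i: processing speed of P_i
beta = [[math.inf, 0.5, 1.0],           # beta[i][j]: bandwidth from P_i to P_j;
        [2.0, math.inf, 1.0],           # the diagonal is infinite, so intra-PE
        [1.0, 1.0, math.inf]]           # transfers take negligible time

def tp(w_k, i):
    """Processing time of a task with workload w_k on P_i: w(n_k)/alpha_i."""
    return w_k / alpha[i]

def tc(c_kl, i, j):
    """Transfer time of data of size c_kl from P_i to P_j: c(e_kl)/beta_{i,j}."""
    return c_kl / beta[i][j]            # 0.0 when i == j (division by infinity)

print(tp(10.0, 0), tc(5.0, 0, 1))       # -> 2.5 10.0
```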

#### **4. Processor utilization**

#### **4.1 The indicative value for the schedule length**

The schedule length depends on many factors, i.e., the execution time of each task, the communication time of each data exchange among tasks, the execution order after the task scheduling, processing speeds, and communication bandwidths. Furthermore, whether a data transfer time can be localized or not depends on the cluster structure. In the proposed method, a cluster is generated after the lower bound of the cluster size (the total execution time of every task included in the cluster) has been derived. The lower bound is decided when the indicative value for the schedule length is minimized. In this chapter, the indicative value is defined as $sl_w(G^s_{cls}, \phi_s)$, which means the indicative value for the schedule length after *s* task merging steps, where $\phi_s$ is the set of mappings between PEs and clusters after *s* task merging steps. $sl_w(G^s_{cls}, \phi_s)$ is the maximum value of the execution path length, including both task execution time and data transfer time, provided that each task is scheduled as late as possible and that all data from its immediate predecessors has arrived before the scheduled time (its start time). Table 1 shows notations and definitions for deriving $sl_w(G^s_{cls}, \phi_s)$. In the table, the PEs assigned to $cls_s(i)$ and $cls_s(j)$ are $P_p$ and $P_q$, respectively, and suppose $n^s_k \in cls_s(i)$, $n^s_l \in cls_s(j)$. In table 1, $S(n^s_k, i)$ in particular means the degree of increase of the execution time due to independent tasks for $n^s_k$. Therefore, the smaller $S(n^s_k, i)$, the earlier $n^s_k$ can be scheduled. The task $n^s_k$ which dominates $sl_w(G^s_{cls}, \phi_s)$ (in the case of $sl_w(G^s_{cls}, \phi_s) = level(n^s_k)$) means that the schedule length may be maximized if $n^s_k$ is scheduled as late as possible.

Table 1. Parameter definitions related to $sl_w(G^s_{cls})$ (here $n^s_k \in cls_s(i)$).

| Parameter | Definition |
|---|---|
| $\phi^s$ | $\{\dots, \langle cls_s(i), P_p \rangle, \dots\}$ |
| $top_s(i)$ | $\{n^s_k \mid \forall n^s_l \in pred(n^s_k)\ \mathrm{s.t.}\ n^s_l \notin cls_s(i)\} \cup \{\mathrm{START\ tasks} \in cls_s(i)\}$ |
| $in_s(i)$ | $\{n^s_k \mid \exists n^s_l \in pred(n^s_k)\ \mathrm{s.t.}\ n^s_l \notin cls_s(i)\} \cup \{\mathrm{START\ tasks} \in cls_s(i)\}$ |
| $out_s(i)$ | $\{n^s_k \mid \exists n^s_l \in suc(n^s_k)\ \mathrm{s.t.}\ n^s_l \notin cls_s(i)\} \cup \{\mathrm{END\ tasks} \in cls_s(i)\}$ |
| $btm_s(i)$ | $\{n^s_k \mid \forall n^s_l \in suc(n^s_k)\ \mathrm{s.t.}\ n^s_l \notin cls_s(i)\} \cup \{\mathrm{END\ tasks} \in cls_s(i)\}$ |
| $desc(n^s_k, i)$ | $\{n^s_l \mid n^s_k \prec n^s_l,\ n^s_l \in cls_s(i)\} \cup \{n^s_k\}$ |
| $S(n^s_k, i)$ | $\sum_{n^s_l \in cls_s(i)} t_p(n^s_l, \alpha_p) - \sum_{n^s_l \in desc(n^s_k, i)} t_p(n^s_l, \alpha_p)$ |
| $tlevel(n^s_k)$ | $\max_{n^s_l \in pred(n^s_k)} \{tlevel(n^s_l) + t_p(n^s_l, \alpha_p) + t_c(e_{l,k}, \beta_{q,p})\}$, if $n^s_k \in top_s(i)$; $TL_s(i) + S(n^s_k, i)$, otherwise |
| $blevel(n^s_k)$ | $t_p(n^s_k, \alpha_p) + \max_{n^s_l \in suc(n^s_k)} \{t_c(e^s_{k,l}, \beta_{p,q}) + blevel(n^s_l)\}$ |
| $level(n^s_k)$ | $tlevel(n^s_k) + blevel(n^s_k)$ |
| $TL_s(i)$ | $\max_{n^s_k \in top_s(i)} \{tlevel(n^s_k)\}$ |
| $BL_s(i)$ | $\max_{n^s_k \in out_s(i)} \{S(n^s_k, i) + blevel(n^s_k)\}$ |
| $LV_s(i)$ | $TL_s(i) + BL_s(i) = \max_{n^s_k \in cls_s(i)} \{level(n^s_k)\}$ |
| $sl_w(G^s_{cls}, \phi_s)$ | $\max_{cls_s(i) \in V^s_{cls}} \{LV_s(i)\}$ |
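To make the Table 1 quantities concrete, the following is a small executable sketch for one toy clustered DAG, under the simplifying assumption of unit speeds and bandwidths (so $t_p = w$ and $t_c = c$); the DAG, clustering, and helper names are illustrative, not from the chapter:

```
# Toy clustered DAG: cluster 'A' = {n1, n2}, cluster 'B' = {n3}
w = {1: 2.0, 2: 1.0, 3: 3.0}           # task workloads (tp = w under unit speed)
c = {(1, 2): 4.0, (1, 3): 1.0}         # edge data sizes (tc = c, unit bandwidth)
pred = {1: [], 2: [1], 3: [1]}
suc = {1: [2, 3], 2: [], 3: []}
cluster = {1: 'A', 2: 'A', 3: 'B'}     # the mapping cls_s(i): task -> cluster id

def members(i): return [n for n in w if cluster[n] == i]
def top(i):     return [n for n in members(i)
                        if all(cluster[p] != i for p in pred[n])]
def out(i):     return [n for n in members(i)
                        if not suc[n] or any(cluster[t] != i for t in suc[n])]

def desc(k, i):                        # descendants of n_k inside cls_s(i), plus n_k
    seen, stack = {k}, [k]
    while stack:
        for t in suc[stack.pop()]:
            if cluster[t] == i and t not in seen:
                seen.add(t); stack.append(t)
    return seen

def S(k, i):                           # work in cls_s(i) independent of n_k
    return sum(w[l] for l in members(i)) - sum(w[l] for l in desc(k, i))

def tlevel(k):
    i = cluster[k]
    if k in top(i):                    # ready time from predecessors outside cls_s(i)
        return max((tlevel(p) + w[p] + c[(p, k)] for p in pred[k]), default=0.0)
    return TL(i) + S(k, i)             # delayed by the independent tasks in the cluster

def TL(i):      return max(tlevel(k) for k in top(i))
def blevel(k):  return w[k] + max(((0.0 if cluster[t] == cluster[k] else c[(k, t)])
                                   + blevel(t) for t in suc[k]), default=0.0)
def BL(i):      return max(S(k, i) + blevel(k) for k in out(i))
def LV(i):      return TL(i) + BL(i)

slw = max(LV(i) for i in set(cluster.values()))   # sl_w(G^s_cls, phi_s)
print(slw)                                        # -> 6.0
```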

**Example 1.** *Fig. 1 shows one example of deriving $sl_w(G^s_{cls}, \phi_s)$ ($s = 5$). In the figure, there are two PEs, i.e., $P_1$ and $P_2$. The DAG has two clusters, i.e., $cls_5(1)$ and $cls_5(4)$, after 5 task merging steps. In (a), the numerical values on tasks and edges mean the time units to be processed on the reference PE and the time units to be transferred among reference PEs over the reference communication bandwidth. On the other hand, (b) corresponds to the state in which $cls_5(1)$ and $cls_5(4)$ have been assigned to $P_1$ and $P_2$, respectively. The bottom area shows the derivation process for $sl_w(G^5_{cls}, \phi_5)$. From the derivation process, it is shown that the schedule length may be maximized if $n^5_2$ is scheduled as late as possible.*

#### **4.2 Relationship between** $sl_w(G^s_{cls}, \phi_s)$ **and the schedule length**

Our objective is to minimize $sl_w(G^s_{cls}, \phi_s)$ while maintaining a certain size for each cluster for processor utilization. Since the schedule length cannot be known before every task is scheduled, we must estimate it by using $sl_w(G^s_{cls}, \phi_s)$. Thus, it must be proved that $sl_w(G^s_{cls}, \phi_s)$ can affect the schedule length. In this section, we show that minimizing $sl_w(G^s_{cls}, \phi_s)$ leads to minimizing the schedule length to some extent. Table 2 shows notations for describing characteristics of $sl_w(G^s_{cls}, \phi_s)$. In an identical processor system, in which every processor speed and communication bandwidth is 1, no processor assignment policy is needed; thus, we write $sl_w(G^s_{cls}, \phi_s)$ in an identical processor system as $sl_w(G^s_{cls})$. In the literature (Kanemitsu, 2010), it is proved that minimizing $sl_w(G^s_{cls})$ leads to minimizing the lower bound of the schedule length as follows.

**Lemma 1.** *In an identical processor system, let $\Delta sl^{s-1}_{w,up}$ be a value which satisfies $sl_w(G^s_{cls}) - cp \le \Delta sl^{s-1}_{w,up}$ and is derived before s task merging steps. Then we obtain*

$$sl(G\_{cls}^s) \ge \frac{sl\_w(G\_{cls}^s) - \Delta sl\_{w,up}^{s-1}}{1 + \frac{1}{g\_{min}}},\tag{2}$$

*where $cp$ and $g_{min}$ are defined in table 2, and $sl(G^s_{cls})$ is the schedule length after s task merging steps.*
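As a quick numeric illustration of (2), the values below are made up for this sketch, not taken from the chapter: with $sl_w(G^s_{cls}) = 25$, $\Delta sl^{s-1}_{w,up} = 5$, and $g_{min} = 2$,

```
% Illustrative numbers only (not from the chapter):
sl(G^s_{cls}) \;\ge\; \frac{25 - 5}{1 + \frac{1}{2}} \;=\; \frac{20}{1.5} \;\approx\; 13.33
```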

As for $\Delta sl^{s-1}_{w,up}$, it is defined in the literature (Kanemitsu, 2010). Furthermore, it can be proved that the upper bound of the schedule length can be reduced by reducing $sl_w(G^s_{cls})$, by the following lemma.


Fig. 1. Example of $sl_w(G^5_{cls}, \phi_5)$ derivation: (a) after 5 task merging steps have been done; (b) after processor assignments have been done.

**Lemma 2.** *In an identical processor system, if $sl(G^S_{cls}) \le cp$, then we obtain*

$$sl(G\_{cls}^S) \le sl\_w(G\_{cls}^S) + \max\_{p \in G\_{cls}^0} \left\{ \sum\_{\substack{n\_k^0, n\_l^0 \in p,\\ n\_k^0 \in pred(n\_l^0)}} c(e\_{k,l}^0) \right\}. \tag{3}$$


<sup>5</sup> cls (1)

assign

*Proof.* In $seq^{\prec}_s$, some edges are localized and others may not be localized. Furthermore, edges in $seq^{\prec}_s$ do not always belong to the critical path. Then we have the following relationship.

$$-\max\_{p \in G\_{cls}^0} \left\{ \sum\_{\substack{n\_k^0, n\_l^0 \in p,\\ n\_k^0 \in pred(n\_l^0)}} c(e\_{k,l}^0) \right\} \le sl\_w(G\_{cls}^S) - cp. \tag{4}$$

Also, only in the case of $sl(G^S_{cls}) \le cp$, we have the following relationship.

$$\begin{split} sl\_w(G\_{cls}^S) - cp &\le sl\_w(G\_{cls}^S) - sl(G\_{cls}^S) \\ \Leftrightarrow\ -\max\_{p \in G\_{cls}^0} \left\{ \sum\_{\substack{n\_k^0, n\_l^0 \in p,\\ n\_k^0 \in pred(n\_l^0)}} c(e\_{k,l}^0) \right\} &\le sl\_w(G\_{cls}^S) - sl(G\_{cls}^S) \\ \Leftrightarrow\ sl(G\_{cls}^S) &\le sl\_w(G\_{cls}^S) + \max\_{p \in G\_{cls}^0} \left\{ \sum\_{\substack{n\_k^0, n\_l^0 \in p,\\ n\_k^0 \in pred(n\_l^0)}} c(e\_{k,l}^0) \right\}. \end{split} \tag{5}$$

From lemmas 1 and 2, it is concluded that in an identical processor system the schedule length can be minimized if $sl_w(G^S_{cls})$ is minimized.
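Written side by side, (2) and (3) sandwich the schedule length between two quantities that both shrink as $sl_w$ shrinks (identical system, $sl(G^S_{cls}) \le cp$):

```
\frac{sl_w(G^S_{cls}) - \Delta sl^{S-1}_{w,up}}{1 + \frac{1}{g_{min}}}
\;\le\; sl(G^S_{cls}) \;\le\;
sl_w(G^S_{cls}) + \max_{p \in G^0_{cls}}
  \Big\{ \sum_{\substack{n^0_k, n^0_l \in p \\ n^0_k \in pred(n^0_l)}} c(e^0_{k,l}) \Big\}
```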

As a next step, we show the relationship between $sl_w(G^s_{cls}, \phi_s)$ and the schedule length in a heterogeneous distributed system. The following lemma is proved in the literature (Sinnen, 2007).

**Lemma 3.** In an identical processor system, we have

$$cp\_w \le sl(G\_{cls}^s). \tag{6}$$

In a heterogeneous distributed system, we assume the state shown in fig. 2, i.e., at the initial state every task is assigned the processor with the fastest processing speed and the widest communication bandwidth (let this processor be $P_{max}$). In fig. 2 (a), each task belongs to a respective cluster. Furthermore, we virtually assign $P_{max}$ to each task to decide the processing time of each task and the data transfer time between any two tasks. Let this mapping be $\phi_0$. Under this situation, we have the following corollary.

**Corollary 1.** *In a heterogeneous distributed system, let $cp_w(\phi_0)$ be the value defined in table 2 with the mapping $\phi_0$. Then we have*

$$cp\_w(\phi\_0) \le sl(G\_{cls}^s, \phi\_s). \tag{7}$$

Fig. 2. Assumed condition during cluster generation procedures: (a) the initial state (DAG: $G^0_{cls}$, assignment: $\phi_0$); (b) the state after task clustering (DAG: $G^5_{cls}$, assignment: $\phi_5$).

Table 2. Parameter definitions used in the analysis of $sl_w(G^s_{cls}, \phi_s)$.

| Parameter | Definition |
|---|---|
| $P_{max}$ | The processor having the maximum processing speed, $\alpha_{max} = \max_{\alpha_i \in \alpha}\{\alpha_i\}$, and the maximum communication bandwidth, $\beta_{max} = \max_{\beta_{i,j} \in \beta}\{\beta_{i,j}\}$ |
| $p$ | One path of $G^0_{cls}$, i.e., $\{n^0_0, n^0_1, n^0_2, \dots, n^0_k\} \cup \{e^0_{0,1}, e^0_{1,2}, \dots, e^0_{k-1,k}\}$, by which a sequence $\langle n^0_0, n^0_1, n^0_2, \dots, n^0_k \rangle$ is constructed, where $e^0_{l-1,l} \in E^0$, $n^0_0$ is a START task, and $n^0_k$ is an END task |
| $cp$ | $\max_{p \in G^0_{cls}} \left\{ \sum_{n^0_k \in p} w(n^0_k) + \sum_{e^0_{k,l} \in p} c(e^0_{k,l}) \right\}$ |
| $cp(\phi_s)$ | $\max_{p \in G^s_{cls}} \left\{ \sum_{n^s_k \in p} t_p(n^s_k, \alpha_p) + \sum_{e^s_{k,l} \in p} t_c(c(e^s_{k,l}), \beta_{p,q}) \right\}$, where $n^s_k$, $n^s_l$ are assigned to $P_p$, $P_q$ |
| $cp_w$ | $\max_{p \in G^0_{cls}} \left\{ \sum_{n^0_k \in p} w(n^0_k) \right\}$ |
| $cp_w(\phi_s)$ | $\max_{p \in G^s_{cls}} \left\{ \sum_{n^s_k \in p} t_p(n^s_k, \alpha_p) \right\}$ |
| $g_{min}$ | $\min_{n^0_k \in V^0_{cls}} \left\{ \frac{\min_{n^0_l \in suc(n^0_k)}\{w(n^0_l)\}}{\max_{n^0_l \in suc(n^0_k)}\{c(e^0_{k,l})\}},\ \frac{\min_{n^0_j \in pred(n^0_k)}\{w(n^0_j)\}}{\max_{n^0_j \in pred(n^0_k)}\{c(e^0_{j,k})\}} \right\}$ |
| $g_{max}$ | $\max_{n^0_k \in V^0_{cls}} \left\{ \frac{\max_{n^0_l \in suc(n^0_k)}\{w(n^0_l)\}}{\min_{n^0_l \in suc(n^0_k)}\{c(e^0_{k,l})\}},\ \frac{\max_{n^0_j \in pred(n^0_k)}\{w(n^0_j)\}}{\min_{n^0_j \in pred(n^0_k)}\{c(e^0_{j,k})\}} \right\}$ |
| $seq^{\prec}_s$ | One path in which every task belongs to $seq_s$ |
| $seq^{\prec}_s(i)$ | Set of subpaths in each of which every task in $cls_s(i)$ belongs to $seq^{\prec}_s$ |
| $proc(n^s_k)$ | The processor to which $n^s_k$ has been assigned |

As for the relationship between $cp$ and $cp_w$, the following is proved in the literature (Sinnen, 2007).

**Lemma 4.** *In an identical processor system, by using gmin defined in table 2, we have*

$$cp \le \left(1 + \frac{1}{g\_{min}}\right) cp\_w. \tag{8}$$
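For concreteness, here is a sketch of how $cp$, $cp_w$, and $g_{min}$ of table 2 can be computed by path enumeration, reusing the toy DAG (`w`, `c`, `pred`, `suc`) from the earlier sketch in 4.1 (illustrative helper names, not from the chapter):

```
def paths():
    """Enumerate all START-to-END task sequences of the toy DAG."""
    def extend(path):
        if not suc[path[-1]]:
            yield path
        for t in suc[path[-1]]:
            yield from extend(path + [t])
    for n in w:
        if not pred[n]:                 # START tasks
            yield from extend([n])

# cp: longest path counting both work and edge data; cp_w: work only.
cp   = max(sum(w[n] for n in p) + sum(c[e] for e in zip(p, p[1:])) for p in paths())
cp_w = max(sum(w[n] for n in p) for p in paths())

# g_min: the smallest work-to-communication ratio over all tasks (granularity).
ratios = []
for n in w:
    if suc[n]:
        ratios.append(min(w[t] for t in suc[n]) / max(c[(n, t)] for t in suc[n]))
    if pred[n]:
        ratios.append(min(w[q] for q in pred[n]) / max(c[(q, n)] for q in pred[n]))
g_min = min(ratios)

print(cp, cp_w, g_min)                  # -> 7.0 5.0 0.25
```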

By using lemma 4, in a heterogeneous distributed system, the following is derived.

**Corollary 2.** *In a heterogeneous distributed system, we have*

$$cp(\phi\_s) \le \left(1 + \frac{1}{g\_{min}(\phi\_s)}\right) cp\_w(\phi\_s). \tag{9}$$

From corollary 1, the following is derived.

**Corollary 3.** *In a heterogeneous distributed system, we have*

$$cp\_w(\phi\_0) \le cp\_w(\phi\_s) \le sl(G\_{cls}^s, \phi\_s). \tag{10}$$

From corollaries 2 and 3, the following theorem is derived.

**Theorem 4.1.** *In a heterogeneous distributed system, let the DAG after s task merging steps be $G^s_{cls}$, and assume every cluster in $V^s_{cls}$ is assigned to a processor in P. Let the schedule length be $sl(G^s_{cls}, \phi_s)$. If we define $\Delta sl^{s-1}_{w,up}$ such that it satisfies $sl_w(G^s_{cls}, \phi_s) - cp(\phi_0) \le \Delta sl^{s-1}_{w,up}$, the following relationship is derived.*

$$sl(G\_{cls}^s, \phi\_s) \ge \frac{sl\_w(G\_{cls}^s, \phi\_s) - \Delta sl\_{w,up}^{s-1}}{1 + \frac{1}{g\_{min}(\phi\_0)}}. \tag{11}$$

*Proof.* From the assumption and corollary 2, we have

$$sl\_w(G\_{cls}^s, \phi\_s) - \left(1 + \frac{1}{g\_{min}(\phi\_0)}\right) cp\_w(\phi\_0) \le \Delta sl\_{w,up}^{s-1}. \tag{12}$$

Also, from corollary 3, we obtain $cp_w(\phi_0) \le sl(G^s_{cls}, \phi_s)$. Thus, if this is applied to (12), we have

$$sl\_w(G\_{cls}^s, \phi\_s) - \left(1 + \frac{1}{g\_{min}(\phi\_0)}\right) sl(G\_{cls}^s, \phi\_s) \le \Delta sl\_{w,up}^{s-1} \tag{13}$$

⇔ (14)

$$sl(G\_{cls}^s, \phi\_s) \ge \frac{sl\_w(G\_{cls}^s, \phi\_s) - \Delta sl\_{w,up}^{s-1}}{1 + \frac{1}{g\_{min}(\phi\_0)}}. \tag{15}$$

Assume that $\Delta sl^{s-1}_{w,up}$ is the value which is decided after $s - 1$ task merging steps. Since

$$cp(\phi\_0) = sl\_w(G\_{cls}^0, \phi\_0), \tag{16}$$

this value is an upper bound of the increase in terms of $sl_w(G^s_{cls}, \phi_s)$ and can be defined under any policy, e.g., assigning the slowest processor to each cluster. However, at least $\Delta sl^{s-1}_{w,up}$ must be decided before *s* task merging steps. From the theorem, it can be said that reducing $sl_w(G^s_{cls}, \phi_s)$ leads to a reduction of the lower bound of the schedule length in a heterogeneous distributed system.

As for the upper bound of the schedule length, the following theorem is derived.

**Theorem 4.2.** *In a heterogeneous distributed system, if and only if $sl(G^s_{cls}, \phi_s) \le cp(\phi_0) = sl(G^0_{cls}, \phi_0)$, we have*

$$sl(G\_{cls}^s, \phi\_s) \le sl\_w(G\_{cls}^s, \phi\_s) + \zeta - \lambda - \mu, \tag{17}$$


*where*

$$\zeta = \max\_{p} \left\{ \sum\_{\substack{n\_k^0, n\_l^0 \in p,\\ n\_k^0 \in pred(n\_l^0),\\ t\_c(e\_{k,l}^s,\ \max\_{\beta\_{i,j} \in \beta}\{\beta\_{i,j}\}) = 0}} t\_c\left(e\_{k,l}^0,\ \max\_{\beta\_{i,j} \in \beta}\left\{\beta\_{i,j}\right\}\right) \right\}, \tag{18}$$

$$\lambda = \min\_{p} \left\{ \sum\_{\substack{n\_k^0 \in p,\\ proc(n\_k^s) = P\_m}} \left( t\_p(n\_k^s, \alpha\_m) - t\_p(n\_k^0,\ \max\_{\alpha\_i \in \alpha}\{\alpha\_i\}) \right) \right\}, \tag{19}$$

$$\mu = \min\_{p} \left\{ \sum\_{\substack{n\_k^0, n\_l^0 \in p,\ proc(n\_k^s) = P\_i,\\ proc(n\_l^s) = P\_j,\\ n\_k^0 \in pred(n\_l^0)}} \left( t\_c(e\_{k,l}^s, \beta\_{i,j}) - t\_c(e\_{k,l}^0,\ \max\_{\beta\_{i,j} \in \beta}\left\{\beta\_{i,j}\right\}) \right) \right\}. \tag{20}$$

*where $p$ and $proc(n^0_k)$ are defined in table 2. That is, $\zeta$, $\lambda$, and $\mu$ are derived by scanning every path in the DAG.*
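A sketch of the path scan just described, shown for $\zeta$ only (illustrative; `beta_max` stands for $\max_{\beta_{i,j}\in\beta}\{\beta_{i,j}\}$, and `localized` reports whether an edge ended up inside one cluster after the *s* merging steps):

```
def zeta(paths, c, localized, beta_max):
    """Max over paths of the transfer time removed by edge localization (eq. 18)."""
    best = 0.0
    for p in paths:                          # p is a task sequence n_0, ..., n_k
        gain = sum(c[(a, b)] / beta_max      # t_c of each edge that was localized
                   for a, b in zip(p, p[1:]) if localized((a, b)))
        best = max(best, gain)
    return best
```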

*Proof.* After *s* task merging steps, there may be both localized and unlocalized edges among those which compose $sl_w(G^s_{cls}, \phi_s)$. Obviously, we have $sl_w(G^0_{cls}, \phi_0) = cp(\phi_0)$, but such edges are not always ones which belong to $cp(\phi_0)$. Therefore, the lower bound of $sl_w(G^s_{cls}, \phi_s) - cp(\phi_0)$ can be derived from three factors, i.e., the decrease of the data transfer time by localization in one path, the increase of the processing time through the task merging steps (from $\phi_0$ to $\phi_s$), and the increase of the data transfer time for each unlocalized edge (from $\phi_0$ to $\phi_s$). The localized data transfer time is derived by taking the sum of the localized data transfer times over one path. On the other hand, if the increase of the processing time is derived by taking the minimum, over all paths, of the sum of the increases of task processing times from $\phi_0$ to $\phi_s$, the actual increase is $\lambda$ or more. The unlocalized data transfer time is expressed as $\mu$. Then we have

$$-\zeta + \lambda + \mu \le sl\_w(G\_{cls}^s, \phi\_s) - cp(\phi\_0). \tag{21}$$

If *sl*(*G<sup>s</sup> cls*, *<sup>φ</sup>s*) <sup>≤</sup> *cp*(*φ*0) = *sl*(*G*<sup>0</sup> *cls*, *φ*0), we obtain


$$-\zeta + \lambda + \mu \le sl\_w(G\_{cls}^R, \phi\_R) - sl(G\_{cls}^R, \phi\_R) \tag{22}$$

$$\Leftrightarrow sl(G\_{cls}^R, \phi\_R) \le sl\_w(G\_{cls}^R, \phi\_R) + \zeta - \lambda - \mu. \tag{23}$$

Theorem 4.2 holds if we adopt a clustering policy such that $sl(G^s_{cls}, \phi_s) \le sl(G^{s-1}_{cls}, \phi_{s-1})$. From theorems 4.1 and 4.2, it can be concluded that reducing $sl_w(G^s_{cls}, \phi_s)$ leads to the reduction of the schedule length in a heterogeneous distributed system. Thus, the first objective of our proposal is to minimize $sl_w(G^s_{cls}, \phi_s)$.

#### **4.3 The lower bound of each cluster size**

To achieve processor utilization, satisfying only "$sl_w(G^s_{cls}, \phi_s)$ minimization" is not enough, because this value does not guarantee each cluster size. Thus, in this section we present how large each cluster size should be. In the literature (Kanemitsu, 2010), the lower bound of each cluster size in an identical processor system is derived as follows.

$$\delta\_{opt} = \sqrt{cp\_w \cdot \frac{\max\_{n\_l^0 \in V\_{cls}^0}\left\{ w(n\_l^0) \right\}}{g\_{max}}}. \tag{24}$$

Eq. (24) is the lower bound of each cluster size when $sl_w(G^R_{cls})$ can be minimized, provided that every cluster size is above a certain threshold $\delta$. Here, *R* corresponds to the number of merging steps after which every cluster size is $\delta_{opt}$ or more. Taking the initial state of the DAG in a heterogeneous system into account, $\delta_{opt}$ is expressed as $\delta_{opt}(\phi_0)$ as follows.

$$\delta\_{opt}(\phi\_0) = \sqrt{cp\_w(\phi\_0) \cdot \frac{\max\_{n\_l^0 \in V\_{cls}^0}\left\{ t\_p(n\_l^0,\ \max\_{\alpha\_i \in \alpha}\{\alpha\_i\}) \right\}}{g\_{max}(\phi\_0)}}. \tag{25}$$

By imposing $\delta_{opt}(\phi_0)$, it can be said that at least $sl_w(G^0_{cls}, \phi_0)$ can be minimized. However, for $s \ge 1$, $sl_w(G^s_{cls}, \phi_s)$ cannot always be minimized by $\delta_{opt}(\phi_0)$, because the mapping between clusters and processors changes and then $sl_w(G^s_{cls}, \phi_s)$ is not equal to $sl_w(G^0_{cls}, \phi_0)$. In this chapter, one heuristic of our method is to impose the same lower bound ($\delta_{opt}(\phi_0)$) on every cluster which will be generated by the task clustering.
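As a direct transcription of (25) (a sketch; `cp_w_phi0`, `alpha_max`, and `g_max_phi0` stand for the table 2 quantities under $\phi_0$):

```
def delta_opt(cp_w_phi0, w, alpha_max, g_max_phi0):
    """Lower bound on the cluster size, eq. (25): sqrt(cp_w * max tp / g_max)."""
    return (cp_w_phi0 * max(wk / alpha_max for wk in w.values()) / g_max_phi0) ** 0.5
```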

#### **5. Task clustering algorithm**

#### **5.1 Overview of the algorithm**

In the previous section, we presented how large each cluster size should be set for processor utilization. In this section, we present the task clustering algorithm incorporating the following two requirements.

1. Every cluster size is $\delta_{opt}(\phi_0)$ or more.


2. Minimize $sl_w(G^R_{cls}, \phi_R)$, where *R* is the total number of merging steps until the first requirement is satisfied.

Fig. 3 shows the task clustering algorithm. At first, the mapping $\phi_0$ is applied to every task. Then $\delta_{opt}(\phi_0)$ is derived. Before the main procedures, two sets are defined, i.e., $UEX_s$ and $RDY_s$. $UEX_s$ is the set of clusters whose size is smaller than $\delta_{opt}(\phi_0)$, and $RDY_s$ is defined as follows.

$$\begin{split} RDY\_s = {} & \{cls\_s(r) \mid cls\_s(r) \in UEX\_s,\ pred(n\_r^s) = \emptyset,\ cls\_s(r) = \{n\_r^s\}\} \\ & \cup \left\{ cls\_s(r) \,\middle|\, \begin{aligned} & cls\_s(r) \in UEX\_s,\ cls\_s(q) \notin UEX\_s,\\ & n\_{q'}^s \in cls\_s(q),\ n\_{q'}^s \in pred(n\_{r'}^s)\ \text{for}\ \forall n\_{r'}^s \in top\_s(r) \end{aligned} \right\}. \end{split} \tag{26}$$

$RDY_s$ is the set of clusters whose preceding cluster sizes are $\delta_{opt}(\phi_0)$ or more. That is, the algorithm tries to merge each cluster in a top-to-bottom manner.
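A sketch of the $RDY_s$ test in (26) (illustrative helpers, not from the chapter: `is_entry(r)` says whether $cls_s(r)$ is a single START task, and `pred_clusters(r)` returns the set of clusters containing immediate predecessors of the tasks in $top_s(r)$):

```
def rdy(UEX, pred_clusters, is_entry):
    """Clusters still below the size bound whose predecessor clusters all left UEX_s."""
    ready = {r for r in UEX if is_entry(r)}
    ready |= {r for r in UEX
              if pred_clusters(r) and not (pred_clusters(r) & UEX)}
    return ready
```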

The algorithm proceeds while $UEX_s \ne \emptyset$, which implies that at least one cluster in $UEX_s$ exists. At line 3, one processor is selected by a processor selection method, e.g., by CHP (C. Boeres, 2004) (in this chapter, we do not present processor selection methods). At line 4, one cluster is selected as *pivots*, which corresponds to "the first cluster for merging". Once *pivots* is selected, "the second cluster for merging", i.e., *targets*, is needed. Thus, during lines 5 to 7, procedures for selecting *targets* and merging *pivots* and *targets* are performed. After those procedures, at line 7, $RDY_s$ is updated to become $RDY_{s+1}$, and *pivots* is also updated to become $pivot_{s+1}$. The procedures at lines 6 and 7 are repeated until the size of *pivots* is $\delta_{opt}(\phi_0)$ or more. The algorithm in fig. 3 has parts in common with that of the literature (Kanemitsu, 2010), i.e., both algorithms use *pivots* and *targets* for merging two clusters until the size of *pivots* exceeds a lower bound of the cluster size. However, one difference between them is that the algorithm in fig. 3 keeps the same *pivots* over merging steps until its size exceeds $\delta_{opt}(\phi_0)$, while the algorithm in (Kanemitsu, 2010) selects a new *pivots* in every merging step. The reason for keeping the same *pivots* is to reduce the time complexity of the selection of *pivots*, which requires scanning every cluster in $RDY_s$. As a result, the number of scans of $RDY_s$ can be reduced compared with that of (Kanemitsu, 2010).

#### **5.2 Processor assignment**

In the algorithm presented in fig. 3, the processor assignment is performed before selecting *pivots*. Suppose that a processor $P_p$ is selected before the $(s+1)$-th merging step. Then we assume that $P_p$ is assigned to every cluster to which $P_{max}$ is assigned, i.e., to every cluster to which no actual processor has been assigned. By doing so, we assume that such unassigned clusters are assigned to "an identical processor system of $P_p$" in order to select *pivots*. Fig. 4 shows an example of the algorithm. In the figure, (a) is the state of $\phi_2$, in which the size of $cls_2(1)$ is $\delta_{opt}(\phi_0)$ or more. Thus, $RDY_2 = \{cls_2(3) = \{n^2_3\},\ cls_2(4) = \{n^2_4\},\ cls_2(7) = \{n^2_7\}\}$. The communication bandwidth from $P_1$ to $P_{max}$ is set as $\min_{1 \le q \le m,\, 1 \ne q} \beta_{1,q}$ in order to regard the communication bandwidth between an actual processor and $P_{max}$ as the bottleneck in the schedule length. In (b), it is assumed that every cluster in $UEX_2$ is assigned to $P_p$ after $P_p$ is selected. Bandwidths among $P_p$ are set as $\min_{1 \le q \le m,\, p \ne q} \beta_{p,q}$ to estimate the worst-case $sl_w(G^2_{cls}, \phi_2)$. Therefore, $pivot_2$ (in this case, $cls_2(3)$) is selected by deriving the *LV* value for each cluster in $RDY_2$ under such a mapping state. After (b), if the size of $cls_3(3)$ is smaller than $\delta_{opt}(\phi_0)$, every cluster in $UEX_3$ is still assigned to $P_p$ to maintain the mapping state. In (d), if the size of $cls_4(3)$ exceeds $\delta_{opt}(\phi_0)$, the mapping is changed, i.e., clusters in $UEX_4$ are assigned to $P_{max}$ to select the new $pivot_4$ for generating the new cluster.
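The pessimistic bandwidth choice above can be written as a one-liner (a sketch; `beta` is the bandwidth matrix of eq. (1)):

```
def worst_bandwidth(beta, p):
    """min over q != p of beta[p][q]: the bottleneck bandwidth assumed for P_p."""
    return min(b for q, b in enumerate(beta[p]) if q != p)
```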


```
INPUT: G0_cls
OUTPUT: GR_cls
Set the mapping φ0 to the input DAG.
Define UEXs as the set of clusters whose size is under δopt(φ0);
Define RDYs as the set of clusters which satisfies eq. (26);
For each nk ∈ V, let n0_k ← nk, cls0(k) = {n0_k}, and put cls0(k) into V0_cls.
0. Derive δopt(φ0) by eq. (25).
1. E0 ← E, UEX0 ← V0_cls, RDY0 ← {cls0(k) | cls0(k) = {n0_k}, pred(n0_k) = ∅};
2. WHILE UEXs ≠ ∅ DO
3.   select a processor Pp from P;
4.   pivots ← getPivot(RDYs);
5.   WHILE size of pivots < δopt(φ0) DO
6.     targets ← getTarget(pivots);
7.     RDYs+1 ← merging(pivots, targets) and update pivots;
8.   ENDWHILE
9. ENDWHILE
10. RETURN GR_cls;
```
Fig. 3. Procedures for the Task Clustering.

#### **5.3 Selection for** *pivots* **and** *targets*

As mentioned in 5.1, one objective of the algorithm is to minimize $sl_w(G^R_{cls}, \phi_R)$. Therefore, in $RDY_s$, *pivots* should have the maximum *LV* value (defined in table 1), because such a cluster may dominate $sl_w(G^s_{cls}, \phi_s)$, and then $sl_w(G^{s+1}_{cls}, \phi_{s+1})$ after the $(s+1)$-th merging step may become lower than $sl_w(G^s_{cls}, \phi_s)$. Our heuristic behind the algorithm is that this policy for selecting *pivots* can contribute to minimizing $sl_w(G^R_{cls}, \phi_R)$. The same requirement holds for the selection of *targets*, i.e., *targets* should be the cluster which dominates the *LV* value of *pivots*. In fig. 4 (b), $cls_2(3)$, having the maximum *LV* value in $RDY_2$, is selected. Then $n^2_6$, i.e., $cls_2(6)$ dominating $LV_2(3)$, is selected as $target_2$. Similarly, in (c), $n^3_5$, i.e., $cls_3(5)$ dominating $LV_3(3)$, is selected as $target_3$.

#### **5.4 Merging** *pivots* **and** *targets*

After *pivots* and *targets* have been selected, the merging procedure, i.e.,

$$pivot\_{s+1} \leftarrow pivot\_s \cup target\_s, \tag{27}$$

is performed. This procedure means that every cluster in *targets* is included in $pivot_{s+1}$. Then *pivots* and *targets* are removed from $UEX_s$ and $RDY_s$. After this merging step has been performed, clusters satisfying the requirements for $RDY_{s+1}$ (in eq. (26)) are included in $RDY_{s+1}$. Furthermore, every cluster's *LV* value is updated for selecting $pivot_{s+1}$ and $target_{s+1}$ before the next merging step.

#### **6. Experiments**

We conducted an experimental simulation to confirm the advantages of our proposal. Thus, we compared it with other conventional methods in terms of the following points of view.
