**2.1. Model definition**

A model should take a form that is concise and close to a designer's intuition about what the modeled system looks like. In the case of a processor pipeline, the simplest description is that instructions flow through separate pipeline stages connected by buffers. Control dependences stall the inflow of useful instructions (fluid) into the pipeline, whereas true data dependences narrow the aperture of the pipeline and decrease the outflow rate. The buffer levels vary continually and affect both the inflow and outflow rates. Branch prediction techniques tend to eliminate stalls in the inflow, while value prediction techniques help keep the outflow rate as high as possible.

Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks

**2.2. FSPN representation** 

Representing the dynamic behavior of systems subject to randomness or variability is the main concern of *stochastic modeling*. It relies on the use of random variables and their distribution functions [16]. We assume that the distribution of the time between two consecutive occurrences of branch instructions in the fluid stream is exponential with rate *λ*. The rate depends on the instruction fetch bandwidth, as well as the program's average *basic block size*. Branches vary widely in their dynamic behavior, and predictors that work well on one type of branches may not work as well on others. A set of hard-to-predict branches that comprise a fundamental limit to traditional branch predictors can always be identified [17]. We assume that there are two classes: *easy-to-predict* and *hard-to-predict branches*, and the expected branch prediction accuracy is higher for the first, and lower for the second. The probabilities to classify a branch as either easy- or hard-to-predict depend on the program characteristics.

When the instruction fetch rate is low, a significant portion of data dependences span across instructions that are fetched consecutively [18]. As a result, these instructions (a producer-consumer pair) will eventually initiate their execution in a sequential manner. In this case, the prediction becomes useless due to the availability of the consumer's input value. Hence, in each cycle, an important factor is the number of instructions that consume results of *simultaneously* initiated producer instructions. We assume that the distribution of the time between two consecutive occurrences of consuming instructions in the fluid stream is exponential with rate *μ*. The rate depends on the number of instructions that simultaneously initiate execution at a functional unit, as well as the program's average *dynamic instruction distance*. We assume that there are two classes of consuming instructions: (1) instructions that consume *easy-to-predict values* and (2) instructions that consume *hard-to-predict values*. The expected value prediction accuracy is higher for the first and lower for the second. The probability to classify a value as either easy- or hard-to-predict depends on the program's characteristics, similarly to the branch classification.

The set of programs executed on the machine represent the *input space*. Programs with different characteristics are executed randomly and independently according to the *operational profile*. We partition the input space by grouping programs that exhibit as nearly as possible homogenous behavior into *program classes*. Since there are a finite number of partitions (classes), the upper limits of *λ* and *μ*, as well as the probabilities to classify a branch/value as either easy- or hard-to-predict are considered to be discrete random variables and have different values for different program classes.

Petri Nets – Manufacturing and Computer Science

We assume that the pipeline is organized in four stages: Fetch, Decode/Issue, Execute and Commit. Fluid places *PIC*, *PIB*, *PRS/LSQ*, *PROB*, *PRR*, *PEX* and *PREG*, depicted by means of two concentric circles (Figure 1), represent buffers between pipeline stages: the *instruction cache*, *instruction buffer*, *reservation stations and load/store queue*, *reorder buffer*, *rename registers*, *instructions that have completed execution* and *architectural registers*. Five of them have limited capacities: *ZIBmax*, *ZRS/LSQmax*, *ZRRmax*, *ZROBmax* and *ZEXmax*. We prohibit both an overflow and a negative level in a fluid place. The fluid place *PTIME* has the function of an hourglass: it is constantly filled at rate 1 up to the level 1 and then flushed out, which corresponds to the machine clock cycle. *ZTIME*(*t*) denotes the fluid level in *PTIME* at time *t*. Fluid arcs are drawn as double arrows to suggest a pipe. Flow rates are piecewise constant, i.e., they take different values at the beginning of each cycle and are limited by the fetch/issue width of the machine (*W*). Rates depend on the vector of fluid levels **Z**(*t*) and change when *TCLOCK* fires and the fluid in *PTIME* is flushed out. The flush-out arc is drawn as a thick single arrow.

Let *ZIC0*, *ZIB0*, *ZRS/LSQ0*, *ZRR0*, *ZROB0* and *ZEX0* be the fluid levels at the beginning of the clock cycle, i.e. *ZIC0* = *ZIC*(*t0*), *ZIB0* = *ZIB*(*t0*), *ZRS/LSQ0* = *ZRS/LSQ*(*t0*), *ZRR0* = *ZRR*(*t0*), *ZROB0* = *ZROB*(*t0*) and *ZEX0* = *ZEX*(*t0*), where *t0* ≤ *t* and *ZTIME*(*t0*) = 0.

A high-bandwidth instruction fetch mechanism fetches up to *W* instructions per cycle and places them in the instruction buffer. The fetch rate is given by:

$$r_{FETCH} = \min\left(Z_{IB\,max} - Z_{IB_0} + r_{ISSUE},\; Z_{IC_0},\; W\right) \tag{1}$$

**Figure 1.** A Fluid Stochastic Petri Net model of an ILP processor

In the case of a branch misprediction, the fetch unit is effectively stalled and no useful instructions are added to the buffer. Instruction cache misses are ignored.

Instruction issue tries to send *W* instructions to the appropriate reservation stations or the load/store queue on every clock cycle. Rename registers are allocated to hold the results of the instructions and reorder buffer entries are allocated to ensure in-order completion. Among the instructions that initiate execution in the same cycle, speculatively executed consuming instructions are forced to retain their reservation stations. As a result, the issue rate is given by:

$$r_{ISSUE} = \min\left(Z_{RR\,max} - Z_{RR_0} + r_{COMMIT},\; Z_{ROB\,max} - Z_{ROB_0} + r_{COMMIT},\; Z_{RS/LSQ\,max} - Z_{RS/LSQ_0},\; Z_{IB_0},\; W\right) \tag{2}$$

Up to *W* instructions are *in execution* at the same time. With the assumptions that functional units are always available and out-of-order execution is allowed, the instructions *initiate* and *complete* execution with rate:

$$r\_{\text{INITIATE}} = r\_{\text{COMPLETE}} = \min(Z\_{RS/LSQ\_0}, W) \tag{3}$$


During the execute stage, the instructions first check to see if their source operands are available (predicted or computed). For simplicity, we assume that the execution latency of each instruction is a single cycle. Instructions execute and forward their own results back to subsequent instructions that might be waiting for them (no result forwarding delay). Every reference to memory is present in the first-level cache. With the last assumption, we eliminate the effect of the memory hierarchy.

The instructions that have completed execution are ready to move to the last stage. Up to *W* instructions may commit per cycle. The results in the rename registers are written into the register file and the rename registers and reorder buffer entries freed. Hence:

$$r_{COMMIT} = \min\left(Z_{EX_0},\; W\right) \tag{4}$$
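As a concrete illustration, the per-cycle rate computation of Eqs. (1)-(4) can be sketched in a few lines. All names here (`cycle_rates`, the level dictionaries, the width `W`) and the example numbers are illustrative assumptions, not part of the model:

```python
# Sketch (assumed encoding): the piecewise-constant flow rates of
# Eqs. (1)-(4), recomputed from the fluid levels at the start of a cycle.
# Evaluation order matters: commit first, then initiate, issue, fetch,
# because Eq. (2) uses r_COMMIT and Eq. (1) uses r_ISSUE.

def cycle_rates(z, zmax, W):
    """Return the flow rates for one clock cycle from the levels z."""
    # Eq. (4): up to W completed instructions commit per cycle.
    r_commit = min(z["EX"], W)
    # Eq. (3): initiation and completion proceed at the same rate.
    r_initiate = r_complete = min(z["RS/LSQ"], W)
    # Eq. (2): issue is limited by free rename registers, free reorder
    # buffer entries, free reservation stations, buffered instructions, W.
    r_issue = min(zmax["RR"] - z["RR"] + r_commit,
                  zmax["ROB"] - z["ROB"] + r_commit,
                  zmax["RS/LSQ"] - z["RS/LSQ"],
                  z["IB"], W)
    # Eq. (1): fetch is limited by free instruction-buffer space,
    # the remaining program fluid in the instruction cache, and W.
    r_fetch = min(zmax["IB"] - z["IB"] + r_issue, z["IC"], W)
    return {"FETCH": r_fetch, "ISSUE": r_issue,
            "INITIATE": r_initiate, "COMPLETE": r_complete,
            "COMMIT": r_commit}

# Example: a 4-wide machine mid-execution (illustrative levels).
z = {"IC": 100.0, "IB": 6.0, "RS/LSQ": 3.0, "RR": 10.0, "ROB": 10.0, "EX": 2.0}
zmax = {"IB": 8.0, "RS/LSQ": 16.0, "RR": 32.0, "ROB": 32.0, "EX": 4.0}
rates = cycle_rates(z, zmax, W=4)
```

Note how the buffer levels couple the stages exactly as in the fluid intuition: a full instruction buffer throttles fetch unless issue drains it in the same cycle.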

In order to capture the relative occurrence frequencies of different program classes, we introduce a set of weighted immediate transitions in the Petri Net. Each program class is assigned an immediate transition *TCLASSi* with weight *wTCLASSi*. The operational profile is a set of weights. The probability of firing the immediate transition *TCLASSi* represents the probability of occurrence of a class *i* program, given by:

$$\hat{w}\_{T\_{\text{CLASS}\_i}} = \frac{w\_{T\_{\text{CLASS}\_i}}}{\sum\_{k=1}^n w\_{T\_{\text{CLASS}\_k}}} \tag{5}$$

A token in *PSTART* denotes that a new execution is about to begin. The process of firing one of the immediate transitions randomly chooses a program from one of the classes. The firing of transition *TCLASSi* puts *i* tokens in place *PCLASS*, which identify the class. At the same time instant, tokens occur in places *PFETCH* and *PINITIATE*, while the fluid place *PIC* is filled with fluid of volume *Vi*, equivalent to the total number of useful instructions (the *program volume*).

Firing of the exponential transition *TBRANCH* corresponds to a branch instruction occurrence. Its parameter *λ* changes at the beginning of each clock cycle and formally depends on both the number of tokens in *PCLASS* and the fetch rate:

$$\lambda = f\left(\#P_{CLASS}\right)\frac{r_{FETCH}}{W} = f\left(i\right)\frac{r_{FETCH}}{W} = \lambda_i\,\frac{r_{FETCH}}{W} \tag{6}$$

where *λi* is its upper limit for a given program class *i* at the maximum fetch rate (*rFETCH* = *W*). The branch is classified as easy-to-predict with probability *pBEP*, or hard-to-predict with probability *1-pBEP*. In either case, it is correctly predicted with probability *pBEPC* (*pBHPC*), or mispredicted with probability *1-pBEPC* (*1-pBHPC*). These probabilities are included in the FSPN model as weights assigned to the immediate transitions *TBEP*, *TBHP*, *TBEPC*, *TBHPC*, *TBEPMIS* and *TBHPMIS*, respectively. This approach is known as *synthetic branch prediction*. Branch mispredictions stall the fluid inflow for as many cycles as necessary to resolve the branch (*CBR* tokens in place *PBMIS*). Usually, a branch is not resolved until its execution stage (*CBR* = 3). With several consecutive firings of *TCLOCK*, these tokens are consumed one at a time and moved to *PRESOLVED*. As soon as the branch is resolved, transition *TCONTINUE* fires, a token appears in place *PFETCH* and the inflow resumes.
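The synthetic branch prediction mechanism just described can be sketched as a small Monte Carlo experiment; all probability values below are illustrative assumptions, not measured parameters:

```python
# Sketch of synthetic branch prediction: classify a branch as easy- or
# hard-to-predict (T_BEP / T_BHP), then resolve the prediction
# (T_BEPC / T_BHPC vs. T_BEPMIS / T_BHPMIS). A misprediction loads
# C_BR = 3 stall tokens into P_BMIS.
import random

C_BR = 3  # cycles until the branch is resolved

def branch_outcome(p_bep, p_bepc, p_bhpc, rng):
    """Return the number of stall cycles caused by one branch."""
    easy = rng.random() < p_bep               # classification step
    p_correct = p_bepc if easy else p_bhpc    # class-dependent accuracy
    mispredicted = rng.random() >= p_correct
    return C_BR if mispredicted else 0

rng = random.Random(42)
stalls = [branch_outcome(0.8, 0.98, 0.7, rng) for _ in range(10000)]
# With these assumed values the stall probability per branch is
# 0.8*0.02 + 0.2*0.3 = 0.076, i.e. the combined form of p_BMISi.
```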

Similarly, firing of the exponential transition *TCONSUMER* corresponds to the occurrence of a consuming instruction among the instructions that initiated execution. Its parameter *μ* changes at the beginning of each clock cycle and formally depends on both the number of tokens in *PCLASS* and the initiation rate:

$$\mu = g\left(\#P_{CLASS}\right)\frac{r_{INITIATE}}{W} = g\left(i\right)\frac{r_{INITIATE}}{W} = \mu_i\,\frac{r_{INITIATE}}{W} \tag{7}$$

where *μi* is its upper limit for a given program class *i* when the maximum possible number of instructions simultaneously initiate execution (*rINITIATE* = *W*). The consumed value is classified as easy-to-predict with probability *pVEP*, or hard-to-predict with probability *1-pVEP*. In either case, it is correctly predicted with probability *pVEPC* (*pVHPC*), or mispredicted with probability *1-pVEPC* (*1-pVHPC*). These probabilities are included in the FSPN model as weights assigned to the immediate transitions *TVEP*, *TVHP*, *TVEPC*, *TVHPC*, *TVEPMIS* and *TVHPMIS*, respectively. Whenever a misprediction occurs (token in place *PVMIS*), the consuming instruction has to be *rescheduled* for execution. The firing of immediate transition *TREEXECUTE* causes transportation of fluid in zero time. Fluid jumps have a deterministic height of 1 (one instruction) and take place when the fluid levels in *PRS/LSQ* and *PEX* satisfy the condition *ZRS/LSQ*(*t*) ≤ *ZRS/LSQmax* − 1 and *ZEX*(*t*) ≥ 1. Jumps that would go beyond the boundaries cannot be carried out. The arcs connecting fluid places and immediate transitions are drawn as thick single arrows. The fluid flow terminates at the end of the cycle when all the fluid places except *PREG* are empty and *TEND* fires.
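A minimal sketch of the *TREEXECUTE* fluid jump and its boundary guard, assuming the level variables shown below:

```python
# Sketch: on a value misprediction, one instruction's worth of fluid
# (height 1) moves from P_EX back to P_RS/LSQ in zero time, but only
# if neither an overflow nor a negative level would result.

def reexecute(z_rs, z_ex, z_rs_max, jump=1.0):
    """Apply the deterministic jump of height 1 if it stays in bounds."""
    if z_rs <= z_rs_max - jump and z_ex >= jump:
        return z_rs + jump, z_ex - jump
    return z_rs, z_ex  # a jump beyond the boundaries is not carried out

# Usage: jump applied; RS/LSQ full, so the jump is suppressed.
applied = reexecute(3.0, 2.0, 16.0)     # (4.0, 1.0)
blocked = reexecute(16.0, 2.0, 16.0)    # (16.0, 2.0)
```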

**2.3. Derivation of state equations**


When executing a class *i* program, the nodes *mi* of the reachability graph (Figure 2) consist of all the tangible discrete markings, as well as those in which the enabling of immediate transitions depends on fluid levels and cannot be eliminated, since they are of mixed tangible/vanishing type (Table 1). It is important to note that the number of discrete markings does not depend on the machine width in any way.


**Figure 2.** Reachability graph of the FSPN model

| Marking (*mi*) | #*PCLASS* | #*PFETCH* | #*PBMIS* | #*PINITIATE* | #*PVMIS* |
|----------------|-----------|-----------|----------|--------------|----------|
| *1i*           | *i*       | 1         | 0        | 1            | 0        |
| *2i*           | *i*       | 0         | 3        | 1            | 0        |
| *3i*           | *i*       | 0         | 2        | 1            | 0        |
| *4i*           | *i*       | 0         | 1        | 1            | 0        |
| *5i*           | *i*       | 1         | 0        | 0            | 1        |
| *6i*           | *i*       | 0         | 3        | 0            | 1        |
| *7i*           | *i*       | 0         | 2        | 0            | 1        |
| *8i*           | *i*       | 0         | 1        | 0            | 1        |
| *9i*           | *i*       | 0         | 0        | 0            | 0        |

**Table 1.** Discrete markings of the FSPN model (*CBR*=3)
A vector of fluid levels supplements the discrete markings. It gives rise to a stochastic process in continuous time with a continuous state space. The total amount of fluid contained in *PIC*, *PIB*, *PRS/LSQ*, *PEX* and *PREG* is always equal to *Vi*, and the amount of fluid contained in *PRR* (as well as *PROB*) is equal to the total amount of fluid in *PRS/LSQ* and *PEX*. Therefore, only the fluid levels *ZIB*(*t*), *ZRS/LSQ*(*t*), *ZEX*(*t*) and *ZREG*(*t*) are identified as four supplementary variables (components of the fluid vector **Z**(*t*)), which provide a full description of each state.

The instantaneous rates at which fluid builds in each fluid place are collected in diagonal matrices:

$$\begin{aligned}
\mathbf{R}_{IB} &= \mathbf{diag}\left(r_{FETCH}-r_{ISSUE},\,-r_{ISSUE},\,-r_{ISSUE},\,-r_{ISSUE},\,r_{FETCH}-r_{ISSUE},\,-r_{ISSUE},\,-r_{ISSUE},\,-r_{ISSUE},\,0\right)\\
\mathbf{R}_{RS/LSQ} &= \mathbf{diag}\left(r_{ISSUE}-r_{INITIATE},\,\dots,\,r_{ISSUE}-r_{INITIATE},\,0\right)\\
\mathbf{R}_{EX} &= \mathbf{diag}\left(r_{COMPLETE}-r_{COMMIT},\,\dots,\,r_{COMPLETE}-r_{COMMIT},\,0\right)\\
\mathbf{R}_{REG} &= \mathbf{diag}\left(r_{COMMIT},\,\dots,\,r_{COMMIT},\,0\right)
\end{aligned} \tag{8}$$
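Assuming the marking structure of Table 1 (a token in *PFETCH* only in markings *1i* and *5i*, and an absorbing marking *9i* in which all drifts are zero), the diagonal entries of the drift matrices in Eq. (8) can be assembled as follows; the function name and the list encoding are illustrative:

```python
# Sketch: build the diagonals of R_IB, R_RS/LSQ, R_EX and R_REG for the
# nine discrete markings from one cycle's flow rates. Fetch contributes
# only in markings 1i and 5i (0-based indices 0 and 4); in the stall
# markings the instruction buffer only drains at the issue rate.

def drift_matrices(r_fetch, r_issue, r_initiate, r_complete, r_commit):
    fetch_active = {0, 4}  # markings 1i and 5i
    r_ib = [(r_fetch if m in fetch_active else 0.0) - r_issue
            for m in range(8)] + [0.0]
    r_rs = [r_issue - r_initiate] * 8 + [0.0]
    r_ex = [r_complete - r_commit] * 8 + [0.0]
    r_reg = [r_commit] * 8 + [0.0]
    return r_ib, r_rs, r_ex, r_reg

# Illustrative rates taken from a cycle where r_FETCH = r_ISSUE = 4,
# r_INITIATE = r_COMPLETE = 3 and r_COMMIT = 2.
r_ib, r_rs, r_ex, r_reg = drift_matrices(4.0, 4.0, 3.0, 3.0, 2.0)
```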

The matrix of *transition rates* of exponential transitions causing the state changes is:

$$\mathbf{Q}_i = \begin{bmatrix}
-(\lambda p_{BMIS_i}+\mu p_{VMIS_i}) & \lambda p_{BMIS_i} & 0 & 0 & \mu p_{VMIS_i} & 0 & 0 & 0 & 0\\
0 & -\mu p_{VMIS_i} & 0 & 0 & 0 & \mu p_{VMIS_i} & 0 & 0 & 0\\
0 & 0 & -\mu p_{VMIS_i} & 0 & 0 & 0 & \mu p_{VMIS_i} & 0 & 0\\
0 & 0 & 0 & -\mu p_{VMIS_i} & 0 & 0 & 0 & \mu p_{VMIS_i} & 0\\
0 & 0 & 0 & 0 & -\lambda p_{BMIS_i} & \lambda p_{BMIS_i} & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

where:


$$p_{BMIS_i} = p_{BEP_i}\left(1 - p_{BEPC}\right) + \left(1 - p_{BEP_i}\right)\left(1 - p_{BHPC}\right)$$

and

$$p_{VMIS_i} = p_{VEP_i}\left(1 - p_{VEPC}\right) + \left(1 - p_{VEP_i}\right)\left(1 - p_{VHPC}\right) \tag{9}$$
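The combined misprediction probabilities *pBMISi* and *pVMISi* defined above are two-branch total-probability computations; a direct transcription with illustrative parameter values:

```python
# Sketch: overall misprediction probability for class i, combining the
# classification probability with the class-dependent accuracies.
# Parameter values below are illustrative assumptions.

def p_mis(p_easy, p_correct_easy, p_correct_hard):
    return p_easy * (1 - p_correct_easy) + (1 - p_easy) * (1 - p_correct_hard)

p_bmis = p_mis(0.8, 0.98, 0.7)   # branches: 0.8*0.02 + 0.2*0.3 = 0.076
p_vmis = p_mis(0.6, 0.95, 0.5)   # values:   0.6*0.05 + 0.4*0.5 = 0.23
```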

Let *πmi* be an abbreviation for the *volume density* *πmi*(*t*, *zIB*, *zRS/LSQ*, *zEX*, *zREG*), that is, the transient probability of being in discrete marking *mi* at time *t* with fluid levels in an infinitesimal environment around **z** = [*zIB* *zRS/LSQ* *zEX* *zREG*]. If **π***i* = [*π1i* *π2i* ... *π9i*], then according to [4-6] the evolution of the process is described by a coupled system of nine *partial differential equations* in four continuous dimensions plus time:

$$\frac{\partial \boldsymbol{\pi}_i}{\partial t} + \frac{\partial(\boldsymbol{\pi}_i\,\mathbf{R}_{IB})}{\partial z_{IB}} + \frac{\partial(\boldsymbol{\pi}_i\,\mathbf{R}_{RS/LSQ})}{\partial z_{RS/LSQ}} + \frac{\partial(\boldsymbol{\pi}_i\,\mathbf{R}_{EX})}{\partial z_{EX}} + \frac{\partial(\boldsymbol{\pi}_i\,\mathbf{R}_{REG})}{\partial z_{REG}} = \boldsymbol{\pi}_i\,\mathbf{Q}_i \tag{10}$$
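To give a feel for how a transport system like Eq. (10) can be integrated numerically, the following toy reduces it to one fluid dimension and two markings with a single exponential transition between them. The first-order upwind discretization and all numerical values are assumptions made for this sketch, not the solution method of the source:

```python
# Toy 1-D analogue of Eq. (10): two markings with constant positive
# drifts r1, r2 and an exponential transition (rate q) from marking 1
# to marking 2, solved with a first-order upwind scheme.

def step(p1, p2, r1, r2, q, dz, dt):
    """Advance the two density vectors by one time step."""
    n = len(p1)
    new1, new2 = p1[:], p2[:]
    for j in range(n):
        left1 = p1[j - 1] if j > 0 else 0.0   # nothing flows in at z = 0
        left2 = p2[j - 1] if j > 0 else 0.0
        adv1 = r1 * (p1[j] - left1) / dz      # upwind difference (drift > 0)
        adv2 = r2 * (p2[j] - left2) / dz
        new1[j] = p1[j] - dt * (adv1 + q * p1[j])     # mass leaves marking 1
        new2[j] = p2[j] - dt * adv2 + dt * q * p1[j]  # ...and enters marking 2
    return new1, new2

n, dz, dt = 50, 0.02, 0.01          # CFL number r*dt/dz = 0.5: stable
p1 = [0.0] * n
p1[0] = 1.0                         # all mass starts at z = 0 in marking 1
p2 = [0.0] * n
for _ in range(20):
    p1, p2 = step(p1, p2, r1=1.0, r2=1.0, q=0.5, dz=dz, dt=dt)
# Until the mass reaches the right boundary, total probability is
# conserved: sum(p1) + sum(p2) stays at 1 while mass drifts right and
# leaks from marking 1 into marking 2 at rate q.
```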

If **z**0 = [0 0 0 0] is the vector of initial fluid levels, the *initial conditions* are:

$$\begin{aligned} \pi_{1_i}(0, \mathbf{z}) &= \delta(\mathbf{z} - \mathbf{z}_0) \\ \pi_{m_i}(0, \mathbf{z}) &= 0 \quad (2 \le m \le 9) \end{aligned} \tag{11}$$

Since fluid jumps shift probability mass along the continuous axes (in addition to the discrete state change), firing of transition *TREEXECUTE* at time *t* can be seen as a *jump* to another location in the four-dimensional hypercube defined by the components of the fluid vector. It is described by *boundary conditions* that relate the volume densities immediately after the jump to those immediately before it.


$$\begin{aligned}
\pi_{1_i}(t^+, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) &= \pi_{1_i}(t^-, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) + \pi_{5_i}(t^-, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}) \\
\pi_{2_i}(t^+, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) &= \pi_{2_i}(t^-, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) + \pi_{6_i}(t^-, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}) \\
\pi_{3_i}(t^+, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) &= \pi_{3_i}(t^-, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) + \pi_{7_i}(t^-, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}) \\
\pi_{4_i}(t^+, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) &= \pi_{4_i}(t^-, z_{IB}, z_{RS/LSQ}+1, z_{EX}-1, z_{REG}) + \pi_{8_i}(t^-, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}) \\
\pi_{m_i}(t^+, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}) &= 0 \quad (\text{if } z_{RS/LSQ} \le Z_{RS/LSQ\max} - 1,\ z_{EX} \ge 1,\ 5 \le m \le 8)
\end{aligned} \tag{12}$$

The firing of transitions *TCLOCK* and *TCOUNT* at time *t0* causes switching from one discrete marking to another. Therefore:

$$\begin{aligned}
\pi_{1_i}(t_0^+, \mathbf{z}) &= \pi_{1_i}(t_0^-, \mathbf{z}) + \pi_{4_i}(t_0^-, \mathbf{z}) + \pi_{5_i}(t_0^-, \mathbf{z}) + \pi_{8_i}(t_0^-, \mathbf{z}) \\
\pi_{4_i}(t_0^+, \mathbf{z}) &= \pi_{3_i}(t_0^-, \mathbf{z}) + \pi_{7_i}(t_0^-, \mathbf{z}) \\
\pi_{3_i}(t_0^+, \mathbf{z}) &= \pi_{2_i}(t_0^-, \mathbf{z}) + \pi_{6_i}(t_0^-, \mathbf{z}) \\
\pi_{m_i}(t_0^+, \mathbf{z}) &= 0 \quad (m = 2, 5, 6, 7, 8)
\end{aligned} \tag{13}$$

where $\mathbf{z} = [z_{IB}\ z_{RS/LSQ}\ z_{EX}\ z_{REG}]$.

Similarly, the firing of transition *TEND*, when all the fluid places except *PREG* are empty, causes switching from any discrete marking to *9i*:

$$\pi_{9_i}(t_0^+, 0, 0, 0, V_i) = \sum_{m=1}^{9} \pi_{m_i}(t_0^-, 0, 0, 0, V_i), \qquad \pi_{m_i}(t_0^+, 0, 0, 0, V_i) = 0 \quad (m \le 8) \tag{14}$$

The *probability mass conservation law* is used as a normalization condition. It corresponds to the condition that the sum of all state probabilities must equal one. Since no particle can pass beyond barriers, the sum of integrals of the volume densities over the definition range evaluates to one:

$$\sum_{m=1}^{9} \int_{0}^{Z_{IB\max}} \int_{0}^{Z_{RS/LSQ\max}} \int_{0}^{Z_{EX\max}} \int_{0}^{V_i} \pi_{m_i}\left(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}\right) dz_{IB}\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG} = 1 \tag{15}$$

Let $M_i(t)$ be the state of the discrete marking process at time *t*. The *probabilities of the discrete markings* are obtained by integrating the volume densities:

$$\Pr\left\{M_i(t) = m_i\right\} = \int_{0}^{Z_{IB\max}} \int_{0}^{Z_{RS/LSQ\max}} \int_{0}^{Z_{EX\max}} \int_{0}^{V_i} \pi_{m_i}(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{IB}\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG} \quad (m \le 9) \tag{16}$$
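On a uniform grid, the integrals of Eqs. (15) and (16) reduce to weighted sums over the stored density values. A minimal Python sketch; the grid sizes and the uniform toy density are assumptions chosen purely to exercise the code, not chapter data:

```python
import numpy as np

def marking_probabilities(pi, dz):
    """Eq. (16): integrate each marking's 4-D volume density over the whole
    fluid domain (rectangle rule on a uniform grid with spacing dz per axis)."""
    # pi has shape (9, n_ib, n_rs, n_ex, n_reg): one grid per discrete marking
    return pi.sum(axis=(1, 2, 3, 4)) * dz ** 4

# A toy density: all probability mass placed uniformly in marking 1.
# Real densities would come from solving the PDE system of Eq. (10).
dz = 0.5
shape = (8, 16, 16, 100)                  # assumed grid extents per axis
pi = np.zeros((9, *shape))
pi[0] = 1.0 / (np.prod(shape) * dz ** 4)  # normalized uniform density
p = marking_probabilities(pi, dz)
# Eq. (15): the probabilities of all nine markings must sum to one
```

The same summation with an extra factor of the fluid level per node yields the expectations of Eq. (17).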

The *fluid levels* at the beginning of each clock cycle are computed as follows:


$$\begin{aligned}
\overline{Z}_{IB}(t) = E[Z_{IB}(t)] &= \int_0^{Z_{IB\max}} z_{IB} \sum_{m=1}^{9} \int_0^{Z_{RS/LSQ\max}} \int_0^{Z_{EX\max}} \int_0^{V_i} \pi_{m_i}(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG}\, dz_{IB} \\
\overline{Z}_{RS/LSQ}(t) = E[Z_{RS/LSQ}(t)] &= \int_0^{Z_{RS/LSQ\max}} z_{RS/LSQ} \sum_{m=1}^{9} \int_0^{Z_{IB\max}} \int_0^{Z_{EX\max}} \int_0^{V_i} \pi_{m_i}(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{IB}\, dz_{EX}\, dz_{REG}\, dz_{RS/LSQ} \\
\overline{Z}_{EX}(t) = E[Z_{EX}(t)] &= \int_0^{Z_{EX\max}} z_{EX} \sum_{m=1}^{9} \int_0^{Z_{IB\max}} \int_0^{Z_{RS/LSQ\max}} \int_0^{V_i} \pi_{m_i}(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{IB}\, dz_{RS/LSQ}\, dz_{REG}\, dz_{EX} \\
\overline{Z}_{REG}(t) = E[Z_{REG}(t)] &= \int_0^{V_i} z_{REG} \sum_{m=1}^{9} \int_0^{Z_{IB\max}} \int_0^{Z_{RS/LSQ\max}} \int_0^{Z_{EX\max}} \pi_{m_i}(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{IB}\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG}
\end{aligned} \tag{17}$$

where the inner sums of integrals are the *marginal densities* for *ZIB*, *ZRS/LSQ*, *ZEX* and *ZREG*, respectively, and the remaining levels follow from conservation: $\overline{Z}_{IC}(t) = V_i - \overline{Z}_{IB} - \overline{Z}_{RS/LSQ} - \overline{Z}_{EX} - \overline{Z}_{REG}$ and $\overline{Z}_{RR}(t) = \overline{Z}_{ROB}(t) = \overline{Z}_{RS/LSQ} + \overline{Z}_{EX}$.

Finally, the flow rates and the parameters *pBMISi* and *pVMISi* are computed as indicated by Eqs. 1-4, 6 and 7, respectively.

#### **2.4. Performance measures**

234 Petri Nets – Manufacturing and Computer Science


Let τ be a random variable representing the time to absorption into the set $A = \{m_i \mid \pi_{m_i}(t, 0, 0, 0, V_i) = 1\}$. The *distribution of the execution time* of a program with volume *Vi* is:

$$F_{t_{EX_i}}(t) = \Pr\left\{\tau \le t \wedge M_i(t) \in A\right\} = \Pr\left\{M_i(t) = 9_i\right\} = \int_0^{Z_{IB\max}} \int_0^{Z_{RS/LSQ\max}} \int_0^{Z_{EX\max}} \int_0^{V_i} \pi_{9_i}\left(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG}\right) dz_{IB}\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG} \tag{18}$$

with *mean execution time*:

$$\overline{t}_{EX_i} = \int_{0}^{\infty} \left(1 - F_{t_{EX_i}}(t)\right) dt \tag{19}$$

Consequently, the *sustained number of instructions per cycle* (*IPC*) is given by:

$$IPC_i = V_i \big/ \overline{t}_{EX_i} \tag{20}$$

When the input space is partitioned, *IPC* is the ratio between the average volume and the average execution time of all the programs of different classes, as indicated by the operational profile:

$$IPC = \sum_{k=1}^{n} V_k \cdot \hat{w}_{T_{CLASS_k}} \bigg/ \sum_{k=1}^{n} \overline{t}_{EX_k} \cdot \hat{w}_{T_{CLASS_k}} \tag{21}$$
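Eqs. (19)-(21) are straightforward to evaluate once the execution-time cdf is available. A hedged Python sketch with a stand-in exponential cdf and made-up program volumes and class weights; none of these numbers come from the chapter:

```python
import numpy as np

def mean_execution_time(t, F):
    """Eq. (19): integrate the complementary cdf 1 - F(t) with the trapezoid rule."""
    c = 1.0 - np.asarray(F, dtype=float)
    return float(np.sum(0.5 * (c[1:] + c[:-1]) * np.diff(t)))

def ipc_single(V_i, t_ex_i):
    """Eq. (20): sustained IPC of a single program with volume V_i."""
    return V_i / t_ex_i

def ipc_profile(V, t_ex, w):
    """Eq. (21): IPC over an operational profile with class weights w."""
    V, t_ex, w = (np.asarray(x, dtype=float) for x in (V, t_ex, w))
    return float(np.sum(V * w) / np.sum(t_ex * w))

t = np.linspace(0.0, 200.0, 2001)
F = 1.0 - np.exp(-t / 20.0)       # a stand-in execution-time cdf
t_ex = mean_execution_time(t, F)  # close to 20 cycles for this cdf
print(ipc_single(50, t_ex))                              # Eq. (20)
print(ipc_profile([50, 200], [20.0, 90.0], [0.7, 0.3]))  # Eq. (21)
```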

The sum of probabilities of the discrete markings that do not carry a token in place *PFETCH* gives the *probability of a stall* in the instruction fetch unit at time *t*:



$$P_{STALL_i}(t) = \Pr\{M_i(t) \neq 1_i \wedge M_i(t) \neq 5_i\} = \sum_{\substack{m \neq 1 \\ m \neq 5}} \int_0^{Z_{IB\max}} \int_0^{Z_{RS/LSQ\max}} \int_0^{Z_{EX\max}} \int_0^{V_i} \pi_{m_i}(t, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{IB}\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG} \tag{22}$$

Because of the discrete nature of pipelining, additional attention should be given to the probability that no useful instructions will be added to the instruction buffer in the cycle beginning at time *t*0 (complete stall in the instruction fetch unit that can lead to an effectively empty instruction buffer) due to branch misprediction. It can be obtained by summing up the probabilities of the discrete markings that still carry one or more tokens in place *PBMIS* immediately after firing of *TCLOCK*:

$$P_{NO\_FETCH_i}(t_0) = \Pr\{M_i(t_0) = 3_i \vee M_i(t_0) = 4_i\} = \sum_{m=3}^{4} \int_0^{Z_{IB\max}} \int_0^{Z_{RS/LSQ\max}} \int_0^{Z_{EX\max}} \int_0^{V_i} \pi_{m_i}(t_0, z_{IB}, z_{RS/LSQ}, z_{EX}, z_{REG})\, dz_{IB}\, dz_{RS/LSQ}\, dz_{EX}\, dz_{REG} \tag{23}$$

In addition, the *execution efficiency* is introduced, taken as a ratio between the number of useful instructions and the total number of instructions executed during the course of a program's execution:

$$\eta_{EX_i} = \frac{V_i}{V_i + \left(\substack{\text{instructions reexecuted} \\ \text{due to value misprediction}}\right)} \approx \frac{V_i}{V_i + \dfrac{\mu_i}{W} \cdot \overline{r}_{INITIATE_i} \cdot p_{VMIS_i} \cdot \overline{t}_{EX_i}} = \frac{1}{1 + \dfrac{\mu_i \cdot \overline{r}_{INITIATE_i} \cdot p_{VMIS_i}}{W \cdot IPC_i}} \tag{24}$$

where $\overline{r}_{INITIATE_i}$ is the *average initiation rate*.
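The closed form on the right of Eq. (24) is easy to evaluate. A small Python sketch at a hypothetical operating point; all parameter values below are assumptions for illustration:

```python
def execution_efficiency(mu_i, r_initiate, p_vmis, W, ipc_i):
    """Eq. (24): useful fraction of all executed instructions; the denominator
    accounts for instructions reexecuted due to value misprediction."""
    return 1.0 / (1.0 + (mu_i * r_initiate * p_vmis) / (W * ipc_i))

# Hypothetical operating point (not chapter data):
eta = execution_efficiency(mu_i=1.0, r_initiate=2.0, p_vmis=0.1, W=4, ipc_i=2.0)
# -> 1 / (1 + 0.2/8) ≈ 0.976: roughly 2.4% of executed instructions are redone
```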

#### **2.5. Numerical experiments and performance evaluation results**

We have used *finite difference approximations* to replace the derivatives that appear in the PDEs: *forward* difference approximation for the time derivative and first-order *upwind* differencing for the space derivatives, in order to improve the stability of the method [7,8]:

$$\begin{aligned} \frac{\partial \pi_{m_i}(t, z_1, \ldots, z_n)}{\partial t} &\approx \frac{\pi_{m_i}(t + \Delta t, z_1, \ldots, z_n) - \pi_{m_i}(t, z_1, \ldots, z_n)}{\Delta t} \\ r \cdot \frac{\partial \pi_{m_i}(t, z_1, \ldots, z_k, \ldots, z_n)}{\partial z_k} &\approx r \cdot \operatorname{sgn}(r) \cdot \frac{\pi_{m_i}(t, z_1, \ldots, z_k, \ldots, z_n) - \pi_{m_i}(t, z_1, \ldots, z_k - \operatorname{sgn}(r)\Delta z_k, \ldots, z_n)}{\Delta z_k} \end{aligned} \tag{25}$$
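To illustrate how the upwind rule of Eq. (25) behaves, the following Python fragment advances a one-dimensional advection equation. This is a toy stand-in for a single fluid dimension, not the full four-dimensional solver; the grid, rate and initial bump are illustrative assumptions:

```python
import numpy as np

def upwind_step(pi, r, dz, dt):
    """One explicit step of  d(pi)/dt + r * d(pi)/dz = 0  using the first-order
    upwind difference of Eq. (25): pi[k] - pi[k - sgn(r)], scaled by |r|/dz.
    A periodic wrap at the edges keeps the sketch short."""
    s = int(np.sign(r))
    return pi - dt * abs(r) * (pi - np.roll(pi, s)) / dz

z = np.linspace(0.0, 10.0, 201)
dz = z[1] - z[0]
pi = np.exp(-((z - 2.0) ** 2) / 0.5)   # a smooth initial density bump at z = 2
for _ in range(100):                   # advance to t = 100 * 0.04 = 4
    pi = upwind_step(pi, r=1.0, dz=dz, dt=0.04)
# The bump drifts toward z = 6 at speed r = 1, while its peak is lowered by
# the scheme's first-order numerical dissipation.
```

The step sizes satisfy |r|·Δt/Δz = 0.8 ≤ 1, the usual stability condition for an explicit upwind scheme; total mass is preserved exactly on the periodic grid.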

The explicit discretization of the right-hand-side coupling term allows the equations for each discrete state to be solved separately before going on to the next time step. The discretization is carried out on a hypercube of size $Z_{IB\max} \times Z_{RS/LSQ\max} \times Z_{EX\max} \times V_i$ with step size Δ*z* in the directions of *zIB*, *zRS/LSQ*, *zEX* and *zREG*, and step size Δ*t* in time. The computational complexity for the solution is

$$O\!\left(8 \cdot \frac{t}{\Delta t} \cdot \frac{Z_{IB\max} \cdot Z_{RS/LSQ\max} \cdot Z_{EX\max} \cdot V_i}{\Delta z^4}\right) \text{ floating-point operations,}$$

since for each of the $t/\Delta t$ time steps we must increment each solution value in the four-dimensional grid for eight of the nine discrete markings. The storage requirements of the algorithm are at least

$$8 \cdot \frac{Z_{IB\max} \cdot Z_{RS/LSQ\max} \cdot Z_{EX\max} \cdot V_i}{\Delta z^4} \cdot 4 \text{ bytes,}$$

since for eight of nine discrete markings we must store a four-dimensional grid of floating-point numbers (solutions at successive time steps can be overwritten).
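The two bounds above can be evaluated for a concrete configuration. In the sketch below the capacities (4 for the instruction buffer, 8 for the others) and the horizon of 40 cycles are assumptions chosen for illustration:

```python
def flop_count(t, dt, z_ib_max, z_rs_lsq_max, z_ex_max, v_i, dz):
    """Operation-count bound: 8 markings, t/dt steps, and a four-dimensional
    grid with (z_ib_max * z_rs_lsq_max * z_ex_max * v_i) / dz**4 nodes."""
    return 8 * (t / dt) * (z_ib_max * z_rs_lsq_max * z_ex_max * v_i) / dz ** 4

def storage_bytes(z_ib_max, z_rs_lsq_max, z_ex_max, v_i, dz, bytes_per_value=4):
    """Minimum storage: 8 four-dimensional grids of 4-byte floats."""
    return 8 * (z_ib_max * z_rs_lsq_max * z_ex_max * v_i) / dz ** 4 * bytes_per_value

# Assumed configuration: capacities 4, 8, 8, a program of V_i = 50
# instructions, dz = 0.5, and dt = dz / (n * W) with n = 4 and W = 4.
dz = 0.5
dt = dz / (4 * 4)
print(storage_bytes(4, 8, 8, 50, dz))        # ≈ 6.5 MB of grid data
print(flop_count(40, dt, 4, 8, 8, 50, dz))   # over a horizon of 40 cycles
```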

Unless indicated otherwise, $Z_{IB\max} = W$, $Z_{RS/LSQ\max} = Z_{RR\max} = Z_{ROB\max} = Z_{EX\max} = 2W$ and $\Delta t = \Delta z/(nW)$, where *n*=4 is the number of continuous dimensions. With these capacities of fluid places, virtually all name dependences and structural conflicts are eliminated. Step size Δ*z* is varied between Δ*z* = 1/2 (coarser grid, usually when the prediction accuracy is high) and Δ*z* = 1/6 (finer grid, usually when the prediction accuracy is low).

Considering a low-volume program (*Vi*=50 instructions) executed on a four-wide machine (*W*=4), we investigate:

- the influence of branch prediction accuracy on the distribution of the program's execution time, when value prediction is not involved (Figure 3a),
- the influence of branch prediction accuracy on the probability of a complete stall in the instruction fetch unit (Figure 3b), and
- the influence of value prediction accuracy on the distribution of the program's execution time, when *perfect* branch prediction is involved (Figure 4).
It is indisputably clear that both branch and value prediction accuracy improvements reduce the mean execution time of a program and increase performance. As an illustration, the size of the shaded area in Figure 3a is equal to the mean execution time when perfect branch prediction is involved, and IPC is computed as indicated by Eq. (20). In addition, looking at Figure 3b one can see that the probability of a complete stall in the instruction fetch unit, which can lead to an empty instruction buffer in the subsequent cycle, decreases with branch prediction accuracy improvement. As a result, both the utilization of the processor and the size of dynamic scheduling window increase as branch prediction accuracy increases.

The correctness of the discretization method is verified by comparing the numerical transient analysis results with the results obtained by discrete-event simulation, which is specifically implemented for this model and not for a general FSPN. The types of events that need to be scheduled in the event queue are either *transition firings* or the *hitting of a threshold* dependent on fluid levels. We have used a *Unif[0,1]* pseudo-random number generator to generate samples from the respective cumulative distribution functions and determine transition firing times via inversion of the *cdf* ("Golden Rule for Sampling"). Discrete-event simulation alone has been used to obtain performance evaluation results for wide machines with much more aggressive instruction issue (*W*>>1).
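Inversion of the cdf can be sketched as follows. The exponential distribution and its rate are illustrative stand-ins, since each timed transition in the model has its own distribution:

```python
import math
import random

def sample_by_inversion(inverse_cdf, rng=random.random):
    """'Golden Rule for Sampling': draw u ~ Unif[0,1] and return F^-1(u)."""
    return inverse_cdf(rng())

# Example: an exponentially distributed firing time with rate lam,
# F(t) = 1 - exp(-lam*t), hence F^-1(u) = -ln(1 - u) / lam.
# The rate value is an assumption for illustration.
lam = 1.5
exp_inverse = lambda u: -math.log(1.0 - u) / lam
firing_time = sample_by_inversion(exp_inverse)
```

Since `random.random()` returns values in [0, 1), the argument of the logarithm never reaches zero and the sampled firing times are always finite and nonnegative.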


**Figure 3.** Influence of branch prediction accuracy on (a) the distribution of the program's execution time and (b) the probability of a complete stall in the instruction fetch unit

**Figure 4.** Influence of value prediction accuracy on the distribution of the program's execution time

It takes quite some effort to tune the numerical algorithm parameters appropriately, so that a sufficiently accurate approximation is obtained. Various discretization and convergence errors may cancel each other, so that sometimes a solution obtained on a coarse grid may agree better with the discrete-event simulation than a solution on a finer grid – which, by definition, should be more accurate. In Figures 5a-b, a comparison of discretization results and results obtained using discrete-event simulation for a four-wide machine is given. Furthermore, Figure 5c shows the performance of several machines with realistic predictors executing a program with an average basic block size of eight instructions, given that about 25% of the instructions that initiate execution in the same clock cycle are consuming instructions.


**Figure 5.** Comparison of numerical transient analysis results (NTA) and results given by discrete-event simulation (DES)

Since the conservation of probability mass is enforced, the differences between the numerical transient analysis and the discrete-event simulation results arise only from the improper distribution of the probability mass over the solution domain. Due to the inherent dissipation error of the first-order accurate numerical methods, the solution at successive


time steps is more or less dissipated to neighboring grid nodes. The phenomenon is emphasized when the number of discrete state changes is increased owing to the larger number of mispredictions.


**Figure 6.** Speedup achieved by branch prediction with varying accuracy


**Figure 7.** Speedup achieved by value prediction with varying accuracy

In addition, we investigate branch and value prediction efficiency with varying machine width (Figures 8a-c). The speedup in this case is computed by dividing the IPC achieved in a machine by the IPC achieved in a *scalar* counterpart (*W*=1, *μi*=0). The speedup due to branch prediction is obviously higher in wider machines. With perfect branch prediction, the speedup unconditionally increases with the machine width. For a given width, the speedup is higher when there is a smaller number of consuming instructions (low *μi*/*W*).


With realistic branch prediction, there is a threshold effect on the machine width: below the

The results are satisfactorily close to each other, especially when the prediction accuracy is high, which is common in recent architectures. Yet, we believe that much work remains to be done and many questions are still open for further research on strategies for reducing the amount of memory needed to represent the volume densities, as well as on efficient discretization schemes for numerical transient analysis of general FSPNs. *Alternating direction implicit* (ADI) methods [19] have been suggested to save memory, and parallelization of the numerical algorithms to reduce runtime.

In the remainder of this part, we do not distinguish the numerical transient analysis results from the results given by discrete-event simulation of the FSPN model. Initially we analyze the efficiency of branch prediction by varying branch prediction accuracy; value prediction is not involved at all. The speedup is computed by dividing the IPC achieved with a certain branch prediction accuracy by the IPC achieved without branch prediction (1 − *pBMISi* = 0). For the moment, the input space is not partitioned and the program volume is set to *Vi*=10<sup>6</sup> instructions.

Looking at Figures 6a-b, it is observed that the branch prediction curves have an exponential shape. Therefore, building branch predictors that improve the accuracy just a little may be reflected in a significant performance increase. The impact of a given increment in accuracy is more noticeable when it occurs beyond 90% accuracy. Another conclusion drawn from these figures is that one can benefit most from branch prediction in programs with relatively short basic blocks (high *λi*/*W*) which do not suffer excessively from true data dependences (low *μi*/*W*). When the ratio *μi*/*W* is high, true data dependences overshadow control dependences. As a result, the amount of ILP that is expected without value prediction in a machine with extremely aggressive instruction issue is far below the maximum possible value, even with perfect branch prediction. Value prediction has to be involved to go beyond the limits imposed by true data dependences.

Next, we analyze the efficiency of value prediction by varying value prediction accuracy (Figures 7a-b). The speedup is computed by dividing the IPC achieved with a certain value prediction accuracy by the IPC achieved without value prediction (1 − *pVMISi* = 0). With perfect branch prediction, it seems clear that the value prediction curves have a linear behavior. Therefore, it is worthwhile to build a predictor that significantly improves the accuracy: a small improvement in value predictor accuracy has only a small impact on ILP processor performance, regardless of the accuracy range. Another conclusion drawn from these figures is that the effect of value prediction is more noticeable when a significant number of instructions consume results of simultaneously initiated producer-instructions during execution (high *μi*/*W*), i.e. when true data dependences have a much higher influence on the program's total execution time.
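
The linear trend under perfect branch prediction also follows from a fluid-style argument: if up to *W* instructions per cycle could issue but *μi* of them are consumers, a (1 − accuracy) fraction of those consumers still stalls. A sketch of this back-of-the-envelope approximation (ours, with assumed parameters; not the FSPN):

```python
def ipc_with_value_pred(width, mu, accuracy):
    """Sustained outflow with perfect branch prediction: of the mu consuming
    instructions per cycle, a (1 - accuracy) fraction still stalls on a
    true data dependence, so IPC is linear in the accuracy."""
    return width - mu * (1.0 - accuracy)
```

Equal accuracy increments give equal IPC increments, hence the linear speedup curves of Figures 7a-b.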

Branch prediction has a very important influence on the benefits of value prediction. One can see that the performance increase is less significant when branch prediction is realistic. Because mispredicted branches limit the number of useful instructions that enter the instruction window, the processor is able to deliver almost the same number of instructions out of the instruction window even with lower value prediction accuracy. As a result, the graphs tend to flatten out. Correct value predictions can only be exploited when the fetch rate is quite high, i.e. when mispredicted branches are infrequent. Branch misprediction becomes a more significant performance limitation with wider processors (Figure 7b).
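
This clipping effect can be sketched by taking the minimum of the two limits: fetch supply under imperfect branch prediction and issue demand under imperfect value prediction. The toy model below is ours, with assumed constants; it is not the FSPN solution.

```python
def ipc_combined(width, mu, bp_acc, vp_acc, branch_frac=1/8, penalty=5):
    """IPC limited by the smaller of (a) average fetch supply, where each run
    of useful instructions ends in a misprediction penalty, and (b) issue
    demand reduced by unpredicted consumer stalls."""
    if bp_acc >= 1.0:
        supply = float(width)
    else:
        run = 1.0 / (branch_frac * (1.0 - bp_acc))  # instructions per misprediction
        supply = run / (run / width + penalty)
    demand = width - mu * (1.0 - vp_acc)
    return min(supply, demand)
```

With *W* = 8, *μi* = 4 and branch accuracy 0.90, perfect value prediction lifts IPC only from 4 to about 5.3 (speedup ≈ 1.33) instead of from 4 to 8 (speedup 2) under perfect branch prediction, so the value prediction curves flatten.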

**Figure 6.** Speedup achieved by branch prediction with varying accuracy

At each of the time steps, probability mass is more or less dissipated to neighboring grid nodes. The phenomenon is emphasized when the number of discrete state changes is increased owing to the larger number of mispredictions.

**Figure 7.** Speedup achieved by value prediction with varying accuracy

In addition, we investigate branch and value prediction efficiency with varying machine width (Figures 8a-c). The speedup in this case is computed by dividing the IPC achieved in a machine by the IPC achieved in a *scalar* counterpart (*W*=1, *μi*=0). The speedup due to branch prediction is obviously higher in wider machines. With perfect branch prediction, the speedup unconditionally increases with the machine width. For a given width, the speedup is higher when there are fewer consuming instructions (low *μi*/*W*). With realistic branch prediction, there is a threshold effect on the machine width: below the threshold the speedup increases with the machine width, whereas above the threshold the speedup is close to a limit, since the machine width is by far larger than the average number of instructions provided by the fetch unit. The threshold decreases with an increasing number of mispredicted branches.
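
The threshold can be made concrete with a back-of-the-envelope fetch model (assumed constants, not the FSPN): below some width the fetch time of a misprediction-free run dominates, above it the fixed misprediction penalty does, so IPC approaches run/penalty no matter how wide the machine is.

```python
def fetch_limited_ipc(width, branch_frac=1/8, accuracy=0.98, penalty=5):
    """A run of 1/(branch_frac * (1 - accuracy)) useful instructions costs
    run/width cycles to fetch, plus a fixed misprediction penalty."""
    if accuracy >= 1.0:
        return float(width)  # perfect prediction: fetch never stalls
    run = 1.0 / (branch_frac * (1.0 - accuracy))
    return run / (run / width + penalty)
```

With accuracy 0.98 the limit is 400/5 = 80 instructions per cycle: widening the machine from 1024 to 4096 gains almost nothing, while with perfect accuracy IPC keeps scaling with the width.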

**Figure 8.** Speedup achieved by branch prediction with varying machine width (*W* from 4 to 4096; panels with *λi* = *W*/4 or *W*/8 and *μi* = *W*/2 or 7*W*/8; 1−*pVMISi* = 0; curves for branch prediction accuracies 0.90, 0.94, 0.98 and 1)

The maximum *additional speedup* that value prediction can provide is computed by dividing the IPC achieved with perfect value prediction by the IPC achieved without value prediction (Figures 9a-c). With perfect branch prediction, some true data dependences can always be eliminated, regardless of the machine width. In fact, the maximum additional speedup is predetermined by the ratio *W*/(*W* − *μi*). However, with realistic branch prediction, the additional speedup diminishes when the machine width is above a threshold value. This happens earlier when there are fewer consuming instructions and/or a larger number of mispredicted branches. In either case, the number of independent instructions examined for simultaneous execution is sufficiently higher than the number of fetched instructions that enter the instruction window. Again, branch prediction becomes more important with wider processors.
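
In the same fluid view, perfect value prediction raises the sustained outflow from *W* − *μi* to *W*, which is where the bound *W*/(*W* − *μi*) comes from. Worked arithmetic, as a sketch under that assumption:

```python
def max_additional_speedup(width, mu):
    # IPC with perfect value prediction (width) over IPC without it (width - mu)
    return width / (width - mu)
```

For *μi* = *W*/2 the bound is 2, and for *μi* = 7*W*/8 it is 8, independent of the machine width.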

**Figure 9.** Additional speedup achieved by perfect value prediction with varying machine width (*W* from 4 to 4096; panels with *λi* = *W*/4 or *W*/8 and *μi* = *W*/2 or 7*W*/8; 1−*pVMISi* = 1)

The rate at which consuming instructions occur depends on the initiation rate. Therefore, we also investigate the value prediction efficiency with varying instruction window size (varying capacity *ZRS/LSQmax* of the fluid place *PRS/LSQ*) (Figures 10a-b). The speedup is computed in the same way as in the previous instance. It increases with the instruction window size in [*W*, 2*W*], but the increase is more moderate when there are fewer consuming instructions (low *μi*/*W*) and/or branch prediction is not perfect. As the instruction window grows larger, performance without value prediction saturates, as does the performance with perfect value prediction. The upper limit emerges from the fact that in each cycle up to *W* new instructions may enter the fluid place *PRS/LSQ* and up to *W* consuming instructions may be forced to retain their reservation stations. One should also note that the speedup for *W*>>1 and realistic branch prediction is almost constant with increasing instruction window size. Two scenarios arise in this case: (1) the number of consuming instructions is large, so the speedup is constant but still noticeable, as there are not enough independent instructions in the window without value prediction; and (2) the number of consuming instructions is small, so there is no speedup, as there are enough independent instructions in the window even without value prediction, regardless of the window size. Again, the main reasons for this behavior are the small number of consuming instructions and the large number of mispredicted branches.

constant disruptions of the streaming data delivery. This peer churn has a strong influence on the quality of the offered services, especially for P2P systems that offer live video broadcast. P2P network members are also heterogeneous in their upload bandwidth capabilities and make quite different contributions to the overall system performance. Efficient construction of a P2P live video streaming network requires reducing data latency as much as possible, in order to disseminate the content in a live manner. The first source of latency is the network infrastructure itself: the sum of serialization latency, propagation delay, router processing delay and router queuing delay. The second is the initial start-up delay required for filling the peer's buffer prior to the start of video playback. The buffer provides short-term storage for video chunks, which often arrive out of order and/or late, and resolving this latency issue requires careful buffer modeling and management. Buffer size therefore requires precise dimensioning: larger buffers compensate better for out-of-order arrival and latency, but introduce a larger video playback delay, whereas small buffers offer a smaller playback delay but make the system more error prone. Also, since the connections between participating peers in these P2P logical networks are maintained by means of control message exchange, the buffer content (buffer map) is incorporated in these control messages and used for acquiring missing chunks. Chunk requesting and forwarding is controlled by a chunk scheduling algorithm, responsible for on-time chunk acquisition and delivery among the neighboring peers, usually based on their available content and bandwidth.
A lot of research activity is strictly focused on designing better chunk scheduling algorithms [26,27], and these works demonstrate that a carefully composed scheduling algorithm can significantly compensate for churn or bandwidth/latency disruptions. Besides the basic coding schemes, in recent years an increasing number of P2P live streaming protocols use *Scalable Video Coding* (SVC) technologies. SVC is an emerging paradigm in which the video stream is split into several sub-streams, each contributing to one or more characteristics of the video content in terms of temporal, spatial and SNR/quality scalability. Two different concepts of SVC are mainly in use: *Layered Video Coding* (LVC), where the video stream is split into several dependently decodable sub-streams called *Layers*, and *Multiple Description Coding* (MDC), where the video stream is split into several independently decodable sub-streams called *Descriptions*. A number of P2P video streaming models use LVC [27] or MDC [23,28] and report promising results.

As a base for our modeling we use the work in [29,30], where several important terms are defined. One of them is the maximum achievable rate that can be streamed to any individual peer at a given time, which is presented in Eq. (26):

$$r_{MAX} = \min\left\{ r_{SERVER},\; \frac{r_{SERVER} + \sum_{i=1}^{n} r_{P_i}}{n} \right\} \tag{26}$$
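
In code, the maximum achievable streaming rate of Eq. (26) is a one-liner (a sketch; the function name and example rates are ours):

```python
def max_stream_rate(r_server, peer_upload_rates):
    """Maximum achievable streaming rate per Eq. (26): the server's upload
    rate, or the aggregate upload capacity (server plus all peers) shared
    among the n participating peers, whichever is smaller."""
    n = len(peer_upload_rates)
    return min(r_server, (r_server + sum(peer_upload_rates)) / n)
```

For example, a 4 Mbps server and three peers uploading at 1, 1 and 2 Mbps can sustain at most (4 + 4)/3 ≈ 2.67 Mbps per peer, while with only two well-provisioned peers the server rate itself becomes the bottleneck.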

**Figure 10.** Speedup achieved by perfect value prediction with varying instruction window size

In order to investigate the influence of the operational environment, we partitioned the input space into several program classes, each differing in at least one aspect: branch rate, consuming instruction rate, probability to classify a branch as easy-to-predict, or probability to classify a value as easy-to-predict. We concluded that the set of programs executed on a machine has a considerable influence on the *perceived IPC*. Since the term *program* may be used interchangeably with the term *instruction stream*, these observations give good reason for analyzing the time-varying behavior of programs in order to find simulation points in applications that yield results representative of the program as a whole. From a user perspective, a machine with more sophisticated prediction mechanisms will not always lead to a higher *perceived performance* than a machine with more modest prediction mechanisms but a more favorable operational profile [20,21].

**3.1. Model definition**
