referred to as D–timed nets [18]; in the second, for the (negative) exponential distribution of firing times, the nets are called M–timed nets (Markovian nets [17]). In both cases, the concepts of state and state transitions have been formally defined and used in the derivation of different performance characteristics of the model [14]. Only D–timed Petri nets are used in this paper.

The firing times of some transitions may be equal to zero, which means that the firings are instantaneous; all such transitions are called *immediate* while the others are called *timed*. Since the immediate transitions have no tangible effect on the (timed) behavior of the model, it is convenient to split the set of transitions into two parts, the set of immediate and the set of timed transitions, and to fire the (enabled) immediate transitions first; only when no more immediate transitions are enabled are the firings of (enabled) timed transitions initiated (still in the same instant of time). It should be noted that this convention effectively gives immediate transitions priority over timed ones, so conflicts between immediate and timed transitions should be avoided. Consequently, the free–choice and conflict classes of transitions must be "uniform", i.e., all transitions in each such class must be either immediate or timed, but not both.

Performance analysis of net models can be based on their behavior (i.e., the set of reachable states) or on the structure of the net; the former is called *reachability analysis* and the latter *structural analysis*. For reachability analysis, the state space of the analyzed model must be finite and reasonably small, while for structural analysis the model must satisfy a number of structural conditions. However, since timed Petri net models are discrete–event systems, their analysis can also be based on discrete–event simulation, which imposes very few restrictions on the class of analyzed models. All performance characteristics of simultaneous multithreading presented in Section 4 are obtained by event–driven simulation [19] of the timed Petri net models shown in the next section.

**3. Models of simultaneous multithreading**

A timed Petri net model of a simple multithreaded processor is shown in Fig.1 (as usual, timed transitions are represented by solid bars, and immediate ones by thin bars).

For simplicity, Fig.1 shows only one level of memory; this simplification is removed later in this section.

*Ready* is a pool of available threads; it is assumed that the number of threads is constant and does not change during program execution (this assumption is motivated by steady–state considerations). If the processor is idle (place *Next* is marked), one of the available threads is selected for execution (transition *Tsel*). *Cont*, if marked, indicates that an instruction is ready to be issued to the execution pipeline. Instruction execution is modeled by transition *Trun*, which represents the first stage of the execution pipeline. It is assumed that once an instruction enters the pipeline, it will progress through the stages and, eventually, leave the pipeline; since these pipeline implementation details are not important for performance analysis of the processor, they are not represented here.

**Figure 1.** Petri net model of a multithreaded processor.

*Done* is another free–choice place, which determines whether or not the current instruction performs a long–latency access to memory. If the current instruction is a non–long–latency one, *Tnxt* occurs (with the corresponding probability), and another instruction is fetched for issuing. *Pnxt* is a free–choice place with three possible outcomes: *Tst0* (with the choice probability *ps*0) represents issuing an instruction without any further delay; *Tst1* (with the choice probability *ps*1) represents a single–cycle pipeline stall (modeled by *Td1*); and *Tst2* (with the choice probability *ps*2) represents a two–cycle pipeline stall (*Td2* and then *Td1*). Other pipeline stalls could be represented in a similar way, if needed.

If a long–latency operation is detected in the issued instruction, *Tend* initiates two concurrent actions: (i) context switching, performed by an occurrence of *Tcsw*, after which a new thread is selected for execution (if one is available), and (ii) a memory access request is entered into *Mreq*, the memory queue; after the memory access (transition *Tmem*) completes, the thread, suspended for the duration of the access, becomes "ready" again and joins the pool of threads *Ready*. *Tmem* will typically represent a cache miss (with all its consequences); cache hits (at the first–level cache memory) are not considered long–latency operations.
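Taken together, the behavior described above can be sketched as a small cycle-level Monte Carlo program. The function below is only an illustrative approximation for a processor with one issue unit, one pipeline and one memory port; the parameter names follow Table 1, but the simplified handling of context switching and of the memory port are assumptions of this sketch, not the event-driven simulator used for the results in Section 4.

```python
import random

# Cycle-level sketch of the net in Fig.1 (one issue unit, one pipeline,
# one memory port). Parameter names follow Table 1.
def simulate(nt, lt=10, tm=5, tcs=1, ps1=0.2, ps2=0.1,
             cycles=100_000, seed=42):
    rng = random.Random(seed)
    ready = nt          # tokens in Ready (threads ready to execute)
    pending = []        # completion cycles of outstanding memory accesses
    mem_free_at = 0     # cycle at which the memory port (place Mem) frees up
    busy = 0            # cycles in which Trun issues an instruction
    t = 0
    while t < cycles:
        # Tmem: finished accesses return their threads to Ready
        ready += sum(1 for c in pending if c <= t)
        pending = [c for c in pending if c > t]
        if ready == 0:
            t += 1      # no thread available: the issue slot is wasted
            continue
        ready -= 1      # Tsel: select a thread for execution
        while t < cycles:
            busy += 1   # Trun: one instruction enters the pipeline
            t += 1
            r = rng.random()
            if r < ps1:
                t += 1  # Td1: one-cycle stall
            elif r < ps1 + ps2:
                t += 2  # Td2 then Td1: two-cycle stall
            if rng.random() < 1.0 / lt:
                break   # long-latency operation ends the runlength
        # Tend: queue the memory request, then Tcsw switches context
        mem_free_at = max(mem_free_at, t) + tm
        pending.append(mem_free_at)
        t += tcs
    return busy / t     # fraction of cycles with an instruction issued
```

With *nt* = 1 this estimate comes out near 0.5, and it grows as more threads are added and the memory latency is hidden, which is the effect the following sections quantify.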

The choice probability associated with *Tend* determines the runlength of a thread, *lt*, i.e., the average number of instructions between two consecutive long–latency operations; if this choice probability is equal to 0.1, the runlength is equal to 10, if it is equal to 0.2, the runlength is 5, and so on.
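The reciprocal relation between the choice probability and the runlength can be stated directly (a trivial check, using the probabilities from the text):

```python
# Runlength as the reciprocal of the long-latency choice probability at Tend
# (the number of instructions between long-latency operations is geometric).
def runlength(p_long):
    return 1.0 / p_long

assert runlength(0.1) == 10  # choice probability 0.1 -> runlength 10
assert runlength(0.2) == 5   # choice probability 0.2 -> runlength 5
```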

*Proc*, which is connected to *Trun*, controls the number of pipelines. If the processor contains just one instruction execution pipeline, the initial marking assigns a single token to *Proc* as only one instruction can be issued in each processor cycle. In order to model a processor with two (identical) pipelines, two initial tokens are needed in *Proc*, and so on.

The number of memory ports, i.e., the number of simultaneous accesses to memory, is controlled by the initial marking of *Mem*; for a single–port memory, the initial marking assigns just a single token to *Mem*, for a dual–port memory, two tokens are assigned to *Mem*, and so on.

In a similar way, the number of simultaneous threads (or instruction issue units) is controlled by the initial marking of *Next*.
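The three markings just described can be collected in one place. The helper below is a hypothetical encoding, not part of the original model; only the place names (*Next*, *Proc*, *Mem*) are taken from Fig.1.

```python
# Initial marking for a processor with ns issue units, np execution
# pipelines and a given number of memory ports; the nt thread tokens
# are added to place Ready separately.
def initial_marking(ns, np, memory_ports):
    return {"Next": ns, "Proc": np, "Mem": memory_ports}

# a 3-2 processor with a single-port memory
m = initial_marking(ns=3, np=2, memory_ports=1)
```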

Memory hierarchy can be incorporated into the model shown in Fig.1 by refining the representation of memory. In particular, levels of memory hierarchy can be introduced by replacing the subnet *Tmem*–*Mem* by a number of subnets, one subnet for each level of the hierarchy, and adding a free–choice structure which randomly selects the submodel according to probabilities describing the use of the hierarchical memory.

| symbol | parameter | value |
|--------|-----------|-------|
| *nt*  | number of available threads | 1, ..., 10 |
| *np*  | number of execution pipelines | 1, 2, ... |
| *ns*  | number of simultaneous threads | 1, 2, 3, ... |
| *lt*  | thread runlength | 10 |
| *tm*  | average memory access time | 5 |
| *tcs* | context switching time | 1, 3 |
| *ps*1 | prob. of one–cycle pipeline stall | 0.2 |
| *ps*2 | prob. of two–cycle pipeline stall | 0.1 |

Timed Petri Nets in Performance Exploration of Simultaneous Multithreading 305

**Table 1.** Simultaneous multithreading – modeling parameters and their typical values

**Figure 3.** Processor (-o-) and memory (-x-) utilization for a 1-1 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*1 = 0.2, *ps*2 = 0.1.

**Figure 4.** Processor (-o-) and memory (-x-) utilization for a 2-1 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*1 = 0.2, *ps*2 = 0.1.

The value of the processor utilization for *nt* = 1 (i.e., for one thread) can be derived from the (average) number of unused instruction issuing slots. Since the probability of a single–cycle stall is 0.2, and the probability of a two–cycle stall is 0.1, on average 40 % of issuing slots remain unused because of pipeline stalls (for all instructions except the first one in each
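The 40 % figure follows directly from the stall probabilities, and a rough single-thread estimate can be built on it. The calculation below is only a back-of-envelope sketch; the assumption that the one-cycle context switch is hidden within the memory access is mine, not the text's.

```python
ps1, ps2 = 0.2, 0.1   # stall probabilities (Table 1)
lt, tm = 10, 5        # runlength and average memory access time

# average wasted issue slots per instruction: 0.2*1 + 0.1*2 = 0.4 (40 %)
stall_slots = ps1 * 1 + ps2 * 2

# with one thread, a runlength of lt instructions occupies about
# lt * (1 + stall_slots) cycles, followed by tm cycles waiting for memory
# (assuming the context switch is hidden within the memory access)
utilization = lt / (lt * (1 + stall_slots) + tm)   # about 0.53
```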

**Figure 2.** Petri net model of a multithreaded processor with a two–level memory.

Such a refinement, for two levels of memory (in addition to the first–level cache), is shown in Fig.2, where *Mreq* is a free–choice place selecting either level–1 (submodel *Mem*–*Tmem1*) or level–2 (submodel *Mem*–*Tmem2*). More levels of memory can easily be added in a similar way, if needed.

The effects of a memory hierarchy can be compared with those of a uniform, non–hierarchical memory by selecting the parameters in such a way that the average access time of the hierarchical model (Fig.2) is equal to the access time of the non–hierarchical model (Fig.1).
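One way to do this matching: if a fraction *p*1 of long-latency requests is served at level 1 and the remainder at level 2, the averages match when *tm* = *p*1·*tm*1 + (1 − *p*1)·*tm*2. The numeric values below are purely illustrative, not taken from the chapter:

```python
tm = 5          # target average access time (non-hierarchical model, Fig.1)
p1, tm1 = 0.9, 2   # hypothetical level-1 fraction and level-1 access time

# level-2 access time that makes the averages match (about 32 here)
tm2 = (tm - p1 * tm1) / (1 - p1)

# the hierarchical model then has the same average access time
average = p1 * tm1 + (1 - p1) * tm2
```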

Processors with different numbers of instruction issue units and instruction execution pipelines can be described by a pair of numbers, the first number denoting the number of instruction issue units, and the second – the number of instruction execution pipelines. In this sense a 3-2 processor is a (multithreaded) processor with 3 instruction issue units and 2 instruction execution pipelines.

For convenience, all temporal properties are expressed in processor cycles, so the occurrence times of *Trun*, *Td1* and *Td2* are all equal to 1 (processor cycle), the occurrence time of *Tcsw* is equal to the number of processor cycles needed for a context switch (which is equal to 1 for many of the following performance analyses), and the occurrence time of *Tmem* is the average number of processor cycles needed for a long–latency access to memory.
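These timing conventions amount to a small table of occurrence times; the dictionary below simply restates them (with *tcs* = 1 and *tm* = 5, the typical values from Table 1):

```python
tcs, tm = 1, 5   # context switching time and average memory access time

# occurrence times, in processor cycles, of the timed transitions of Fig.1
occurrence_time = {"Trun": 1, "Td1": 1, "Td2": 1, "Tcsw": tcs, "Tmem": tm}
```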

The main modeling parameters and their typical values are shown in Table 1.
