execution on superscalar pipelines are not so impressive, and it is difficult to obtain a speedup greater than 2 using 4 or 8-way superscalar issue [7]. Moreover, in modern processors, memory latencies are so long that out–of–order processors require very large instruction windows to tolerate them.

Although ultra–wide out-of-order superscalar processors were predicted to be the architecture of one-billion-transistor chips, with a single 16 or 32-wide-issue processing core and huge branch predictors to sustain good instruction-level parallelism, the industry has not been moving toward the wide–issue superscalar model [8]. Design complexity and power efficiency are directing it instead toward narrow–issue, high–frequency cores and multithreaded processors. According to [6]: "Clearly something is very wrong with the out–of–order approach to concurrency if this extravagant consumption of on–chip resources is only providing a practical limit on speedup of about 2."

Instruction–level multithreading [9], [10], [1] is a technique of tolerating long–latency memory accesses by switching to another thread (if one is available for execution) rather than waiting for the completion of the long–latency operation. If different threads are associated with different sets of processor registers, switching from one thread to another (called "context switching") can be done very efficiently [11], in one or just a few processor cycles.

In simultaneous multithreading [12], [6] several threads can issue instructions at the same time. If a processor contains several functional units or more than one instruction execution pipeline, the instructions can be issued simultaneously; if there is only one pipeline, only one instruction can be issued in each processor cycle, but the (simultaneous) threads complement each other in the sense that whenever one thread cannot issue an instruction (because of pipeline stalls or context switching), an instruction is issued from another thread, eliminating 'empty' instruction slots and increasing the overall performance of the processor. Simultaneous multithreading combines hardware features of wide-issue superscalar processors and multithreaded processors [12]. From superscalar processors it inherits the ability to issue multiple instructions in each cycle; from multithreaded processors it takes hardware state for several threads. The result is a processor that can issue multiple instructions from multiple threads in each processor cycle, achieving better performance for a variety of workloads.

The main objective of this work is to study the performance of simultaneously multithreaded processors in order to determine how effective simultaneous multithreading can be. In particular, an indication is sought whether simultaneous multithreading can overcome the out–of–order "barrier" on speedup (equal to 2 [13]). A timed Petri net [14] model of multithreaded processors at the instruction execution level is developed, and performance results are obtained by event–driven simulation of this model. Since the model is rather simple, the simulation results are verified (with respect to accuracy) by state–space–based performance analysis (for those combinations of modeling parameters for which the state space remains reasonably small).

Section 2 recalls basic concepts of timed Petri nets which are used in this study. A model of simultaneous multithreading, used for performance exploration, is presented in Section 3. Section 4 discusses the results obtained by event–driven simulation of the model introduced in Section 3. Section 5 contains concluding remarks, including a short comparison of simulation and analytical results.

**2. Timed Petri nets**

A marked place/transition Petri net M is typically defined [15], [16] as M = (N, *m*0), where the structure N is a bipartite directed graph, N = (*P*, *T*, *A*), with a set of places *P*, a set of transitions *T*, a set of directed arcs *A* connecting places with transitions and transitions with places, *A* ⊆ *T* × *P* ∪ *P* × *T*, and the initial marking function *m*0 which assigns nonnegative numbers of tokens to places of the net, *m*0 : *P* → {0, 1, ...}. Marked nets can equivalently be defined as M = (*P*, *T*, *A*, *m*0).
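For concreteness, the components of M = (*P*, *T*, *A*, *m*0) map directly onto a small data structure. The sketch below uses hypothetical names (it is not based on any particular Petri net tool) and adds the standard enabledness test, which the following paragraphs rely on:

```python
# A minimal encoding of a marked net M = (P, T, A, m0).
# All names (MarkedNet, inputs, enabled) are illustrative, not from a library.
from dataclasses import dataclass

@dataclass
class MarkedNet:
    places: set        # P
    transitions: set   # T
    arcs: set          # A ⊆ (P × T) ∪ (T × P), as (source, target) pairs
    marking: dict      # m0 : P → {0, 1, ...}

    def inputs(self, t):
        """Input places of transition t: all p with (p, t) in A."""
        return {p for (p, tt) in self.arcs if tt == t and p in self.places}

    def enabled(self, t):
        """t is enabled if every input place holds at least one token."""
        return all(self.marking[p] >= 1 for p in self.inputs(t))

# A two-place, one-transition net: p1 --t--> p2, with one token in p1.
net = MarkedNet(
    places={"p1", "p2"},
    transitions={"t"},
    arcs={("p1", "t"), ("t", "p2")},
    marking={"p1": 1, "p2": 0},
)
print(net.enabled("t"))  # True: the only input place p1 is marked
```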

A place *p* is an input place of a transition *t* if the (directed) arc (*p*, *t*) is in the set *A*. A place is shared if it is an input place to more than one transition. If a net does not contain shared places, the net is (structurally) conflict–free, otherwise the net contains conflicts. The simplest case of conflicts is known as a free–choice (or generalized free–choice) structure; a shared place is (generalized) free–choice if all transitions sharing it have identical sets of input places. A net is free–choice if all its shared places are free–choice. The transitions sharing a free–choice place constitute a free–choice class of transitions. For each marking function, and each free–choice class of transitions, either all transitions in this class are enabled or none of them is. It is assumed that the selection of transitions for firing within each free–choice class is a random process which can be described by "choice probabilities" assigned to (free–choice) transitions. Moreover, it is usually assumed that the random variables describing choice probabilities in different free–choice classes are independent.
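The free–choice condition is purely structural, so it can be checked directly from the arcs. The following sketch (with illustrative helper names, assuming place-to-transition arcs are given as (place, transition) pairs) tests whether a shared place is (generalized) free–choice:

```python
# Structural free-choice test: a shared place p is (generalized) free-choice
# iff all transitions sharing p have identical sets of input places.
def input_places(arcs, t):
    return frozenset(p for (p, tt) in arcs if tt == t)

def sharing_transitions(arcs, p):
    return {t for (pp, t) in arcs if pp == p}

def is_free_choice_place(arcs, p):
    input_sets = {input_places(arcs, t) for t in sharing_transitions(arcs, p)}
    return len(input_sets) <= 1   # all identical (or p is not shared at all)

# p shared by t1 and t2, both with inputs {p}: a free-choice place.
arcs_fc = {("p", "t1"), ("p", "t2")}
# t2 has an extra input q, so p is a conflict place, not free-choice.
arcs_conf = {("p", "t1"), ("p", "t2"), ("q", "t2")}
print(is_free_choice_place(arcs_fc, "p"))    # True
print(is_free_choice_place(arcs_conf, "p"))  # False
```

By this test, either all transitions of a free–choice class are enabled by a marking or none is, since they have the same input places.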

All places which are neither conflict–free nor free–choice are conflict places. Transitions sharing conflict places are (directly or indirectly) potentially in conflict (i.e., whether they are in conflict depends upon the marking function; for different marking functions the sets of transitions which are in conflict can be different). All transitions which are potentially in conflict constitute a conflict class. All conflict classes are disjoint. It is assumed that conflicts are resolved by random choices of occurrences among the conflicting transitions, and that these random choices are independent in different conflict classes.

In timed nets [14], occurrence times are associated with transitions, and transition occurrences are real–time events, i.e., tokens are removed from input places at the beginning of the occurrence period, and they are deposited to the output places at the end of this period. All occurrences of enabled transitions are initiated in the same instants of time in which the transitions become enabled (although some enabled transitions may not initiate their occurrences). If, during the occurrence period of a transition, the transition becomes enabled again, a new, independent occurrence can be initiated, which will overlap with the other occurrence(s). There is no limit on the number of simultaneous occurrences of the same transition (sometimes this is called infinite occurrence semantics). Similarly, if a transition is enabled "several times" (i.e., it remains enabled after initiating an occurrence), it may start several independent occurrences in the same time instant.

More formally, a timed Petri net is a triple, T = (M, *c*, *f*), where M is a marked net, *c* is a choice function which assigns choice probabilities to free–choice classes of transitions or relative frequencies of occurrences to conflicting transitions (to non–conflict transitions *c* simply assigns 1.0), *c* : *T* → **R**[0,1], where **R**[0,1] is the set of real numbers in the interval [0,1], and *f* is a timing function which assigns an (average) occurrence time to each transition of the net, *f* : *T* → **R**+, where **R**+ is the set of nonnegative real numbers.
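The functions *c* and *f* can be represented as plain per-transition tables. The sketch below (transition names and probability values are made up for illustration) shows how a random selection within one free–choice class, weighted by *c*, might be drawn during simulation:

```python
# Illustrative encoding of c and f over a single free-choice class
# {t1, t2, t3}; the probabilities and times below are invented.
import random

c = {"t1": 0.7, "t2": 0.2, "t3": 0.1}   # choice probabilities, sum to 1.0
f = {"t1": 1.0, "t2": 1.0, "t3": 2.0}   # (deterministic) occurrence times

def choose(free_choice_class, rng):
    """Select one transition of an enabled free-choice class using c."""
    ts = list(free_choice_class)
    return rng.choices(ts, weights=[c[t] for t in ts])[0]

rng = random.Random(42)
picks = [choose(["t1", "t2", "t3"], rng) for _ in range(10_000)]
print(picks.count("t1") / len(picks))   # close to 0.7
```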

The occurrence times of transitions can be either deterministic or stochastic (i.e., described by some probability distribution function); in the first case, the corresponding timed nets are referred to as D–timed nets [18]; in the second, for the (negative) exponential distribution of firing times, the nets are called M–timed nets (Markovian nets [17]). In both cases, the concepts of state and state transitions have been formally defined and used in the derivation of different performance characteristics of the model [14]. Only D–timed Petri nets are used in this paper.

choice probability *ps*1) represents a single-cycle pipeline stall (modeled by *Td1*), and *Tst2* (with the choice probability *ps*2) represents a two–cycle pipeline stall (*Td2* and then *Td1*); other pipeline stalls could be represented in a similar way, if needed.

If a long–latency operation is detected in the issued instruction, *Tend* initiates two concurrent actions: (i) context switching, performed by enabling an occurrence of *Tcsw*, after which a new thread is selected for execution (if one is available), and (ii) a memory access request is entered into *Mreq*, the memory queue; after accessing the memory (transition *Tmem*), the thread, suspended for the duration of the memory access, becomes "ready" again and joins the pool of threads *Ready*. *Tmem* will typically represent a cache miss (with all its consequences); cache hits (at the first-level cache memory) are not considered long-latency operations.

The choice probability associated with *Tend* determines the runlength of a thread, ℓ*t*, i.e., the average number of instructions between two consecutive long–latency operations; if this choice probability is equal to 0.1, the runlength is equal to 10, if it is equal to 0.2, the runlength is 5, and so on.

*Proc*, which is connected to *Trun*, controls the number of pipelines. If the processor contains just one instruction execution pipeline, the initial marking assigns a single token to *Proc*, as only one instruction can be issued in each processor cycle. In order to model a processor with two (identical) pipelines, two initial tokens are needed in *Proc*, and so on.

The number of memory ports, i.e., the number of simultaneous accesses to memory, is controlled by the initial marking of *Mem*; for a single-port memory, the initial marking assigns just a single token to *Mem*, for a dual-port memory, two tokens are assigned to *Mem*, and so on. In a similar way, the number of simultaneous threads (or instruction issue units) is controlled by the initial marking of *Next*.

Memory hierarchy can be incorporated into the model shown in Fig.1 by refining the representation of memory. In particular, levels of memory hierarchy can be introduced by replacing the subnet *Tmem*–*Mem* by a number of subnets, each subnet for one level of the hierarchy, and adding a free–choice structure which randomly selects the submodel according

**Figure 1.** Petri net model of a multithreaded processor.

The firing times of some transitions may be equal to zero, which means that the firings are instantaneous; all such transitions are called *immediate* while the others are called *timed*. Since the immediate transitions have no tangible effect on the (timed) behavior of the model, it is convenient to split the set of transitions into two parts, the set of immediate and the set of timed transitions, and to fire the (enabled) immediate transitions first; only when no more immediate transitions are enabled are the firings of (enabled) timed transitions initiated (still in the same instant of time). It should be noted that such a convention effectively introduces a priority of immediate transitions over the timed ones, so conflicts between immediate and timed transitions should be avoided. Consequently, the free–choice and conflict classes of transitions must be "uniform", i.e., all transitions in each such class must be either immediate or timed, but not both.
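This convention can be sketched in a few lines; the helper names and the two-transition example net below are hypothetical, meant only to show immediate transitions being exhausted before any timed transition is started:

```python
# Fire enabled immediate transitions (occurrence time 0.0) to a fixpoint,
# then report which timed transitions are enabled and may start occurrences.
def fire_immediate_first(enabled, occurrence_time, fire):
    while True:
        imm = [t for t in enabled() if occurrence_time[t] == 0.0]
        if not imm:
            break
        fire(imm[0])
    return [t for t in enabled() if occurrence_time[t] > 0.0]

# Tiny example net: t_imm moves a token a -> b instantly; t_timed (2 cycles)
# then consumes the token in b.
occurrence_time = {"t_imm": 0.0, "t_timed": 2.0}
pre  = {"t_imm": ["a"], "t_timed": ["b"]}
post = {"t_imm": ["b"], "t_timed": ["c"]}
marking = {"a": 1, "b": 0, "c": 0}

def enabled():
    return [t for t in occurrence_time
            if all(marking[p] >= 1 for p in pre[t])]

def fire(t):
    for p in pre[t]:
        marking[p] -= 1
    for p in post[t]:
        marking[p] += 1

timed_to_start = fire_immediate_first(enabled, occurrence_time, fire)
print(timed_to_start)  # ['t_timed']: t_imm moved the token from a to b first
```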

Performance analysis of net models can be based on their behavior (i.e., the set of reachable states) or on the structure of the net; the former is called *reachability analysis* and the latter *structural analysis*. For reachability analysis, the state space of the analyzed model must be finite and reasonably small, while for structural analysis the model must satisfy a number of structural conditions. However, since timed Petri net models are discrete–event systems, their analysis can also be based on discrete–event simulation, which imposes very few restrictions on the class of analyzed models. All performance characteristics of simultaneous multithreading presented in Section 4 are obtained by event–driven simulation [19] of the timed Petri net models shown in the next section.
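The core of such an event–driven simulation of a D–timed net can be outlined briefly. The sketch below is illustrative only (it is not the simulator used in this study): tokens are removed from input places when an occurrence starts and deposited to output places *f*(*t*) time units later, with pending completions kept in an event list ordered by time:

```python
# Minimal event-driven simulation loop for a D-timed net (illustrative).
import heapq

def simulate(pre, post, f, marking, horizon):
    now, events, firings = 0.0, [], 0

    def start_enabled():
        # Start all occurrences possible at the current instant
        # (infinite occurrence semantics: repeat until nothing is enabled).
        nonlocal firings
        started = True
        while started:
            started = False
            for t in f:
                if all(marking[p] >= 1 for p in pre[t]):
                    for p in pre[t]:
                        marking[p] -= 1          # tokens removed at start
                    heapq.heappush(events, (now + f[t], t))
                    firings += 1
                    started = True

    start_enabled()
    while events and now <= horizon:
        now, t = heapq.heappop(events)           # next completion event
        if now > horizon:
            break
        for p in post[t]:
            marking[p] += 1                      # tokens deposited at end
        start_enabled()
    return firings

# One-transition cycle: Trun consumes and reproduces a single Proc token,
# taking one processor cycle, so it starts once per cycle.
pre, post, f = {"Trun": ["Proc"]}, {"Trun": ["Proc"]}, {"Trun": 1.0}
n = simulate(pre, post, f, {"Proc": 1}, horizon=10.0)
print(n)  # 11 occurrences started, at times 0, 1, ..., 10
```

The event list makes simulated time advance directly from one event to the next, which is what keeps this approach practical for models whose state spaces are too large for reachability analysis.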
