**1. Introduction**

In modern computer systems, the performance of the whole system is increasingly often limited by the performance of its memory subsystem [1]. Due to continuous progress in manufacturing technologies, the performance of processors has been doubling every 18 months (the so–called Moore's law [2]), while the performance of memory chips has been improving by only about 10% per year [1], creating a "performance gap" between processor performance and the required memory bandwidth [3]. More detailed studies have shown that the number of processor cycles required to access main memory doubles approximately every six years [4]. In effect, the performance of applications increasingly depends on the performance of the system's memory hierarchy, and it is not unusual for processors to spend as much as 60% of their time waiting for the completion of memory operations [4].

Memory hierarchies, and in particular multi–level cache memories, have been introduced to reduce the effective latency of memory accesses [5]. Cache memories provide efficient access to information as long as the information is available at the lower levels of the memory hierarchy; occasionally, however, long–latency memory operations are needed to transfer information from the higher levels of the hierarchy to the lower ones. Extensive research has focused on reducing and tolerating these long memory access latencies.

Techniques which tolerate long–latency memory accesses include out–of–order execution of instructions and instruction–level multithreading. The idea of out–of–order execution [1] is that, instead of waiting for the completion of a long–latency operation, the processor executes instructions which (logically) follow the long–latency one but do not depend upon its result. Since out–of–order execution exploits instruction–level concurrency in the executed sequential instruction stream, it conveniently maintains code–base compatibility [6]. In effect, the instruction stream is dynamically decomposed into micro-threads, which are scheduled and synchronized at no cost in terms of executing additional instructions. Although this is desirable, speedups using out–of–order
execution on superscalar pipelines are not so impressive, and it is difficult to obtain a speedup greater than 2 using 4 or 8-way superscalar issue [7]. Moreover, in modern processors, memory latencies are so long that out–of–order processors require very large instruction windows to tolerate them.

Although ultra–wide out–of–order superscalar processors were predicted to be the architecture of one–billion–transistor chips, with a single 16 or 32–wide–issue processing core and huge branch predictors to sustain good instruction–level parallelism, the industry has not been moving toward the wide–issue superscalar model [8]. Design complexity and power efficiency direct the industry toward narrow–issue, high–frequency cores and multithreaded processors. According to [6]: "Clearly something is very wrong with the out–of–order approach to concurrency if this extravagant consumption of on–chip resources is only providing a practical limit on speedup of about 2."

Instruction–level multithreading [9], [10], [1] is a technique for tolerating long–latency memory accesses by switching to another thread (if one is available for execution) rather than waiting for the completion of the long–latency operation. If different threads are associated with different sets of processor registers, switching from one thread to another (called "context switching") can be done very efficiently [11], in one or just a few processor cycles.

In simultaneous multithreading [12], [6], several threads can issue instructions at the same time. If a processor contains several functional units or more than one instruction execution pipeline, the instructions can be issued simultaneously; if there is only one pipeline, only one instruction can be issued in each processor cycle, but the (simultaneous) threads complement each other in the sense that whenever one thread cannot issue an instruction (because of pipeline stalls or context switching), an instruction is issued from another thread, eliminating "empty" instruction slots and increasing the overall performance of the processor.

Simultaneous multithreading combines hardware features of wide–issue superscalar processors and multithreaded processors [12]. From superscalar processors it inherits the ability to issue multiple instructions in each cycle; from multithreaded processors it takes hardware state for several threads. The result is a processor that can issue multiple instructions from multiple threads in each processor cycle, achieving better performance for a variety of workloads.

The main objective of this work is to study the performance of simultaneously multithreaded processors in order to determine how effective simultaneous multithreading can be. In particular, an indication is sought whether simultaneous multithreading can overcome the speedup "barrier" of out–of–order execution (equal to 2 [13]). A timed Petri net [14] model of multithreaded processors at the instruction execution level is developed, and performance results for this model are obtained by event–driven simulation. Since the model is rather simple, the simulation results are verified (with respect to accuracy) by state–space–based performance analysis (for those combinations of modeling parameters for which the state space remains reasonably small).

Section 2 recalls basic concepts of timed Petri nets which are used in this study. A model of simultaneous multithreading, used for performance exploration, is presented in Section 3. Section 4 discusses the results obtained by event–driven simulation of the model introduced in Section 3. Section 5 contains concluding remarks, including a short comparison of simulation and analytical results.

**2. Timed Petri nets**

A marked place/transition Petri net M is typically defined [15], [16] as M = (N, *m*<sub>0</sub>), where the structure N is a bipartite directed graph, N = (*P*, *T*, *A*), with a set of places *P*, a set of transitions *T*, a set of directed arcs *A* connecting places with transitions and transitions with places, *A* ⊆ *T* × *P* ∪ *P* × *T*, and the initial marking function *m*<sub>0</sub> which assigns nonnegative numbers of tokens to places of the net, *m*<sub>0</sub> : *P* → {0, 1, ...}. Marked nets can be equivalently defined as M = (*P*, *T*, *A*, *m*<sub>0</sub>).
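
These definitions translate directly into a simple data structure. The following Python fragment is a minimal, purely illustrative sketch (it is not part of the chapter's model); the example net and all identifiers are assumptions made only for illustration.

```python
# Illustrative sketch of a marked place/transition net M = (P, T, A, m0);
# the example net and all names are assumptions, not the chapter's code.

from collections import Counter

P = {"p1", "p2", "p3"}                       # places
T = {"t1", "t2"}                             # transitions
A = {("p1", "t1"), ("t1", "p2"),             # directed arcs (place-transition
     ("p2", "t2"), ("t2", "p3")}             # and transition-place pairs)
m0 = Counter({"p1": 1})                      # initial marking: one token in p1

def inputs(t):
    """Input places of transition t, i.e., places p with (p, t) in A."""
    return {p for (p, s) in A if s == t and p in P}

def outputs(t):
    """Output places of transition t, i.e., places p with (t, p) in A."""
    return {p for (s, p) in A if s == t and p in P}

def enabled(t, m):
    """A transition is enabled by marking m if all its input places hold tokens."""
    return all(m[p] > 0 for p in inputs(t))

print([t for t in sorted(T) if enabled(t, m0)])     # -> ['t1']
```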

A place *p* is an input place of a transition *t* if the (directed) arc (*p*, *t*) is in the set *A*. A place is shared if it is an input place to more than one transition. If a net does not contain shared places, the net is (structurally) conflict–free, otherwise the net contains conflicts. The simplest case of conflicts is known as a free–choice (or generalized free–choice) structure; a shared place is (generalized) free–choice if all transitions sharing it have identical sets of input places. A net is free–choice if all its shared places are free–choice. The transitions sharing a free–choice place constitute a free–choice class of transitions. For each marking function, and each free–choice class of transitions, either all transitions in this class are enabled or none of them is. It is assumed that the selection of transitions for firing within each free–choice class is a random process which can be described by "choice probabilities" assigned to (free–choice) transitions. Moreover, it is usually assumed that the random variables describing choice probabilities in different free–choice classes are independent.
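
Continuing the illustrative sketch above, the (generalized) free–choice condition and the induced free–choice classes can be expressed as follows; the function names are assumptions.

```python
# Continuing the sketch above: shared places, the (generalized) free-choice
# condition, and the induced free-choice classes of transitions.

def sharing_transitions(p):
    """All transitions that have p as an input place."""
    return {t for (q, t) in A if q == p and t in T}

def is_free_choice(p):
    """A shared place is (generalized) free-choice if all transitions
    sharing it have identical sets of input places."""
    return len({frozenset(inputs(t)) for t in sharing_transitions(p)}) == 1

def free_choice_classes():
    """Transitions sharing a free-choice place form a free-choice class;
    within an enabled class, one transition is chosen at random according
    to the choice probabilities."""
    return {frozenset(sharing_transitions(p))
            for p in P
            if len(sharing_transitions(p)) > 1 and is_free_choice(p)}
```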

All places which are neither conflict–free nor free–choice are conflict places. Transitions sharing conflict places are (directly or indirectly) potentially in conflict, i.e., whether they are actually in conflict depends upon the marking function; for different marking functions the sets of transitions which are in conflict can be different. All transitions which are potentially in conflict constitute a conflict class. All conflict classes are disjoint. It is assumed that conflicts are resolved by random choices of occurrences among the conflicting transitions, and that these random choices are independent in different conflict classes.

In timed nets [14], occurrence times are associated with transitions, and transition occurrences are real–time events, i.e., tokens are removed from input places at the beginning of the occurrence period, and they are deposited to the output places at the end of this period. All occurrences of enabled transitions are initiated in the same instants of time in which the transitions become enabled (although some enabled transitions may not initiate their occurrences). If, during the occurrence period of a transition, the transition becomes enabled again, a new, independent occurrence can be initiated, which will overlap with the other occurrence(s). There is no limit on the number of simultaneous occurrences of the same transition (sometimes this is called infinite occurrence semantics). Similarly, if a transition is enabled "several times" (i.e., it remains enabled after initiating an occurrence), it may start several independent occurrences in the same time instant.
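
In the running sketch, this semantics corresponds to two events per occurrence: input tokens are removed when the occurrence is initiated and output tokens are deposited when it ends, so overlapping occurrences simply coexist as separately scheduled events. The occurrence times and the event list below are assumptions for the toy net.

```python
# Continuing the sketch: tokens are removed at the beginning of an occurrence
# and deposited at its end; any number of occurrences may overlap (infinite
# occurrence semantics), each represented by its own scheduled event.

import heapq

f = {"t1": 1.0, "t2": 2.5}        # (average) occurrence times of transitions

def initiate_occurrence(t, m, now, events):
    """Start an occurrence of t at time 'now': consume input tokens and
    schedule the end of this occurrence at time now + f[t]."""
    for p in inputs(t):
        m[p] -= 1
    heapq.heappush(events, (now + f[t], t))      # end-of-occurrence event

def complete_occurrence(t, m):
    """Finish an occurrence of t: deposit tokens in its output places."""
    for p in outputs(t):
        m[p] += 1
```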

More formally, a timed Petri net is a triple, T = (M, *c*, *f*), where M is a marked net, *c* is a choice function which assigns choice probabilities to free–choice classes of transitions or relative frequencies of occurrences to conflicting transitions (for non–conflict transitions *c* simply assigns 1.0), *c* : *T* → **R**<sub>0,1</sub>, where **R**<sub>0,1</sub> is the set of real numbers in the interval [0,1], and *f* is a timing function which assigns an (average) occurrence time to each transition of the net, *f* : *T* → **R**<sup>+</sup>, where **R**<sup>+</sup> is the set of nonnegative real numbers.
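
The choice function can be illustrated in the same way: within an enabled free–choice class (or a conflict class), one transition is selected at random with probability proportional to the values of *c*. The weights below are assumed values for the toy net.

```python
# Continuing the sketch: c assigns choice probabilities (free-choice classes)
# or relative frequencies (conflict classes); a random selection within an
# enabled class can use these values as weights.

import random

c = {"t1": 1.0, "t2": 1.0}        # choice probabilities / relative frequencies

def choose(transition_class):
    """Select one transition from an enabled free-choice or conflict class
    with probability proportional to c."""
    ts = sorted(transition_class)
    return random.choices(ts, weights=[c[t] for t in ts])[0]
```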

The occurrence times of transitions can be either deterministic or stochastic (i.e., described by some probability distribution function); in the first case, the corresponding timed nets are referred to as deterministic (or D–timed) nets, and in the second case, for exponentially distributed occurrence times, as M–timed (Markovian) nets.
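
Both cases fit the sketch with a one–line change: a deterministic occurrence time simply returns *f*(*t*), while a stochastic one is sampled from a distribution, e.g., the exponential distribution with mean *f*(*t*). This is illustrative only.

```python
# Continuing the sketch: deterministic versus stochastic occurrence times.

import random

def occurrence_time(t, deterministic=True):
    """Duration of one occurrence of transition t: either the constant f[t]
    (deterministic case) or an exponentially distributed sample with mean
    f[t] (Markovian case)."""
    if deterministic:
        return f[t]
    return random.expovariate(1.0 / f[t])
```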
