…to probabilities describing the use of the hierarchical memory. Such a refinement, for two levels of memory (in addition to the first-level cache), is shown in Fig.2, where *Mreq* is a free–choice place selecting either level–1 (submodel *Mem*–*Tmem1*) or level–2 (submodel *Mem*–*Tmem2*). More levels of memory can easily be added in a similar way, if needed.

**Figure 2.** Petri net model of a multithreaded processor with a two–level memory.

The effects of a memory hierarchy can be compared with a uniform, non–hierarchical memory by selecting the parameters in such a way that the average access time of the hierarchical model (Fig.2) is equal to the access time of the non–hierarchical model (Fig.1).

Processors with different numbers of instruction issue units and instruction execution pipelines can be described by a pair of numbers, the first denoting the number of instruction issue units and the second the number of instruction execution pipelines. In this sense, a 3-2 processor is a (multithreaded) processor with 3 instruction issue units and 2 instruction execution pipelines.

For convenience, all temporal properties are expressed in processor cycles, so the occurrence times of *Trun*, *Td1* and *Td2* are all equal to 1 (processor cycle), the occurrence time of *Tcsw* is equal to the number of processor cycles needed for a context switch (equal to 1 in many of the following performance analyses), and the occurrence time of *Tmem* is the average number of processor cycles needed for a long–latency access to memory.

The main modeling parameters and their typical values are shown in Table 1.

| symbol | parameter | typical value |
|---|---|---|
| *nt* | number of available threads | 1–10 |
| *ns* | number of simultaneous threads (instruction issue units) | 1–5 |
| *np* | number of instruction execution pipelines | 1–3 |
| *lt* | thread runlength | 10 |
| *tm* | memory access time | 5 |
| *tcs* | context switching time | 1 |
| *ps*<sup>1</sup> | probability of a single–cycle pipeline stall | 0.2 |
| *ps*<sup>2</sup> | probability of a two–cycle pipeline stall | 0.1 |

**Table 1.** Simultaneous multithreading – modeling parameters and their typical values

**4. Performance exploration**

The model developed in the previous section is evaluated for different combinations of modeling parameters. Performance results are obtained by event-driven simulation of timed Petri net models.

The utilization of the processor and memory, as a function of the number of available threads, for a 1-1 processor (i.e., a processor with a single instruction issue unit and a single instruction execution pipeline) is shown in Fig. 3.
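As a rough illustration of the event-driven evaluation, the behavior captured by the 1-1 model can be approximated by a short simulation. The following sketch (in Python, with illustrative names; it assumes a fixed runlength of *lt* instructions and a single FIFO memory, so it only approximates the timing of the Petri net model) interleaves runlengths, pipeline stalls, long-latency memory accesses, and context switches:

```python
import random

def simulate(nt, lt=10, tm=5, tcs=1, ps1=0.2, ps2=0.1,
             periods=20000, seed=42):
    """Crude event-driven approximation of the 1-1 processor model:
    the processor runs one thread at a time; after a runlength of lt
    instructions the thread issues a long-latency memory access (a
    single FIFO memory with service time tm) and the processor
    switches, in tcs cycles, to the next ready thread."""
    rng = random.Random(seed)
    clock = 0            # current processor cycle
    busy = 0             # cycles spent issuing instructions
    mem_free = 0         # cycle at which the memory becomes free
    mem_busy = 0         # cycles the memory spends servicing requests
    ready = [0] * nt     # cycle at which each thread becomes ready
    current = -1         # thread currently loaded in the processor
    for _ in range(periods):
        i = min(range(nt), key=ready.__getitem__)   # earliest ready thread
        clock = max(clock, ready[i])
        if i != current:                            # context switch
            clock += tcs
            current = i
        for n in range(lt):                         # one runlength
            clock += 1
            busy += 1
            if n > 0:                               # first instruction never stalls
                r = rng.random()
                if r < ps1:
                    clock += 1                      # single-cycle stall
                elif r < ps1 + ps2:
                    clock += 2                      # two-cycle stall
        start = max(clock, mem_free)                # FIFO memory queue
        mem_free = start + tm
        mem_busy += tm
        ready[i] = mem_free
    return busy / clock, mem_busy / max(clock, mem_free)
```

For the typical parameters this sketch reproduces the utilizations derived analytically below (about 0.54 for one thread and 0.68 for many threads); it is only an approximation, not a substitute for the net model.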


**Figure 3.** Processor (-o-) and memory (-x-) utilization for a 1-1 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

**Figure 4.** Processor (-o-) and memory (-x-) utilization for a 2-1 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

The value of the processor utilization for *nt* = 1 (i.e., for one thread) can be derived from the (average) number of unused instruction issuing slots. Since the probability of a single–cycle stall is 0.2 and the probability of a two–cycle stall is 0.1, on average 40 % of the issuing slots remain unused because of pipeline stalls (for all instructions except the first one in each thread). Processor utilization for one thread is thus *lt*/(*lt* + (*lt* − 1) ∗ 0.4 + *tm*) = 10/18.6 = 0.537, which corresponds very well with Fig.3. For a large number of threads, processor utilization is obtained similarly, but with the context switching time, *tcs*, replacing *tm*, so it is *lt*/(*lt* + (*lt* − 1) ∗ 0.4 + *tcs*) = 10/14.6 = 0.685.
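The same calculation can be written out directly; in this sketch (names are illustrative) the expected stall penalty per instruction is *ps*<sup>1</sup> ∗ 1 + *ps*<sup>2</sup> ∗ 2 = 0.4:

```python
def utilization(lt, stall, gap):
    # lt issued instructions, (lt - 1) of which may stall, followed by
    # a gap of tm cycles (one thread) or tcs cycles (many threads)
    return lt / (lt + (lt - 1) * stall + gap)

stall = 0.2 * 1 + 0.1 * 2           # expected stall cycles per instruction
u_one = utilization(10, stall, 5)   # one thread: 10/18.6 ≈ 0.538
u_many = utilization(10, stall, 1)  # many threads: 10/14.6 ≈ 0.685
```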


Timed Petri Nets in Performance Exploration of Simultaneous Multithreading 307


The utilization of the processor can be improved by introducing a second (simultaneous) thread which issues its instructions in the slots left unused by the first thread. Fig.4 shows the utilization of the processor and memory for a 2-1 processor, i.e., a processor with two (simultaneous) threads (or two instruction issue units) and a single pipeline. The utilization of the processor is improved by almost 50 % and is within a few percent of its upper bound (of 100 %).

The influence of pipeline stalls (probabilities *ps*<sup>1</sup> and *ps*<sup>2</sup>) is shown in Fig.5 and Fig.6. Fig.5 shows that the performance actually depends upon the average number of stall cycles rather than the specific values of *ps*<sup>1</sup> and *ps*<sup>2</sup>; in Fig.5 all pipeline stalls are single–cycle ones, so *ps*<sup>1</sup> = 0.4 and *ps*<sup>2</sup> = 0, and the results are practically the same as in Fig. 3.

**Figure 5.** Processor (-o-) and memory (-x-) utilization for a 1-1 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*<sup>1</sup> = 0.4, *ps*<sup>2</sup> = 0

**Figure 6.** Processor (-o-) and memory (-x-) utilization for a 1-1 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0

Fig. 6 shows the utilizations of the processor and memory for reduced probabilities of pipeline stalls, i.e., for *ps*<sup>1</sup> = 0.2 and *ps*<sup>2</sup> = 0. As expected, the utilizations are higher than in Fig.3 and Fig.5.


A more realistic model of memory, one which captures the idea of a two–level hierarchy, is shown in Fig.2. In order to compare the results of this model with Fig.3 and Fig.4, the parameters of the two–level memory are chosen in such a way that the average memory access time is equal to the memory access time in Fig.1 (where *tm* = 5). Let the two levels of memory have access times equal to 4 and 20, respectively; then the choice probabilities are equal to 15/16 and 1/16 for level–1 and level–2, respectively, and the average access time is:

$$4 \ast \frac{15}{16} + 20 \ast \frac{1}{16} = 5.$$
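The choice probabilities can be derived rather than guessed: for level access times *t*1 and *t*2 and a target average *tm*, solving *p*1 ∗ *t*1 + (1 − *p*1) ∗ *t*2 = *tm* gives *p*1 = (*t*2 − *tm*)/(*t*2 − *t*1). A small sketch (names are illustrative):

```python
from fractions import Fraction

def level1_probability(t1, t2, tm):
    # solve p1*t1 + (1 - p1)*t2 = tm for the level-1 choice probability
    return Fraction(t2 - tm, t2 - t1)

p1 = level1_probability(4, 20, 5)   # 15/16
p2 = 1 - p1                         # 1/16
average = p1 * 4 + p2 * 20          # recovers tm = 5
```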

The results for a 1-1 processor with a two–level memory are shown in Fig.7, and for a 2-1 processor in Fig.8.

**Figure 7.** Processor (-o-) and memory (-x-) utilization for a 1-1 processor with 2-level memory; *lt* = 10, *tm* = 4 + 20, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

**Figure 8.** Processor (-o-) and memory (-x-) utilization for a 2-1 processor with 2-level memory; *lt* = 10, *tm* = 4 + 20, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

The results in Fig.7 and Fig.8 are practically the same as in Fig.3 and Fig.4. This is the reason that the remaining results are shown for (equivalent) one-level memory models; the multiple levels of memory hierarchy apparently have no significant effect on the performance results.


The effects of simultaneous multithreading in a more complex processor, e.g., a processor with two instruction issue units and two instruction execution pipelines, i.e., a 2-2 processor, can be studied in a very similar way. The utilization of the processor (shown as the sum of the utilizations of both pipelines, with values ranging from 0 to 2) is shown in Fig.9.

**Figure 9.** Processor (-o-) and memory (-x-) utilization for a 2-2 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

When another instruction issue unit is added, the utilization increases by about 40 %, as shown in Fig.10.

**Figure 10.** Processor (-o-) and memory (-x-) utilization for a 3-2 processor; *lt* = 10, *tm* = 5, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

Further increasing the number of simultaneous threads (in a processor with 2 pipelines) provides only small performance improvements because the utilizations of both the processor and the memory are quite close to their limits. The performance of the system can be improved by increasing the number of pipelines, but then the memory becomes the system bottleneck, so its performance also needs to be improved, for example, by introducing dual ports (which allow two accesses to be handled at the same time). The performance of a 5-3 processor with a dual-port memory is shown in Fig.11 (the utilization of the processor is the sum of the utilizations of its 3 pipelines, so it ranges from 0 to 3).


**Figure 11.** Processor (-o-) and memory (-x-) utilization for a 5-3 processor with dual–port memory; *lt* = 10, *tm* = 5||2, *tcs* = 1, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

Fig.11 shows that, for 3 pipelines and 5 simultaneous threads, a number of available threads greater than 6 provides a speedup almost equal to 3.

System bottlenecks can be identified by comparing the service demands for the different components of the system (in this case, the memory and the pipelines); the component with the maximum service demand is the bottleneck because it is the first component to reach its utilization limit and thus to prevent any increase of the overall performance. For a single runlength (of all simultaneous threads) the total service demand for memory is equal to *ns* ∗ *tm*, while the service demand for each pipeline (assuming an ideal, uniform distribution of the load over the pipelines) is equal to *ns* ∗ *lt*/*np*. For a 4-2 processor, the service demands are equal (such a system is usually called "balanced"), so the utilizations of both the processor and the memory tend to their limits in a "synchronous" way. For a 5-3 processor with a dual-port memory, the service demand for the pipelines is greater than the service demand for the memory, so the number of pipelines could be increased (by one pipeline); for more than 4 pipelines, the memory again becomes the bottleneck.
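The service-demand comparison can be sketched as follows (illustrative names; the dual-port memory is approximated by halving the memory demand):

```python
def bottleneck(ns, np_, lt, tm, ports=1):
    # service demands per runlength of all ns simultaneous threads
    d_mem = ns * tm / ports    # total demand on the memory
    d_pipe = ns * lt / np_     # demand on each (ideally loaded) pipeline
    if d_mem > d_pipe:
        return "memory"
    if d_mem < d_pipe:
        return "pipelines"
    return "balanced"
```

`bottleneck(4, 2, 10, 5)` reports a balanced system, `bottleneck(5, 3, 10, 5, ports=2)` reports the pipelines as the bottleneck, and with more than 4 pipelines (e.g. `bottleneck(5, 5, 10, 5, ports=2)`) the memory becomes the bottleneck again, as observed above.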

Simultaneous multithreading is quite flexible with respect to context switching times because the (simultaneous) threads fill the instruction issuing slots which normally would remain empty during context switching. Fig.12 shows the utilization of the processor and memory in a 1-1 processor with *tcs* = 3, i.e., context switching time 3 times longer than in Fig.3. The reduction of the processor's utilization is more than 10 %, and is due to the additional 2 cycles of context switching which remain empty (out of 17 cycles, on average).
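The effect of the longer context switch can be checked against the saturated-utilization estimate used earlier (a sketch; names are illustrative):

```python
def saturated_utilization(lt, stall, tcs):
    # utilization for many threads: only the context switch is exposed
    return lt / (lt + (lt - 1) * stall + tcs)

u_tcs1 = saturated_utilization(10, 0.4, 1)   # ≈ 0.685, as in Fig.3
u_tcs3 = saturated_utilization(10, 0.4, 3)   # ≈ 0.602 (10 of ~17 cycles)
drop = (u_tcs1 - u_tcs3) / u_tcs1            # a relative drop above 10 %
```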

Fig.13 shows utilization of the processor and memory in a 2-1 processor, also for *tcs* = 3. The reduction of utilization is much smaller in this case and is within 5 % (when compared with Fig.4).


**Figure 12.** Processor (-o-) and memory (-x-) utilization for a 1-1 processor; *lt* = 10, *tm* = 5, *tcs* = 3, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

**Figure 13.** Processor (-o-) and memory (-x-) utilization for a 2-1 processor; *lt* = 10, *tm* = 5, *tcs* = 3, *ps*<sup>1</sup> = 0.2, *ps*<sup>2</sup> = 0.1

## **5. Concluding remarks**

Simultaneous multithreading discussed in this paper is used to increase the performance of processors by tolerating long–latency operations. Since long–latency operations play an increasingly important role in modern computer systems, so does simultaneous multithreading. Its implementation, as well as the required hardware resources, is much simpler than in the case of the out–of–order approach, and the resulting speedup scales well with the number of simultaneous threads. The main challenge of simultaneous multithreading is to balance the system by maintaining the right relationship between the number of simultaneous threads and the performance of the memory hierarchy.

All presented results indicate that the number of available threads required for improved performance of the processor is quite small, typically exceeding the number of simultaneous threads by only 2 or 3. The results also show that a larger number of available threads provides rather insignificant improvements of the system's performance.

The presented models of multithreaded processors are quite simple and, for small values of the modeling parameters (*nt*, *np*, *ns*), can be analyzed by exploration of the state space. The following tables compare some results for 1-1 and 3-2 processors:


| *nt* | number of states | analytical utilization | simulated utilization |
|---|---|---|---|
| 1 | 11 | 0.538 | 0.536 |
| 2 | 52 | 0.670 | 0.671 |
| 3 | 102 | 0.684 | 0.685 |
| 4 | 152 | 0.685 | 0.686 |
| 5 | 202 | 0.685 | 0.686 |

**Table 2.** A comparison of simulation and analytical results for 1-1 processors.


| *nt* | number of states | analytical utilization | simulated utilization |
|---|---|---|---|
| 1 | 11 | 0.538 | 0.536 |
| 2 | 80 | 1.030 | 1.031 |
| 3 | 264 | 1.384 | 1.381 |
| 4 | 555 | 1.568 | 1.568 |
| 5 | 951 | 1.655 | 1.647 |

**Table 3.** A comparison of simulation and analytical results for 3-2 processors.

The comparisons show that the results obtained by simulation of net models are very similar to the analytical results obtained from the analysis of states and state transitions.

A similar performance analysis of simultaneous multithreading, but using a slightly different model, was presented in [20]. All results presented there are very similar to the results presented in this work, which is an indication that the performance of simultaneous multithreaded systems is insensitive to (at least some) variations of implementation.

It should also be noted that the presented model is oversimplified with respect to the probabilities of pipeline stalls and does not take into account the dependence of stall probabilities on the history of instruction issuing. In fact, the model is "pessimistic" in this regard, and the predicted performance, presented in the paper, is worse than the expected performance of real systems. However, the simplification effects are not expected to be significant.

**Acknowledgement**

The Natural Sciences and Engineering Research Council of Canada partially supported this research through grant RGPIN-8222.

**Author details**

Wlodek M. Zuberek

*Memorial University, St.John's, Canada, and University of Life Sciences, Warsaw, Poland*
