• *M*<sub>0</sub> ∈ **N**<sup>*n*</sup> is a vector that contains the initial marking for each place (initial state).

• *Atts* : (*Dist*, *W*, *G*, *Policy*, *Concurrency*)<sup>*m*</sup> comprises a set of attributes for the *m* transitions, where

• *Dist* ∈ **N**<sup>*m*</sup> → F is a possibly marking-dependent firing probability distribution function. In a stochastic timed Petri net, time has to elapse between the enabling and firing of a transition. The actual firing time is a random variable, for which the distribution is specified by F. We distinguish between immediate transitions (F = 0) and timed transitions, for which the domain of F is (0, ∞).

• *W* ∈ **R**<sup>+</sup> is the weight function, which represents a firing weight *w<sub>t</sub>* for immediate transitions or a rate *λ<sub>t</sub>* for timed transitions. The latter is only meaningful for the standard case of timed transitions with exponentially distributed firing delays. For immediate transitions, the value specifies a relative probability to fire the transition when there are several immediate transitions enabled in a marking, and all have the same probability. A random choice is then applied using the probabilities *w<sub>t</sub>*.

• *G* ∈ **N**<sup>*n*</sup> → {*true*, *false*} is a function that assigns a guard condition related to place markings to each transition. Depending on the current marking, transitions may not fire (they are disabled) when the guard function returns false. This is an extension of inhibitor arcs.

• *Policy* ∈ {*prd*, *prs*} is the preemption policy (*prd*, *preemptive repeat different*, means that when a preempted transition becomes enabled again the previously elapsed firing time is lost; *prs*, *preemptive resume*, means that the firing time related to a preempted transition is resumed when the transition becomes enabled again),

• *Concurrency* ∈ {*ss*, *is*} is the concurrency degree of transitions, where *ss* represents single server semantics and *is* depicts infinite server semantics in the same sense as in queueing models. Transitions with policy *is* can be understood as having an individual transition for each set of input tokens, all running in parallel.

In many circumstances, it might be suitable to represent the initial marking as a mapping from the set of places to natural numbers (*m*<sub>0</sub> : *P* → **N**), where *m*<sub>0</sub>(*p<sub>i</sub>*) denotes the initial marking of place *p<sub>i</sub>*. *m*(*p<sub>i</sub>*) denotes a reachable marking (reachable state) of place *p<sub>i</sub>*. In this work, the notation #*p<sub>i</sub>* has also been adopted for representing *m*(*p<sub>i</sub>*).

**2.7. Dependability**

Dependability of a computer system must be understood as the ability to deliver services, with respect to some agreed-upon specifications of desired service, that can be fully trusted [1, 13]. Indeed, dependability is related to disciplines such as fault tolerance and reliability. Reliability is the probability that the system will deliver a set of services for a given period of time, whereas a system is fault tolerant when it does not fail even when there are faulty components. Availability is another important concept, which quantifies the combined effect of the failure and repair processes in a system. In general, availability and reliability are related concepts, but they differ in the sense that the former may consider maintenance of failed components [8] (e.g., a failed component is restored to a specified condition).

In many situations, modeling is the method of choice either because the system might not yet exist or due to the inherent complexity of creating specific scenarios under which the system should be evaluated. In a very broad sense, models for dependability evaluation can be classified as simulation and mathematical models. However, this does not mean that mathematical models cannot be simulated. Indeed, many mathematical models, besides being analytically tractable, may also be evaluated by simulation. Mathematical models can be characterized as being either state-based or non-state-based.

Dependability metrics (e.g., availability, reliability and downtime) might be calculated either by using RBD or SPN (to mention only the models adopted in this work). RBDs allow one to represent component networks and provide closed-form equations, so the results are usually obtained faster than with SPN simulation. Nevertheless, when representing maintenance policies and redundant mechanisms, particularly those based on dynamic redundancy methods, such models have drawbacks in handling failure and repair dependencies thoroughly. On the other hand, state-based methods can easily consider those dependencies, thus allowing the representation of complex redundant mechanisms as well as sophisticated maintenance policies. However, they suffer from state-space explosion. Some of those formalisms allow both numerical analysis and stochastic simulation, and SPN is one of the most prominent models of this class.

If one is interested in calculating the availability (*A*) of a given device or system, one needs either the uptime and downtime or the time to failure (*TTF*) and time to repair (*TTR*). When the uptime and downtime are not available, the latter option is the means. If the evaluator needs only the mean value, the metrics commonly adopted are Mean Time To Failure (*MTTF*) and Mean Time To Repair (*MTTR*) (other central values might also be adopted). However, if one is also interested in the availability variation, the standard deviation of the time to failure (*sd*(*TTF*)) and the respective standard deviation of the time to repair (*sd*(*TTR*)) allow one to estimate it.

The availability (*A*) is obtained by steady-state analysis or simulation, and the following equation expresses its relation to *MTTF* and *MTTR*:

$$A = \frac{MTTF}{MTTF + MTTR} \tag{1}$$
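As a quick numerical illustration of Equation 1, the following minimal Python sketch computes the steady-state availability and the corresponding yearly downtime. The function names, the 8760-hour year and the sample *MTTF*/*MTTR* values are illustrative assumptions, not figures taken from this work.

```python
HOURS_PER_YEAR = 8760.0  # assumption: a non-leap year


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability from Equation 1: A = MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


def annual_downtime_hours(a: float) -> float:
    """Expected downtime per year: the unavailability (1 - A) scaled to hours."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical component: fails every 4000 h on average, takes 8 h to repair.
    a = availability(mttf=4000.0, mttr=8.0)
    print(f"A = {a:.6f}, downtime = {annual_downtime_hours(a):.2f} h/year")
```

Both functions use only the mean values, so they say nothing about the availability variation discussed above.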

Through transient analysis or simulation, the reliability (*R*) is obtained, and, then, the *MTTF* can be calculated as well as the standard deviation of the Time To Failure (*TTF*):

$$MTTF = \int\_0^\infty tf(t)dt = \int\_0^\infty -\frac{dR(t)}{dt}t dt = \int\_0^\infty R(t)dt\tag{2}$$

$$sd(TTF) = \sqrt{\int\_0^\infty t^2 f(t)dt - (MTTF)^2} \tag{3}$$
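Equations 2 and 3 can be evaluated numerically when the failure density is known. The sketch below is one possible way to do so, assuming a plain trapezoidal rule and a finite upper integration limit chosen large enough for the tail to be negligible; the helper names and the exponential example are illustrative assumptions.

```python
import math
from typing import Callable, Tuple


def trapezoid(g: Callable[[float], float], upper: float, steps: int) -> float:
    """Composite trapezoidal rule for the integral of g over [0, upper]."""
    h = upper / steps
    total = 0.5 * (g(0.0) + g(upper))
    for i in range(1, steps):
        total += g(i * h)
    return total * h


def mttf_and_sd(pdf: Callable[[float], float], upper: float,
                steps: int = 100_000) -> Tuple[float, float]:
    """MTTF (Equation 2, first integral) and sd(TTF) (Equation 3) from f(t)."""
    mttf = trapezoid(lambda t: t * pdf(t), upper, steps)
    second_moment = trapezoid(lambda t: t * t * pdf(t), upper, steps)
    # Guard against a tiny negative value caused by rounding.
    return mttf, math.sqrt(max(second_moment - mttf ** 2, 0.0))


def mttf_from_reliability(reliability: Callable[[float], float], upper: float,
                          steps: int = 100_000) -> float:
    """MTTF as the integral of R(t) (Equation 2, last integral)."""
    return trapezoid(reliability, upper, steps)


if __name__ == "__main__":
    rate = 0.01  # hypothetical constant failure rate (per hour)
    upper = 5000.0  # truncation point: 50 mean lifetimes
    mttf, sd = mttf_and_sd(lambda t: rate * math.exp(-rate * t), upper)
    mttf_r = mttf_from_reliability(lambda t: math.exp(-rate * t), upper)
    # For an exponential TTF, MTTF and sd(TTF) both equal 1/rate analytically.
    print(mttf, sd, mttf_r)
```

The truncation point and the number of steps are tuning choices; a heavy-tailed *TTF* would need a larger upper limit or a different quadrature.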

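Considering a given period *t*, *R*(*t*) is the probability that the time to failure is greater than or equal to *t*. In terms of the instantaneous failure rate, reliability is computed as follows (for exponential failure distributions the rate is constant, and the expression reduces to *R*(*t*) = exp(−*λt*)):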

$$R(t) = \exp\left[-\int\_0^t \lambda(t')dt'\right] \tag{4}$$

where *λ*(*t*′) is the instantaneous failure rate. One should bear in mind that, for computing the reliability of a given system service, the repair activity of the respective service must not be represented. Besides, taking into account *UA* = 1 − *A* (unavailability) and Equation 1, the following equation is derived:

$$\text{MTTR} = \text{MTTF} \times \frac{\text{U}A}{A} \tag{5}$$


Likewise, the standard deviation of the Time To Repair (*TTR*) can be calculated as follows:

$$sd(TTR) = sd(TTF) \times \frac{UA}{A} \tag{6}$$
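Equations 5 and 6 share the factor *UA*/*A*, so both repair statistics can be obtained in one step once the availability is known. The following is a minimal sketch under that reading; the function name and the sample values are hypothetical.

```python
from typing import Tuple


def repair_stats(mttf: float, sd_ttf: float, a: float) -> Tuple[float, float]:
    """Return (MTTR, sd(TTR)) from Equations 5 and 6, given availability A."""
    if not 0.0 < a <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    ratio = (1.0 - a) / a  # UA / A, with UA = 1 - A
    return mttf * ratio, sd_ttf * ratio


if __name__ == "__main__":
    # Hypothetical inputs: MTTF = sd(TTF) = 4000 h and A = 4000 / 4008.
    mttr, sd_ttr = repair_stats(mttf=4000.0, sd_ttf=4000.0, a=4000.0 / 4008.0)
    print(f"MTTR = {mttr:.3f} h, sd(TTR) = {sd_ttr:.3f} h")
```

With these inputs the computed *MTTR* is consistent with Equation 1, since an *MTTF* of 4000 h and an *MTTR* of 8 h give exactly the availability used above.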

Next, *MTTF*/*sd*(*TTF*) (and *MTTR*/*sd*(*TTR*)) are computed for choosing the expolynomial distribution that best fits the *TTF* and *TTR* distributions [6, 22].

Figure 12 depicts the generic simple component model using SPN, which provides a high-level representation of a subsystem. One should notice the trapezoidal shape of the transitions (high-level transitions named s-transitions). This shape means that the delays of such transitions are not exponentially distributed; instead, they should be refined by subnets. The delay assigned to s-transition *f* is the *TTF* and the delay of s-transition *r* is the *TTR*. If the *TTF* and *TTR* are exponentially distributed, the shape of the transitions should be the regular one (white rectangles) and *TTF* and *TTR* should be summarized by the respective *MTTF* and *MTTR*.

**Figure 12.** Generic simple model - SPN

A well-established method that considers *expolynomial distribution* random variables is based on distribution moment matching. The moment matching process presented in [6] takes into account that Hypoexponential and Erlangian distributions have an average delay (*μ*) greater than the standard deviation (*σ*), that is, *μ* > *σ*, and that Hyperexponential distributions have *μ* < *σ*, in order to represent an activity with a generally distributed delay as an Erlangian or a Hyperexponential subnet referred to as s-transition<sup>1</sup>. One should note that in cases where these distributions have *μ* = *σ*, they are, indeed, equivalent to an exponential distribution with parameter equal to 1/*μ*. Therefore, according to the coefficient of variation associated with an activity's delay, an appropriate s-transition implementation model could be chosen. For each s-transition implementation model (see Figure 13), a set of parameters should be configured for matching their first and second moments. In other words, an associated delay distribution (it might have been obtained by a measuring process) of the original activity is matched with the first and second moments of the s-transition (*expolynomial distribution*).

<sup>1</sup> In this work, *μ* could be *MTTF* or *MTTR* and *σ* could represent *sd*(*TTF*) or *sd*(*TTR*), for instance.

According to the aforementioned method, an activity with *μ* < *σ* is approximated by a two-phase Hyperexponential distribution with parameters

$$r\_1 = \frac{2\mu^2}{\mu^2 + \sigma^2},\tag{7}$$

$$r\_2 = 1 - r\_1 \tag{8}$$

and


$$
\lambda = \frac{2\mu}{(\mu^2 + \sigma^2)},\tag{9}
$$

where *λ* is the rate associated with phase 1, *r*<sub>1</sub> is the probability assigned to this phase, and *r*<sub>2</sub> is the probability assigned to phase 2. In this particular model, the rate assigned to phase 2 is assumed to be infinite, that is, the related average delay is zero.

**Figure 13.** Hyperexponential Model
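The Hyperexponential parameters of Equations 7 to 9 can be computed and checked with a short Python sketch. It assumes the model described above, in which phase 2 has zero delay, so the resulting delay is exponential with rate *λ* with probability *r*<sub>1</sub> and zero otherwise. The function names and the sample values are illustrative assumptions.

```python
import math
from typing import Tuple


def hyperexponential_params(mu: float, sigma: float) -> Tuple[float, float, float]:
    """Return (r1, r2, lam) from Equations 7, 8 and 9; requires 0 < mu < sigma."""
    if not 0.0 < mu < sigma:
        raise ValueError("the Hyperexponential model requires 0 < mu < sigma")
    denom = mu ** 2 + sigma ** 2
    r1 = 2.0 * mu ** 2 / denom
    lam = 2.0 * mu / denom
    return r1, 1.0 - r1, lam


def matched_moments(r1: float, lam: float) -> Tuple[float, float]:
    """Mean and standard deviation of the delay when phase 2 takes zero time."""
    mean = r1 / lam                      # r1 * E[Exp(lam)]
    second_moment = 2.0 * r1 / lam ** 2  # r1 * E[Exp(lam)^2]
    return mean, math.sqrt(second_moment - mean ** 2)


if __name__ == "__main__":
    # Hypothetical measured delay: mean 10 h, standard deviation 20 h.
    r1, r2, lam = hyperexponential_params(mu=10.0, sigma=20.0)
    print(r1, r2, lam)
    print(matched_moments(r1, lam))  # should reproduce mu and sigma
```

The second function is only a consistency check: it recomputes the first two moments from the fitted parameters so that they can be compared with the measured *μ* and *σ*.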

Activities with coefficients of variation less than one might be mapped either to Hypoexponential or Erlangian s-transitions. If *μ*/*σ* ∉ **N**, *μ*/*σ* ≠ 1 (*μ*, *σ* ≠ 0), the respective activity is represented by a Hypoexponential distribution with parameters *λ*<sub>1</sub>, *λ*<sub>2</sub> (exponential rates) and *γ*, the integer representing the number of phases with rate equal to *λ*<sub>2</sub>, whereas the number of phases with rate equal to *λ*<sub>1</sub> is one. In other words, the s-transition is represented by a subnet composed of two exponential transitions and one immediate transition. The average delay assigned to the exponential transition *t*<sub>1</sub> is equal to *μ*<sub>1</sub> (*λ*<sub>1</sub> = 1/*μ*<sub>1</sub>), and the respective average delay assigned to the exponential transition *t*<sub>2</sub> is *μ*<sub>2</sub> (*λ*<sub>2</sub> = 1/*μ*<sub>2</sub>). *γ* is the integer value considered as the weight assigned to the output arc of transition *t*<sub>1</sub> as well as the input arc weight value of the immediate transition *t*<sub>3</sub> (see Figure 14). These parameters are calculated by the following expressions:

$$(\frac{\mu}{\sigma})^2 - 1 \le \gamma < (\frac{\mu}{\sigma})^2,\tag{10}$$

$$
\lambda\_1 = \frac{1}{\mu\_1} \text{ and } \lambda\_2 = \frac{1}{\mu\_2},
\tag{11}
$$

where

$$\mu\_1 = \frac{\mu \pm \sqrt{\gamma(\gamma+1)\sigma^2 - \gamma\mu^2}}{\gamma+1},\tag{12}$$

$$\mu\_2 = \frac{\gamma \mu \mp \sqrt{\gamma(\gamma+1)\sigma^2 - \gamma \mu^2}}{\gamma+1} \tag{13}$$

If *μ*/*σ* ∈ **N**, *μ*/*σ* ≠ 1 (*μ*, *σ* ≠ 0), the activity is represented by an Erlangian s-transition with two parameters: *γ* = (*μ*/*σ*)<sup>2</sup>, an integer representing the number of phases of this distribution; and *μ*<sub>1</sub> = *μ*/*γ*, where *μ*<sub>1</sub> (= 1/*λ*<sub>1</sub>) is the average delay value of each phase. The Erlangian model is a particular case of the Hypoexponential model, in which each individual phase rate has the same value.
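The choice between the Erlangian and the Hypoexponential model for an activity with *μ* > *σ* can be sketched in Python as follows. The function name, the tolerance used to decide whether *μ*/*σ* is an integer, and the sample values are assumptions. The Hypoexponential values are computed exactly as Equations 10, 12 and 13 are written, taking the upper sign of ± and ∓, which is also an assumption since the text does not say which sign to prefer.

```python
import math
from typing import Dict, Tuple


def low_variation_s_transition(mu: float, sigma: float,
                               tol: float = 1e-9) -> Tuple[str, Dict[str, float]]:
    """Pick the s-transition model for mu > sigma > 0 and return its parameters."""
    if not 0.0 < sigma < mu:
        raise ValueError("this sketch covers only 0 < sigma < mu")
    ratio = mu / sigma
    nearest = round(ratio)
    if abs(ratio - nearest) <= tol:
        # mu/sigma is an integer: Erlangian model with gamma = (mu/sigma)^2 phases.
        gamma = nearest ** 2
        return "erlang", {"gamma": gamma, "mu1": mu / gamma}
    # Otherwise: Hypoexponential model; gamma is the integer allowed by Equation 10,
    # i.e. the smallest integer not below (mu/sigma)^2 - 1.
    gamma = math.ceil(ratio ** 2 - 1.0)
    root = math.sqrt(max(gamma * (gamma + 1) * sigma ** 2 - gamma * mu ** 2, 0.0))
    mu1 = (mu + root) / (gamma + 1)          # Equation 12, upper sign
    mu2 = (gamma * mu - root) / (gamma + 1)  # Equation 13, upper sign
    return "hypoexponential", {"gamma": gamma, "mu1": mu1, "mu2": mu2}


if __name__ == "__main__":
    # Hypothetical delays: (mean, standard deviation) in hours.
    print(low_variation_s_transition(10.0, 5.0))  # mu/sigma = 2, an integer
    print(low_variation_s_transition(10.0, 4.0))  # mu/sigma = 2.5, not an integer
```

The sketch only evaluates the closed-form expressions; it does not build or simulate the corresponding subnet.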

**Figure 14.** Hypoexponential Model

The reader should refer to [6] for details regarding the representation of expolynomial distributions using SPN. For the sake of simplicity, the SPN models presented in the next sections consider only exponential distributions.

Depending on the system characteristics, an RBD model (Figure 15) could be adopted instead of the SPN counterpart, whenever the former is more suitable.

**Figure 15.** Generic simple model - RBD

Different from previous works, this paper proposes a set of models for the quantification of dependability metrics in the context of data center design. Furthermore, the adopted methodology for the quantification of those values takes into account a hybrid modeling approach, which utilizes RBD and SPN whenever they are best suited. The idea of mixing state (SPN) and non-state (RBD) based models is not new (e.g., [23]), but, as far as we are aware, there is no similar work that applies such a technique to the evaluation of data center infrastructures. Besides, a tool is proposed to automate several activities.

The following sections present the adopted dependability models.

**4. Dependability models**

**RBD Models**

Reliability Block Diagram (RBD) [8] is a combinatorial model that was initially proposed as a technique for calculating reliability of systems using intuitive block diagrams. Such a technique has also been extended to calculate other dependability metrics, such as availability and maintainability [10]. Figure 16 depicts two examples, in which independent blocks are arranged through series (Figure 16(a)) and parallel (Figure 16(b)) compositions.

**Figure 16.** Reliability Block Diagram

In the series arrangement, if a single component fails, the whole system is no longer operational. Assuming a system with *n* independent components, the reliability (instantaneous availability or steady state availability) is obtained by

$$P\_s = \prod\_{i=1}^{n} P\_i \tag{14}$$

where *P<sub>i</sub>* is the reliability *R<sub>i</sub>*(*t*) (instantaneous availability *A<sub>i</sub>*(*t*) or steady state availability *A<sub>i</sub>*) of block *b<sub>i</sub>*.

For a parallel arrangement (see Figure 16(b)), if a single component is operational, the whole system is also operational. Assuming a system with *n* independent components, the reliability (instantaneous availability or steady state availability) is obtained by

$$P\_p = 1 - \prod\_{i=1}^{n} (1 - P\_i) \tag{15}$$

where *P<sub>i</sub>* is the reliability *R<sub>i</sub>*(*t*) (instantaneous availability *A<sub>i</sub>*(*t*) or steady state availability *A<sub>i</sub>*) of block *b<sub>i</sub>*.

A k-out-of-n system functions if and only if *k* or more of its *n* components are functioning. Let *p* be the success probability of each of those blocks. The system success probability (reliability or availability) is depicted by:
