## **3. Related works**

In the last few years, some works have been developed to perform dependability analysis of data center systems [24][26][27]. Reliability (which encompasses both the durability of the data and its availability for access) corresponds to the primary property that data center users desire [2].

Robidoux [28] proposes the Dynamic RBD (DRBD) model, an extension to RBD that supports reliability analysis of systems with dependence relationships. The additional blocks (in relation to RBD) needed to model dependence make the DRBD model complex. The DRBD model is automatically converted to a CPN model in order to analyze behavioral properties, which may certify the correctness of the model [18]. An interesting alternative would be to model the system directly using CPN or any other formalism (e.g., SPN) that is able to perform dependability analysis as well as to model dependencies between components.

Wei [25] presents a hierarchical method to model and analyze virtual data centers (VDCs). The approach combines the advantages of both RBD and Generalized SPN (GSPN) for quantifying availability and reliability. Data center power architectures are not the focus of their research, and the proposed models are specific to VDCs.

Additionally, component redundancy to increase system reliability is costly. The authors of [7] propose an approach for reliability evaluation and risk analysis of dynamic process systems using stochastic Petri nets.

Different from previous works, this paper proposes a set of models for the quantification of dependability metrics in the context of data center design. Furthermore, the adopted methodology follows a hybrid modeling approach, which uses RBD and SPN wherever each is best suited. The idea of mixing state-based (SPN) and non-state-based (RBD) models is not new (e.g., [23]), but, to the best of our knowledge, there is no similar work that applies such a technique to the evaluation of data center infrastructures. Besides, a tool is proposed to automate several activities.

## **4. Dependability models**

The following sections present the adopted dependability models.

If *μ*/*σ* ∈ **N** and *μ*/*σ* ≠ 1 (with *μ*, *σ* ≠ 0), an Erlangian s-transition with two parameters is adopted: *γ* = (*μ*/*σ*)<sup>2</sup>, an integer representing the number of phases of this distribution, and *μ*<sub>1</sub> = *μ*/*γ*, where *μ*<sub>1</sub> (1/*λ*<sub>1</sub>) is the average delay value of each phase. The Erlangian model is a particular case of a Hypoexponential model, in which each individual phase rate has the same value.

**Figure 14.** Hypoexponential Model

The reader should refer to [6] for details regarding the representation of expolinomial distributions using SPN. For the sake of simplicity, the SPN models presented in the next sections consider only exponential distributions.

Depending on the system characteristics, an RBD model (Figure 15) could be adopted instead of the SPN counterpart, whenever the former is more suitable.

**Figure 15.** Generic simple model - RBD

#### **RBD Models**

Reliability Block Diagram (RBD) [8] is a combinatorial model that was initially proposed as a technique for calculating reliability of systems using intuitive block diagrams. Such a technique has also been extended to calculate other dependability metrics, such as availability and maintainability [10]. Figure 16 depicts two examples, in which independent blocks are arranged through series (Figure 16(a)) and parallel (Figure 16(b)) compositions.

In the series arrangement, if a single component fails, the whole system is no longer operational. Assuming a system with *n* independent components, the reliability (instantaneous availability or steady state availability) is obtained by

$$P_S = \prod_{i=1}^{n} P_i \tag{14}$$

where *P<sub>i</sub>* is the reliability *R<sub>i</sub>*(*t*) (or, alternatively, the instantaneous availability *A<sub>i</sub>*(*t*) or the steady-state availability *A<sub>i</sub>*) of block *b<sub>i</sub>*.

For a parallel arrangement (see Figure 16(b)), if a single component is operational, the whole system is also operational. Assuming a system with *n* independent components, the reliability (instantaneous availability or steady state availability) is obtained by

$$P_p = 1 - \prod_{i=1}^{n} (1 - P_i) \tag{15}$$

where *P<sub>i</sub>* is the reliability *R<sub>i</sub>*(*t*) (or, alternatively, the instantaneous availability *A<sub>i</sub>*(*t*) or the steady-state availability *A<sub>i</sub>*) of block *b<sub>i</sub>*.
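The series and parallel compositions of Equations (14) and (15) can be checked numerically with a few lines of Python. This is a minimal sketch, assuming independent blocks as stated in the text; the function names `series` and `parallel` and the block values are hypothetical and chosen only for illustration.

```python
from math import prod


def series(probabilities):
    """Eq. (14): every block must work, so the block probabilities multiply."""
    return prod(probabilities)


def parallel(probabilities):
    """Eq. (15): the system fails only if every block fails."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical steady-state availabilities of three independent blocks.
blocks = [0.99, 0.95, 0.90]
print(f"series:   {series(blocks):.6f}")
print(f"parallel: {parallel(blocks):.6f}")

# Compositions nest: two redundant pairs connected in series.
print(f"nested:   {series([parallel([0.9, 0.9]), parallel([0.95, 0.95])]):.6f}")
```

Because both functions take and return plain probabilities, the same code applies whether the inputs are reliabilities at a given time *t*, instantaneous availabilities, or steady-state availabilities.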

A k-out-of-n system functions if and only if *k* or more of its *n* components are functioning. Let *p* be the success probability of each of those blocks. The system success probability (reliability or availability) is given by:

$$
\sum_{i=k}^{n} \binom{n}{i} p^{i} (1-p)^{n-i} \tag{16}
$$
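The k-out-of-n success probability is the sum of the binomial terms for *i* = *k*, ..., *n* working blocks. The sketch below assumes *n* independent blocks with the same success probability *p*, as in the text; the function name `k_out_of_n` and the sample value of *p* are hypothetical.

```python
from math import comb


def k_out_of_n(k, n, p):
    """Probability that at least k of n independent, identical blocks work."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Sum the probability of exactly i working blocks for i = k..n.
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


p = 0.9  # hypothetical block success probability
print(f"2-out-of-3: {k_out_of_n(2, 3, p):.6f}")
print(f"1-out-of-3: {k_out_of_n(1, 3, p):.6f}")  # same as a parallel arrangement
print(f"3-out-of-3: {k_out_of_n(3, 3, p):.6f}")  # same as a series arrangement
```

The two limiting cases are a useful sanity check: *k* = 1 reproduces the parallel arrangement and *k* = *n* reproduces the series arrangement for identical blocks.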

For other examples and closed-form equations, the reader should refer to [10].

| Transition | Priority | Delay or Weight |
|---|---|---|
| X\_Failure | - | X\_MTTF |
| X\_Repair | - | IF #X\_Rel\_Flag=1:(10<sup>13</sup> × X\_MTTF) ELSE X\_MTTR |
| X\_Activate\_Spare1 | - | IF #X\_Rel\_Flag=1:(10<sup>13</sup> × MTActivate) ELSE MTActivate |
| X\_Failure\_Spare1 | - | X\_MTTF\_Spare1 |
| X\_Repair\_Spare1 | - | IF #X\_Rel\_Flag=1:(10<sup>13</sup> × X\_MTTF\_Spare1) ELSE X\_MTTR\_Spare1 |
| X\_Desactivate\_Spare1 | 1 | 1 |

**Table 2.** Cold standby model - Transition attributes.

## **5. Applications**

This section presents the applicability of the proposed models to the dependability analysis of real-world data center power architectures (from HP Labs Palo Alto, U.S. [12]). The ASTRO environment was adopted to conduct the case study. ASTRO was validated through our previous work [3][4][5].

### **5.1. Architectures**

Data center power infrastructure is responsible for providing uninterrupted, conditioned power at the correct voltage and frequency to the IT equipment. Figure 19(a) depicts a real-world power infrastructure. From the utility feed (i.e., AC Source), the power typically goes through voltage panels, uninterruptible power supply (UPS) units, power distribution units (PDUs) (composed of transformers and electrical subpanels), junction boxes, and, finally, rack PDUs (rack power distribution units). The power infrastructure (and, thus, the system) fails whenever both paths depicted in Figure 19 are unable to provide the power demanded (500 kW) by the IT components (50 racks). The reader should assume a path to be a set of redundant interconnected components inside the power infrastructure. Another architecture is analyzed with an additional electricity generator (Figure 19(b)) for supporting the system when both AC sources are not operational.

#### **SPN Models**

This section presents two SPN building blocks proposed for obtaining dependability metrics.

**Simple Component.** The simple component has two states: functioning or failed. To compute its availability, *MTTF* and *MTTR* should be represented. Figure 17 shows the SPN model of the "simple component", which has two parameters (not depicted in the figure), namely *X*\_*MTTF* and *X*\_*MTTR*, representing the delays associated with the transitions *X*\_*Failure* and *X*\_*Repair*, respectively.

**Figure 17.** Simple component model

Places *X*\_*ON* and *X*\_*OFF* represent the component's active and inactive states, respectively. The simple component also includes an arc from *X*\_*OFF* to *X*\_*Repair* whose multiplicity depends on the place marking. The multiplicity is defined by the expression IF(#*X*\_*Rel*\_*Flag* = 1):2 ELSE 1, where place *X*\_*Rel*\_*Flag* selects between reliability and availability evaluation. If the condition #*X*\_*Rel*\_*Flag* = 1 is true, the evaluation refers to reliability: the arc then requires two tokens, and since *X*\_*OFF* holds at most one, the repair transition is never enabled. Otherwise, the evaluation concerns availability.

Besides, although the simple component model has been presented using the exponential distribution, other expolinomial distributions that best fit the *TTF* and *TTR* may be adopted following the techniques presented in [22].
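For the exponential case, the two metrics of a two-state repairable component have well-known closed forms, which can serve as a cross-check for the values computed from the SPN model. The sketch below is illustrative only: it assumes exponentially distributed *TTF* and *TTR*, and the function names and the numeric values of MTTF and MTTR are hypothetical rather than taken from the chapter.

```python
from math import exp


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time the component is operational."""
    return mttf / (mttf + mttr)


def reliability(t, mttf):
    """Probability of no failure up to time t (no repair is considered)."""
    return exp(-t / mttf)


# Hypothetical parameters, in hours.
mttf, mttr = 8760.0, 8.0
print(f"availability:        {steady_state_availability(mttf, mttr):.6f}")
print(f"reliability at 720h: {reliability(720.0, mttf):.6f}")
```

Availability depends on both parameters, whereas reliability ignores *MTTR*, which mirrors the role of *X*\_*Rel*\_*Flag* in disabling the repair transition.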

**Cold standby.** A cold standby redundant system is composed of a non-active spare module that waits to be activated when the main active module fails. Figure 18 depicts the SPN model of this system, which includes four places, namely *X*\_*ON*, *X*\_*OFF*, *X*\_*Spare*1\_*ON* and *X*\_*Spare*1\_*OFF*, that represent the operational and failure states of the main and spare modules, respectively. The spare module (Spare1) is initially deactivated; hence, no tokens are initially stored in places *X*\_*Spare*1\_*ON* and *X*\_*Spare*1\_*OFF*. When the main module fails, the transition *X*\_*Activate*\_*Spare*1 is fired to activate the spare module.

Table 2 presents the attributes of each transition of the model. When reliability is evaluated (i.e., the number of tokens (#) in place *X*\_*Rel*\_*Flag* is 1), the *X*\_*Repair*, *X*\_*Activate*\_*Spare*1 and *X*\_*Repair*\_*Spare*1 transitions receive a very large delay (many times larger than the associated MTTF or MTActivate) to represent the absence of repair. MTActivate is the mean time to activate the spare module. In addition, when reliability is considered, the weight of the arc that connects place *X*\_*Wait*\_*Spare*1 to transition *X*\_*Activate*\_*Spare*1 is two; otherwise, it is one. Both availability and reliability may be computed from the probability *P*{#*X*\_*ON* = 1 OR #*X*\_*Spare*1\_*ON* = 1}.
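The conditional delays of Table 2 can be mirrored outside the SPN tool, for instance when preparing parameters for both evaluation modes. The following is a minimal sketch under stated assumptions: the function name, the dictionary layout and the sample values are hypothetical, and the multiplier printed in Table 2 is read here as 10<sup>13</sup>.

```python
NO_REPAIR_FACTOR = 1e13  # multiplier of Table 2, read here as 10^13 (assumption)


def cold_standby_delays(rel_flag, mttf, mttr, mttf_spare, mttr_spare, mt_activate):
    """Mean delay of each timed transition of the cold standby model.

    rel_flag mirrors #X_Rel_Flag: 1 selects reliability evaluation, in which
    the guarded transitions get a huge delay; any other value selects
    availability evaluation.
    """
    reliability_mode = rel_flag == 1
    factor = NO_REPAIR_FACTOR if reliability_mode else None
    return {
        "X_Failure": mttf,
        "X_Failure_Spare1": mttf_spare,
        # Guarded entries follow the IF ... ELSE expressions of Table 2.
        "X_Repair": factor * mttf if reliability_mode else mttr,
        "X_Activate_Spare1": factor * mt_activate if reliability_mode else mt_activate,
        "X_Repair_Spare1": factor * mttf_spare if reliability_mode else mttr_spare,
    }


# Hypothetical parameters, in hours.
params = dict(mttf=1000.0, mttr=4.0, mttf_spare=800.0, mttr_spare=6.0, mt_activate=0.1)
for flag in (0, 1):
    print(flag, cold_standby_delays(flag, **params))
```

The immediate transition *X*\_*Desactivate*\_*Spare*1 is omitted because it carries a priority and a weight rather than a delay.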


**Figure 18.** Cold standby model.

