point indicates the defects not found in this release; they will become part of the next release. That is, not all delivered defects will be found during the operation period. It also demonstrates that the actual data follows the prediction, indicating the importance of historical data in predicting post-delivery defects.

Figure 14 illustrates the difference between the defect rate at the end of the test phase te, λ(te), and the defect rate during the operation period, λf. The observed difference is likely due to differences in the intensity of testing during the two periods, as well as possible differences between a test environment and a field operational state.

Figure 14. Weekly view of project B customer defect prediction vs. actual.

56 Telecommunication Networks - Trends and Developments

### 5. Software failure rate, availability, and reliability

In recent years, many product suppliers have been implementing complex software-controlled systems with a large number of software features on a short development schedule. In the telecom industry, a critical customer operational issue is system performance, especially in terms of system outages impacting service availability for end users. As a result, service providers frequently ask their product suppliers for software reliability and availability measurements. In this section, we discuss the relationship between software failure rate, availability, and reliability.

#### 5.1. Software failure rate

Field outage measurements are required for telecom products by TL9000 [26], a quality management system (QMS) that standardizes the quality system requirements for the design, development, delivery, installation, and maintenance of telecom products and services. It defines reliability in terms of the SO3 (service outage frequency) and SO4 (service outage duration) metrics. As demonstrated in [16], the defect find process during the operation period may be modeled as a stationary Poisson process. It follows that the software failure (or outage) rate for each release can also be modeled as a stationary Poisson process. Consider a software release with a failure rate λ and a defect rate λf; λ is usually measured in failures per year. The defect conversion factor may then be expressed as shown in Eq. (7):

$$d = \frac{\lambda}{\lambda_f} \tag{7}$$

Reliability and availability are among the key factors used to define the quality of software in practice. In what follows, we formulate mathematical representations for both.

#### 5.2. Availability

The availability of software can be expressed using cycles of uninterrupted working intervals (Uptime), each followed by a repair period after a failure has occurred (Downtime), as shown in Eq. (8).

$$A = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} = 1 - \frac{\text{Downtime}}{\text{Uptime} + \text{Downtime}}\tag{8}$$

Considering that availability is typically evaluated over a 1-year period, Uptime + Downtime = 1 year = 60 × 24 × 365 minutes = 525,600 minutes. Therefore, as an example, to achieve a system availability of five 9's (i.e., A = 99.999%), the maximum allowed downtime would be 5.26 minutes/year.
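This arithmetic can be sketched directly from Eq. (8); the function name below is illustrative, not part of TL9000.

```python
# Allowed annual downtime implied by an availability target, per Eq. (8),
# assuming Uptime + Downtime = 1 year = 525,600 minutes.

MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes

def allowed_downtime_minutes(availability: float) -> float:
    """Maximum downtime (minutes/year) consistent with availability A."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for nines in range(2, 6):
    a = 1.0 - 10.0 ** (-nines)
    print(f"{nines} nines: at most {allowed_downtime_minutes(a):.2f} min/year of downtime")
```

For five 9's this reproduces the 5.26 minutes/year figure quoted above.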

#### 5.3. Reliability

Software reliability, on the other hand, is the probability that the software has not failed after a time period t. Reliability is therefore a function of t, denoted R(t). R(t) is typically modeled using an exponential distribution whose parameter is the failure rate λ, as shown in Eq. (9):

$$R(t) = \exp\left(-\lambda t\right) \tag{9}$$
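As a minimal sketch, the survival probability R(t) for a release with an assumed failure rate can be computed directly; the value of λ below is illustrative.

```python
import math

def reliability(t_years: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t): probability of surviving to time t without a failure."""
    return math.exp(-failure_rate * t_years)

# A hypothetical release with lambda = 0.5 failures/year:
print(round(reliability(1.0, 0.5), 4))  # about 0.61: a ~61% chance of a failure-free year
```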

It is important to note that while both reliability and availability are measures of software quality, they have different technical meanings. In particular, availability is determined by both uptime and downtime, while reliability is influenced only by uptime. This implies that two software releases or systems having the same failure rate would have the same reliability, but might have different availabilities. Achieving high availability generally requires automated ways of recovering from failures, for example through redundancy or rebooting, so that downtime is minimized. Software failures from which the system is able to automatically recover are known as covered failures. On the other hand, if a system fails to automatically detect and/or recover from a failure, such a failure is known as an uncovered failure, and usually leads to customer-perceived defects. In systems where recovery time is significant, a coverage factor, defined as the proportion of all failures that are covered failures, is used. However, in most practical applications, specialized tools are required to determine covered failures; therefore, typical failure counts consider only the uncovered defects.
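The distinction can be illustrated with a small sketch: two hypothetical systems with the same failure rate (hence the same reliability) but different recovery times end up with different availabilities. All numbers are illustrative assumptions.

```python
import math

MINUTES_PER_YEAR = 60 * 24 * 365

lam = 2.0  # failures/year, identical for both hypothetical systems
recovery_minutes = {"auto_failover": 0.5, "manual_recovery": 30.0}

r_one_year = math.exp(-lam)  # Eq. (9): reliability depends only on the failure rate
for name, minutes in recovery_minutes.items():
    downtime = lam * minutes                          # expected downtime per year
    availability = 1.0 - downtime / MINUTES_PER_YEAR  # Eq. (8)
    print(f"{name}: R(1 year) = {r_one_year:.4f}, A = {availability:.7f}")
```

Both systems print the same R but different A, matching the observation above.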

#### 5.4. Discussion

In what follows, we use (anonymised/scaled) data from project A to demonstrate the various aspects of software failure rate, reliability, and availability, together with the predictions carried out for each. The data compares multiple releases of a software product over multiple years. The outage data represents unplanned, customer-reported, uncovered failures, including full and partial outages, collected across a deployment of over 400 systems. The monthly outage count is annualized and normalized by the number of deployed systems as outages/year/system, which is equivalent to the failure rate. In the same way, the monthly outage downtime is annualized and normalized by deployed systems as downtime/year/deployed system. It should be noted that the downtime duration of each outage is discounted by its percentage impact (100% being a full outage), following the TL9000 counting rules.
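The normalization just described might be sketched as follows; the outage records, fleet size, and one-month window are illustrative assumptions, not project A data.

```python
# Annualize and normalize outage data: monthly records -> outages/year/system
# and impact-discounted downtime/year/system (illustrative values only).

outage_records = [
    # (duration_minutes, impact_fraction) where 1.0 means a full outage
    (12.0, 1.0),
    (40.0, 0.25),
    (8.0, 0.5),
]
deployed_systems = 400
window_months = 1

annualize = 12 / window_months
failure_rate = len(outage_records) * annualize / deployed_systems
downtime = sum(d * imp for d, imp in outage_records) * annualize / deployed_systems

print(f"failure rate: {failure_rate:.3f} outages/year/system")
print(f"downtime: {downtime:.3f} impact-discounted minutes/year/system")
```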

In Figure 15, we show the predicted software reliability as a function of failure rate (on the left) and software availability as a function of annual downtime (on the right). The following observations can be made:


Figure 15. Release-over release software reliability and availability prediction—Project A.

Finally, we applied the method to Project B. In this project, it is not practical to collect the system downtime in the field due to the nature of the product. However, customers are concerned about resets; the focus is therefore on the number of unplanned autonomous resets. Figure 16 summarizes the annual reset rate with prediction and actual data over several releases. The predicted values are remarkably close to the actual data and within the 90% limits. Although actual downtime is not available, we can use reset time measured in the lab to calculate a reset-based availability from the reset rate prediction.

Figure 16. Software failure rate prediction—Project B.

Software Quality Assurance
http://dx.doi.org/10.5772/intechopen.79839

### 6. Implementation

Figure 17 shows the implementation of BRACE. The tool is made up of multiple application programming interfaces (APIs), each connecting to a defect logging database (such as JIRA). Defect data is collected from the defect databases in real time and pre-processed by a Python program before being stored in a cloud-based, shared database used by the system. The SRGM algorithm (also written in Python) then performs the core processing, providing consistent, fast, flexible, robust, and statistically sound results. Using the output of the core processing, we have also created a unified graphical user interface (GUI) on which a wealth of software quality metrics are presented to users. While in the current implementation all components of the tool are hosted in a virtual machine running in OpenStack, they could also run on a dedicated server if needed.

As an example use case, for a given project, a number of input parameters are required by the tool. Such inputs include the project milestones, the required changes in defect rate before and after deployment, and a number of assumptions based on expert knowledge of both the product and the development process.
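The collect → pre-process → fit pipeline described in this section could be sketched roughly as follows. This is not BRACE's actual code: the record format, the choice of a Goel-Okumoto mean-value function for the SRGM, and the grid-search fit are all illustrative assumptions.

```python
import math

# Hypothetical sketch: reduce defect records to weekly cumulative counts,
# then fit a Goel-Okumoto SRGM, mu(t) = a * (1 - exp(-b * t)).

def weekly_cumulative(defect_weeks, horizon):
    """Pre-processing: week number of each logged defect -> cumulative count per week."""
    return [sum(1 for w in defect_weeks if w <= t) for t in range(1, horizon + 1)]

def fit_goel_okumoto(ts, ys):
    """Least-squares fit over a grid of b; for each b the optimal a has a closed form."""
    best = None
    for i in range(1, 500):
        b = i / 100.0
        f = [1.0 - math.exp(-b * t) for t in ts]
        a = sum(y * fi for y, fi in zip(ys, f)) / sum(fi * fi for fi in f)
        sse = sum((y - a * fi) ** 2 for y, fi in zip(ys, f))
        if best is None or sse < best[2]:
            best = (a, b, sse)
    return best[0], best[1]

defect_weeks = [1, 1, 2, 3, 3, 3, 4, 5, 7, 10]  # hypothetical defect log
ts = list(range(1, 13))
ys = weekly_cumulative(defect_weeks, 12)
a, b = fit_goel_okumoto(ts, ys)
print(f"estimated total defects a = {a:.1f}, detection rate b = {b:.2f}/week")
```

The gap between a and the last cumulative count is the model's estimate of defects that will surface after delivery, which is the quantity the predictions in this chapter track.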
