Weather Nowcasting Using Deep Learning Techniques

*Makhamisa Senekane, Mhlambululi Mafu and Molibeli Benedict Taele*

#### **Abstract**

Weather variations play a significant role in people's short-term, medium-term and long-term planning. Therefore, understanding weather patterns has become very important in decision making. Short-term weather forecasting (nowcasting) involves the prediction of weather over a short period of time, typically a few hours. Different techniques have been proposed for short-term weather forecasting. Traditional nowcasting techniques are highly parametric, and hence complex. Recently, there has been a shift towards artificial intelligence techniques for weather nowcasting, including machine learning techniques such as artificial neural networks. In this chapter, we report the use of deep learning techniques for weather nowcasting. Three deep learning techniques, namely the multilayer perceptron, Elman recurrent neural networks and Jordan recurrent neural networks, were tested on meteorological data. The multilayer perceptron models achieved accuracies of 91 and 75% for sunshine and precipitation forecasting respectively, the Elman recurrent neural network models achieved accuracies of 96 and 97%, while the Jordan recurrent neural network models achieved accuracies of 97 and 97%. The results obtained underline the utility of deep learning for weather nowcasting.

**Keywords:** nowcasting, deep learning, artificial neural network, Elman network, Jordan network, precipitation, rainfall

#### **1. Introduction**

Weather changes play a significant role in people's short-term, medium-term and long-term planning. Therefore, understanding weather patterns has become very important in decision making. This further raises the need for tools for accurate prediction of weather. This need is even more pronounced if the prediction is intended for short-term weather forecasting, conventionally known as nowcasting.

To date, different weather nowcasting models have been proposed [1]. These models are mainly based on different variants of artificial neural networks and fuzzy logic. As will be discussed later in this chapter, these techniques have some limitations which need to be addressed. In this chapter, we report the use of multilayer perceptron (MLP) neural networks, Elman neural networks (ENN) and Jordan neural networks for solar irradiance (sunshine) and precipitation (rainfall) nowcasting. The approach taken in this work is in line with the observation given in [1], in the sense that the performances of these models are compared in order to establish which model performs best in weather nowcasting. The main contribution of the work reported in this chapter is the development of three solar irradiation and rainfall models using MLP, ENN and Jordan neural networks. These three models are examples of deep learning [2–4]. Therefore, the contribution of this work can be summarized as the use of deep learning models for weather nowcasting. Thus, the research question being addressed in this chapter is the design of integrated high-accuracy nowcasting techniques. Furthermore, the objectives of this work include:

	- The design of integrated high-accuracy nowcasting techniques using the following deep learning architectures:
		- MLP
		- ENN
		- Jordan recurrent neural networks
	- Application of such techniques to the following tasks:
		- sunshine nowcasting
		- precipitation nowcasting

The remainder of this chapter is organized as follows. The next section provides background information on artificial neural networks and related work on the use of artificial neural networks in weather nowcasting. This is followed by Section 3, which discusses the method used for the design and implementation of both the solar irradiation and the rainfall nowcasting models. Results are provided and discussed in Section 4, while Section 5 concludes this chapter.

#### **2. Preliminaries**

#### **2.1 Artificial neural networks (ANNs)**

An artificial neural network (ANN) is an example of supervised machine learning [5–7]. It draws inspiration from how biological neurons in the brain operate; thus, it mimics natural intelligence in learning from experience [5]. As a supervised learning algorithm, an ANN learns from examples by constructing an input-output mapping [8]. A typical ANN consists of an input layer, an output layer, and at least one hidden layer. Each layer consists of nodes representing neurons, and the layers are connected by weights. Each internal node of an artificial neural network computes two functions, namely a transfer function and an activation function [6, 7]. The transfer function is a function of the inputs ($x_i$) and weights ($w_i$), and is given as

$$f(\mathbf{x}) = \sum_{i} w_i x_i + b_i, \tag{1}$$

where $b_i$ is a bias value. The activation function $\varphi$, on the other hand, is nonlinear and hence responsible for modeling nonlinear relationships. Additionally, this function is differentiable [8]. The output of such an internal node is given as

$$y_i = \varphi(f(\mathbf{x})). \tag{2}$$
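As an illustration of Eqs. (1) and (2), the computation performed by a single internal node can be sketched as follows. This is a minimal Python sketch (the chapter's models were implemented in R with the RSNNS package); the choice of the sigmoid as the activation function $\varphi$ is an assumption for illustration, since Eq. (2) only requires $\varphi$ to be nonlinear and differentiable.

```python
import math

def neuron_output(x, w, b):
    # Eq. (1): transfer function f(x) = sum_i w_i * x_i + b
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Eq. (2): activation y = phi(f(x)); here phi is the sigmoid (an assumed choice)
    return 1.0 / (1.0 + math.exp(-f))

# Example with three inputs, matching the input layer of Figure 1.
y = neuron_output(x=[0.5, -1.0, 2.0], w=[0.4, 0.3, 0.1], b=0.05)
```

With all-zero inputs and no bias, the sigmoid returns 0.5, its value at the origin.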


*Weather Nowcasting Using Deep Learning Techniques DOI: http://dx.doi.org/10.5772/intechopen.84552*


*Data Mining - Methods, Applications and Systems*


**Figure 1** shows a schematic diagram of a typical ANN. Each node in the figure represents a neuron, while the arrows represent the weights. The first layer is the input layer, and each node (neuron) of the input layer corresponds to a feature used for prediction. Thus, in **Figure 1**, there are three features used for prediction. The hidden layer sits between the input layer and the output layer. Its nodes take a set of weighted inputs defined by the transfer function in Eq. (1) and produce the output given by Eq. (2).

Depending on the number of hidden layers, artificial neural networks can be classified as either shallow neural networks or deep neural networks. In the former class (shallow neural networks), few hidden layers are used, while in the latter (deep neural networks), several hidden layers are used for better prediction accuracy. Examples of deep neural network architectures include multilayer perceptrons, convolutional neural networks and recurrent neural networks [2, 4, 9]. It is worth noting that both Elman neural networks and Jordan neural networks are examples of recurrent neural networks [4, 10].
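The difference between the two recurrent architectures used later in this chapter can be made concrete with a small sketch: in an Elman network the context (memory) units hold a copy of the previous hidden state, while in a Jordan network they hold a copy of the previous output. The following Python/NumPy fragment is illustrative only (the chapter's models use the R RSNNS package); the layer sizes, random weights and tanh activation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, context, W_in, W_ctx, W_out):
    """One forward step of a simple recurrent network:
    the hidden state combines the current input with a context vector."""
    h = np.tanh(W_in @ x + W_ctx @ context)  # hidden layer
    y = W_out @ h                            # output layer (linear)
    return h, y

# Illustrative dimensions: 6 inputs (as in Section 3), 4 hidden units, 1 output.
n_in, n_h, n_out = 6, 4, 1
W_in = rng.normal(size=(n_h, n_in))
W_out = rng.normal(size=(n_out, n_h))
x_t = rng.normal(size=n_in)

# Elman network: the context units copy the PREVIOUS HIDDEN state.
W_ctx_elman = rng.normal(size=(n_h, n_h))
h_prev = np.zeros(n_h)
h, y = rnn_step(x_t, h_prev, W_in, W_ctx_elman, W_out)
elman_context_next = h            # fed back at the next time step

# Jordan network: the context units copy the PREVIOUS OUTPUT instead.
W_ctx_jordan = rng.normal(size=(n_h, n_out))
y_prev = np.zeros(n_out)
h_j, y_j = rnn_step(x_t, y_prev, W_in, W_ctx_jordan, W_out)
jordan_context_next = y_j         # fed back at the next time step
```

The only structural difference between the two variants is which vector is fed back, which is why the two models often perform comparably on the same task.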

**Figure 1.**
*Schematic diagram of a typical ANN. It consists of an input layer, a hidden layer and an output layer. The input layer consists of three nodes, the hidden layer consists of four nodes, while the output layer consists of two nodes. Since the output layer has two nodes, this ANN is used for two-class (binary) classification. The arrows represent the weights.*

#### **2.2 Related work: weather forecasting using ANNs**

Different neural-network-based approaches to short-term weather forecasting have been proposed in the literature [1, 10, 11]. For example, Mellit *et al.* [12] proposed an artificial neural network model for predicting global solar radiation. The proposed model uses radial basis function networks, and takes sunshine duration and air temperature as inputs. It used 300 data points for training, while 65 data points were used for validation and testing. The authors reported that the best performance was obtained with one hidden layer containing nine neurons (nodes). In Ref. [13], the authors reported an adaptive neuro-fuzzy inference scheme (ANFIS)-based total solar radiation forecasting model that takes as inputs daily sunshine duration and mean ambient temperature. The data used in the study spanned a period of 10 years, from 1981 to 1990. The study reported a validation mean relative error of 1%, and the correlation coefficient obtained on the validation data set was reported to be 98%.

A solar radiation forecasting model based on meteorological data using artificial neural networks is reported in [8]. This algorithm uses meteorological data from the city of Dezful in Iran. Daily meteorological data from 2002 to 2005 is used to train the model, while 235 days' data from 2006 is used as testing data. The model takes as inputs the length of the day, daily mean air temperature, humidity and sunshine hours, and achieved an absolute testing error of 8.84%. Additionally, Ruffing and Venayagamoorthy [14] proposed a short-to-medium range solar irradiance prediction model using an echo state network (ESN), another variant of a recurrent neural network. The model reported in [14] is capable of predicting solar irradiance 30–270 minutes into the future. The correlation coefficient was used as a performance metric: for 30 minutes ahead predictions it was 0.87, while for 270 minutes ahead predictions it decreased to 0.48. Finally, Hossain *et al.* [15] reported the use of deep learning to forecast the weather in Nevada. Their proposed model uses deep neural networks with stacked auto-encoders to predict air temperature using pressure, humidity, wind speed and temperature as inputs. The data used was collected at an hourly interval from November 2013 to December 2014. The model achieved an accuracy of 97.94%.

A precipitation forecasting model using artificial neural networks was reported in [16]. This model is capable of estimating 6-hour rainfall over the south coast of Tasmania, Australia. The data used for this model consists of 1000 training examples, 300 validation examples and 300 test examples. The model achieved an accuracy of 84%. On the other hand, Shi *et al.* [17] reported a model for precipitation nowcasting which uses a convolutional long short-term memory (LSTM) network. Unlike the other models discussed above, this model uses radio detection and ranging (RADAR) data instead of meteorological data. The model obtained a correlation coefficient of 0.908 and a mean square error of 1.420.

As will be discussed in the next section, the abovementioned methods are limited compared to the method proposed in this chapter. One limitation is that they use fewer data instances than the method discussed in the next section. Another limitation is that they use fewer features (less than six). Finally, unlike the technique discussed in the next section, which integrates different deep learning architectures for both sunshine nowcasting and precipitation nowcasting, the techniques mentioned above either use only one neural network architecture or are designed for only one nowcasting task.

#### **3. Methodology for design and implementation of weather nowcasting models**

The forecasting models reported in this chapter are tested on hourly weather data from Lesotho for the period 01/01/2012 to 26/03/2012. This meteorological dataset consists of 2045 instances, and six features were used to make predictions. As opposed to the approaches discussed in Section 2.2 above, the method discussed in this chapter has two major advantages. First, it uses more data (2045 instances). Second, it is feature-rich, since it uses six features for short-term weather forecasting (more than the methods reported in Section 2.2 use). As a means of feature engineering, all the predictors (features) were plotted against one another in order to check that they are not linearly related; if any were, it would be sufficient to use one of the related predictors instead of all of them. These six features were therefore selected because they proved to be independent predictors. They are summarized in **Figure 2**. As can be observed from **Figure 2**, all six features form the nodes of the input layer of each of the three deep learning architectures (namely, MLP, ENN and Jordan recurrent neural networks). **Figure 3** summarizes the design of the method discussed in this chapter.

**Figure 2.**
*Summary of features that were used for weather forecasting.*

**Figure 3.**
*Summary of the method used for weather nowcasting tasks using deep learning architectures.*

The models were developed using the R statistical programming language [18–20], and the RSNNS package was used to implement the artificial neural networks [21]. The models created make use of a multilayer perceptron, an Elman recurrent neural network and a Jordan recurrent neural network. These models were then used to perform two weather nowcasting tasks, namely sunshine prediction and precipitation prediction. Additionally, each model was designed with a time lag of 1 hour (thereby allowing 1 hour ahead forecasting). Furthermore, from the collected meteorological data, 80% was used to train the models, 10% for validation, while the remaining 10% was used to test the models for accuracy. In order to enable reproducibility of the results, a seed was set to 2017 using the R command "set.seed(2017)".
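The data preparation just described (a 1-hour time lag and an 80/10/10 split) can be sketched as follows. This Python snippet is a hypothetical analogue of the R workflow described above, with synthetic stand-in data; the chapter does not state whether the split was made in temporal order or at random, so the chronological split here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2017)  # analogous to the chapter's set.seed(2017) in R

# Hypothetical hourly data: 2045 instances, six independent predictors.
n_instances, n_features = 2045, 6
X = rng.normal(size=(n_instances, n_features))
target = rng.normal(size=n_instances)      # e.g. sunshine or precipitation

# A time lag of 1 hour: predict hour t+1 from the features at hour t.
X_lagged = X[:-1]
y_lagged = target[1:]

# 80% training, 10% validation, 10% testing, in temporal order (an assumption).
n = len(X_lagged)
n_train = int(0.8 * n)
n_val = int(0.1 * n)
X_train, y_train = X_lagged[:n_train], y_lagged[:n_train]
X_val, y_val = X_lagged[n_train:n_train + n_val], y_lagged[n_train:n_train + n_val]
X_test, y_test = X_lagged[n_train + n_val:], y_lagged[n_train + n_val:]
```

Lagging costs one instance, so 2044 input-target pairs remain to be divided among the three subsets.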

#### **4. Results and discussion**

**Figures 4–6** show the MLP, Elman RNN and Jordan RNN sunshine forecasting models respectively. The black line is a fit for an ideal model, while the red line is a fit of the proposed model. As can be observed, the Jordan neural network model outperforms the other two models, while the multilayer perceptron model is the poorest of the three in sunshine nowcasting. Additionally, the performances of the Elman neural network model and the Jordan neural network model are comparable.

**Figures 7–9** compare the performances of the three neural network models in precipitation nowcasting. Once again, the black line is a fit for an ideal model, while the red line is a fit of the proposed model. It can be observed that, once again, the MLP has the lowest performance while the Jordan neural network model is the best-performing model. Also, the performances of the Elman neural network model and the Jordan neural network model are comparable.
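The "ideal fit" versus "model fit" comparison shown in Figures 4–9 can be quantified by regressing predictions on observations: an ideal model's fit line (the black line) has slope 1 and intercept 0, and a weaker model's red line deviates from that. The sketch below uses synthetic data purely for illustration; it is not the chapter's evaluation code.

```python
import numpy as np

def fit_line(observed, predicted):
    """Least-squares slope and intercept of predicted vs. observed values.
    An ideal model's fit line has slope 1 and intercept 0."""
    slope, intercept = np.polyfit(observed, predicted, deg=1)
    return slope, intercept

# Illustrative data: a near-ideal model and a noisier, biased one.
rng = np.random.default_rng(1)
observed = rng.uniform(0, 10, size=200)
good_model = observed + rng.normal(0, 0.2, size=200)
poor_model = 0.6 * observed + rng.normal(0, 2.0, size=200)

slope_good, _ = fit_line(observed, good_model)   # slope close to 1: near-ideal
slope_poor, _ = fit_line(observed, poor_model)   # slope well below 1: weaker fit
```

The closer the fitted slope is to 1 (with intercept near 0), the closer the model's red line lies to the ideal black line.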

**Figure 4.**
*MLP sunshine forecasting model.*

**Figure 5.**
*Elman RNN sunshine forecasting model.*

**Figure 6.**
*Jordan RNN sunshine forecasting model.*

**Figure 7.**
*MLP precipitation forecasting model.*

**Figure 8.**
*Elman RNN precipitation forecasting model.*

**Figure 9.**
*Jordan RNN precipitation forecasting model.*

Finally, the accuracies of the models were compared, and the results are shown in **Figure 10**. The figure shows that the MLP once again performs poorly compared to the Elman RNN and the Jordan RNN. Although these models achieve high accuracies individually, combining them (as an ensemble of models) might improve the accuracy even further. This possibility warrants further investigation.

**Figure 10.**
*Accuracies of different neural network models for weather nowcasting.*

#### **5. Conclusion**

In this chapter, we have reported the application of deep learning to short-term forecasting of Lesotho's weather. The deep learning models used are the multilayer perceptron, Elman recurrent neural networks and Jordan neural networks. These models were used to predict sunshine and precipitation. The high accuracies of these models in weather forecasting underline their utility. Thus, the high-accuracy results obtained in this work, coupled with the integrated nature of the technique reported, provide advantages over other approaches used for weather nowcasting. Future work will focus on improving the accuracy of weather nowcasting by using an ensemble of the stated deep learning models, instead of using them as individual models.
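As a sketch of the ensemble idea proposed for future work, the simplest combination averages the individual models' predictions. This is an illustrative Python fragment with made-up numbers (hypothetical hourly nowcasts from the three models), not part of the reported experiments.

```python
def ensemble_predict(predictions):
    """Average the predictions of several models (simple unweighted ensemble)."""
    return [sum(p) / len(p) for p in zip(*predictions)]

# Hypothetical hourly sunshine nowcasts from the three models:
mlp = [0.80, 0.10, 0.55]
elman = [0.90, 0.05, 0.60]
jordan = [0.88, 0.07, 0.62]
combined = ensemble_predict([mlp, elman, jordan])
```

Weighted averaging, with weights reflecting each model's validation accuracy, would be a natural refinement.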

#### **Acknowledgements**

Makhamisa Senekane and Benedict Molibeli Taele acknowledge support from the National University of Lesotho Research and Conferences Committee. Mhlambululi Mafu thanks his colleagues at Botswana International University of Science and Technology for their unwavering support and constructive discussions.


### **Conflict of interest**

The authors declare no conflict of interest.


### **Author details**

Makhamisa Senekane<sup>1</sup>\*, Mhlambululi Mafu<sup>2</sup> and Molibeli Benedict Taele<sup>1</sup>

1 Department of Physics and Electronics, National University of Lesotho, Roma, Lesotho

2 Department of Physics and Astronomy, Botswana International University of Science and Technology, Palapye, Botswana

\*Address all correspondence to: makhamisa12@gmail.com

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Yadav AK, Chandel S. Solar radiation prediction using artificial neural network techniques: A review. Renewable and Sustainable Energy Reviews. 2013;**33**:772-781

[2] Goodfellow I, Bengio Y, Courville A. Deep Learning. Massachusetts: MIT Press; 2016, Available from: http://www. deeplearningbook.org

[3] Nielsen M. Neural Networks and Deep Learning. California: Determination Press; 2015

[4] Lewis N. Deep Learning Made Easy with R: A Gentle Introduction For Data Science. California: CreateSpace Independent Publishing Platform; 2016

[5] Wasserman PD. Advanced Methods in Neural Computing. New Jersey: John Wiley & Sons, Inc; 1993

[6] Bishop CM. Pattern Recognition and Machine Learning. New York: Springer-Verlag; 2006

[7] Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach. Malaysia: Pearson Education Limited; 2016

[8] Ghanbarzadeh A, Noghrehabadi A, Assareh E, Behrang M. Solar radiation forecasting based on meteorological data using artificial neural networks. In: 2009 7th IEEE International Conference on Industrial Informatics (INDIN 2009). IEEE; 2009. pp. 227-231

[9] Schmidhuber J. Deep learning in neural networks: An overview. Neural Networks. 2015;**61**:85-117

[10] Lewis N. Neural Networks for Time Series Forecasting with R: Intuitive Step by Step for Beginners. California: CreateSpace Independent Publishing Platform; 2017

[11] Yadav AK, Malik H, Chandel S. Selection of most relevant input parameters using WEKA for artificial neural network based solar radiation prediction models. Renewable and Sustainable Energy Reviews. 2014;**31**:509-519

[12] Mellit A, Menghanem M, Bendekhis M. Artificial neural network model for prediction solar radiation data: Application for sizing stand-alone photovoltaic power system. In: IEEE Power Engineering Society General Meeting, 2005. IEEE; 2005. pp. 40-44

[13] Mellit A, Arab AH, Khorissi N, Salhi H. An ANFIS-based forecasting for solar radiation data from sunshine duration and ambient temperature. In: IEEE Power Engineering Society General Meeting, 2007. IEEE; 2007. pp. 1-6

[14] Ruffing SM, Venayagamoorthy GK. Short to medium range time series prediction of solar irradiance using an echo state network. In: 2009 15th International Conference on Intelligent System Applications to Power Systems (ISAP '09). IEEE; 2009. pp. 1-6

[15] Hossain M, Rekabdar B, Louis SJ, Dascalu S. Forecasting the weather of Nevada: A deep learning approach. In: 2015 International Joint Conference on Neural Networks (IJCNN). IEEE; 2015. pp. 1-6

[16] McCullagh J, Bluff K, Ebert E. A neural network model for rainfall estimation. In: Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems. IEEE; 1995. pp. 389-392

[17] Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems. New York: Curran Associates; 2015. pp. 802-810


*DOI: http://dx.doi.org/10.5772/intechopen.84552 Weather Nowcasting Using Deep Learning Techniques*


[18] Verzani J. Using R for Introductory Statistics. Florida: Chapman & Hall; 2005

[19] Kohl M. Introduction to Statistical Analysis with R. London: Bookboon; 2015

[20] Stowell S. Using R for Statistics. New York: Apress; 2014

[21] Bergmeir C, Benitez JM. Neural networks in R using the Stuttgart Neural Network Simulator: RSNNS. Journal of Statistical Software. 2012;**46**(7):1-26


*Data Mining - Methods, Applications and Systems*


#### **Chapter 8**

## Data Mining and Machine Learning for Software Engineering

*Elife Ozturk Kiyak*

#### **Abstract**

Software engineering is one of the research areas that can benefit most from data mining. Developers have attempted to improve software quality by mining and analyzing software data. In every phase of the software development life cycle (SDLC), a huge amount of data is produced, and design, security, or other software problems may occur. Analyzing software data in the early phases of software development helps to handle these problems and leads to more accurate and timely delivery of software projects. Various data mining and machine learning studies have been conducted to deal with software engineering tasks such as defect prediction, effort estimation, etc. This study shows the open issues and presents related solutions and recommendations in software engineering, applying data mining and machine learning techniques.

**Keywords:** software engineering tasks, data mining, text mining, classification, clustering

#### **1. Introduction**

In recent years, researchers in the software engineering (SE) field have turned their interest to data mining (DM) and machine learning (ML)-based studies since collected SE data can be helpful in obtaining new and significant information. Software engineering presents many subjects for research, and data mining can give further insight to support decision-making related to these subjects.

**Figure 1** shows the intersection of three main areas: data mining, software engineering, and statistics/math. A large amount of data is collected from organizations during software development and maintenance activities, such as requirement specifications, design diagrams, source codes, bug reports, program versions, and so on. Data mining enables the discovery of useful knowledge and hidden patterns from SE data. Math provides the elementary functions, and statistics determines probability, relationships, and correlation within collected data. Data science, in the center of the diagram, covers different disciplines such as DM, SE, and statistics.

**Figure 1.**
*The intersection of data mining and software engineering with other areas of the field.*

This study presents a comprehensive literature review of existing research and offers an overview of how to approach SE problems using different mining techniques. Up to now, review studies either introduce SE data descriptions [1], explain tools and techniques mostly used by researchers for SE data analysis [2], discuss the role of software engineers [3], or focus only on a specific problem in SE such as defect prediction [4], design pattern [5], or effort estimation [6]. Some existing review articles with the same target [7] are older, and some of them are not

comprehensive. In contrast to the previous studies, this article provides a systematic review of several SE tasks, gives a comprehensive list of available studies in the field, clearly states the advantages of mining SE data, and answers "how" and "why" questions in the research area.

The novelties and main contributions of this review paper are fivefold.

• First, it provides a general overview of several SE tasks that have been the focus of studies using DM and ML, namely, defect prediction, effort estimation, vulnerability analysis, refactoring, and design pattern mining.

• Second, it comprehensively discusses existing data mining solutions in software engineering according to various aspects, including methods (clustering, classification, association rule mining, etc.), algorithms (k-nearest neighbor (KNN), neural network (NN), etc.), and performance metrics (accuracy, mean absolute error, etc.).

• Third, it points to several significant research questions that are unanswered in the recent literature as a whole or the answers to which have changed with the technological developments in the field.

• Fourth, some statistics related to the studies between the years of 2010 and 2019 are given from different perspectives: according to their subjects and according to their methods.

• Fifth, it focuses on different machine learning types: supervised and unsupervised learning, especially on ensemble learning and deep learning.
This paper addresses the following research questions:

RQ1. What kinds of SE problems can ML and DM techniques help to solve?

RQ2. What are the advantages of using DM techniques in SE?

RQ3. Which DM methods and algorithms are commonly used to handle SE tasks?

RQ4. Which performance metrics are generally used to evaluate DM models constructed in SE studies?

RQ5. Which types of machine learning techniques (e.g., ensemble learning, deep learning) are generally preferred for SE problems?

RQ6. Which SE datasets are popular in DM studies?

The remainder of this paper is organized as follows. Section 2 explains the knowledge discovery process that aims to extract interesting, potentially useful, and nontrivial information from software engineering data. Section 3 provides an overview of current work on data mining for software engineering grouped under five tasks: defect prediction, effort estimation, vulnerability analysis, refactoring, and design pattern mining. In addition, some machine learning studies are divided into subgroups, including ensemble learning- and deep learning-based studies. Section 4 gives statistical information about the number of highly validated research studies conducted in the last decade. Related works considered as fundamental by journals with a highly positive reputation are listed, and the specific methods they used and their categories and purposes are clearly expressed. In addition, widely used datasets related to SE are given. Finally, Section 5 offers concluding remarks and suggests future scientific and practical efforts that might improve the efficiency of SE actions.

*Data Mining and Machine Learning for Software Engineering DOI: http://dx.doi.org/10.5772/intechopen.91448*

#### **2. Knowledge discovery from software engineering data**

This section basically explains the consecutive critical steps that should be followed to discover beneficial knowledge from software engineering data. It outlines the order of necessary operations in this process and explains how related data flows among them.

Software development life cycle (SDLC) describes a process to improve the quality of a product in project management. The main phases of SDLC are planning, requirement analysis, designing, coding, testing, and maintenance of a project. In every phase of software development, some software problems (e.g., software bugs, security, or design problems) may occur. Correcting these problems in the early phases leads to more accurate and timely delivery of the project. Therefore, software engineers broadly apply data mining techniques to different SE tasks to solve SE problems and to enhance programming efficiency and quality.

**Figure 2** presents the data mining and knowledge discovery process of SE tasks including data collection, data preprocessing, data mining, and evaluation. In the data collection phase, data are obtained from software projects such as bug reports, historical data, version control data, and mailing lists that include various information about the project's versions, status, or improvement. In the data preprocessing phase, the data are preprocessed after collection by using different methods such as feature selection (dimensionality reduction), feature extraction, missing data elimination, class imbalance analysis, normalization, discretization, and so on. In the next phase, DM techniques such as classification, clustering, and association rule mining are applied to discover useful patterns and relationships in software engineering data and therefore to solve a software engineering problem such as defected or vulnerable systems, reused patterns, or parts of code changes. Mining and obtaining valuable knowledge from such data prevents errors and allows software engineers to deliver the project on time. Finally, in the evaluation phase, validation techniques are used to assess the data mining results such as k-fold cross validation for classification. The commonly used evaluation measures are accuracy, precision, recall, F-score, area under the curve (AUC) for classification, and sum of squared errors (SSE) for clustering.

**Figure 2.**
*KDD process for software engineering.*
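The mining and evaluation phases described above can be sketched end to end. The following Python toy is illustrative only: the nearest-centroid "miner" and the synthetic metric data are stand-ins for a real DM algorithm and real SE data. It runs k-fold cross validation and reports mean accuracy:

```python
# Toy KDD evaluation phase: k-fold cross validation of a minimal
# nearest-centroid classifier on synthetic "module metrics" data.

def nearest_centroid_fit(X, y):
    """Data mining step: one centroid (per-feature mean) per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, X):
    """Assign each sample to the class with the closest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(x, centroids[lab])) for x in X]

def k_fold_accuracy(X, y, k=5):
    """Evaluation step: mean accuracy over k train/test splits."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        train_idx = [i for i in range(len(X)) if i not in test_idx]
        model = nearest_centroid_fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        preds = nearest_centroid_predict(model, [X[i] for i in test_idx])
        truth = [y[i] for i in test_idx]
        scores.append(sum(p == t for p, t in zip(preds, truth)) / len(truth))
    return sum(scores) / k

# Two well-separated synthetic classes (e.g., defective vs. non-defective):
X = [[i, i] for i in range(10)] + [[i + 100, i + 100] for i in range(10)]
y = [0] * 10 + [1] * 10
print(k_fold_accuracy(X, y, k=5))  # 1.0
```

In practice, the preprocessing steps named above (normalization, feature selection) would be applied inside each fold, so that no information from the test split leaks into training.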

#### **3. Data mining in software engineering**

In this review, we examine data mining studies in various SE tasks and evaluate commonly used algorithms and datasets.

#### **3.1 Data mining in defect prediction**

A defect means an error, failure, flaw, or bug that causes incorrect or unexpected results in a system [8]. A software system is expected to be free of defects, since software quality is reflected in the defect-free percentage of the product [9]. However, software projects often do not have enough time or people working on them to remove errors before a product is released. In such a situation, defect prediction methods can help to detect and remove defects in the initial stages of the SDLC and to improve the quality of the software product. In other words, the goal of defect prediction is to produce robust and effective software systems. Hence, software defect prediction (SDP) is an important topic for software engineering because early prediction of software defects can help to reduce development costs and produce more stable software systems.

Various studies have been conducted on defect prediction using different metrics such as code complexity, history-based metrics, object-oriented metrics, and process metrics to construct prediction models [10, 11]. These models can be built on a within-project or cross-project basis. In within-project defect prediction (WPDP), a model is constructed and applied on the same project [12]. The within-project strategy needs a large amount of historical defect data; hence, for new projects that do not have enough data to train on, the cross-project strategy may be preferred [13]. Cross-project defect prediction (CPDP) applies a prediction model built on one project to another, meaning that models are trained on historical data from other projects [14, 15]. Studies in the field of CPDP have increased in recent years [10, 16]. However, comparing prior studies is difficult because they cannot be replicated, owing to differences in the evaluation metrics used and in the way training data are prepared. Therefore, Herbold et al. [16] replicated different previously proposed CPDP methods to find which approach performed best in terms of metrics such as F-score, area under the curve (AUC), and Matthews correlation coefficient (MCC). The results showed that approaches proposed 7 or 8 years earlier may perform better. Another study [17] replicated prior work to examine whether the choice of classification technique matters. Both noisy and cleaned datasets were used, and the same overall results were obtained from the two; however, the cleaned dataset gave better results for some classification algorithms. For this reason, the authors claimed that the selection of classification techniques affects the performance of the model.
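To make the comparison metrics concrete, the F-score and Matthews correlation coefficient can be computed from confusion-matrix counts as below. This is an illustrative Python sketch; the counts are invented, not taken from the cited studies:

```python
import math

# F-score and MCC from confusion-matrix counts for a defect predictor:
# tp = defective modules correctly flagged, fp = clean modules flagged,
# fn = defective modules missed,            tn = clean modules passed.

def f_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Invented counts for one predictor on one release:
print(round(f_score(tp=40, fp=10, fn=20), 3))      # 0.727
print(round(mcc(tp=40, fp=10, fn=20, tn=130), 3))  # 0.63
```

Unlike plain accuracy, MCC remains informative on the class-imbalanced datasets typical of defect prediction, which is one reason replication studies report it.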

Numerous defect prediction studies have been conducted using DM techniques. In the following subsections, we will explain these studies in terms of whether they apply ensemble learning or not. Some defect prediction studies in SE are compared in **Table 1**.

**Table 1.**
*Comparison of ensemble learning-based defect prediction studies ([17]-[24], 2011-2016), listing for each study the reference, year, task (classification or clustering), objective, algorithms (e.g., NB, MLP, KNN, SVM, decision trees), ensemble learning method (e.g., bagging, boosting, random forest, AdaBoost.NC, average probability ensemble, voting), dataset (NASA, PROMISE, and Android releases), and evaluation metrics and results (e.g., 10-fold cross validation, AUC, MCC, G-mean).*

**Ref. Year Task**

**125**

[29] 2015 [30] 2015

[31] 2016

[32] 2016

[33] 2016

—

Weighted

support vector machine

(WLSTSVM)

misclassification

[34] 2016

[35] 2016

Classification

 A software defect prediction model to find faulty components

of a software

> [36] 2017

[37] 2017

Classification

 Analyze five popular ML

algorithms for software defect

prediction

Classification

 Propose an hybrid method called

A random

on two-step cluster (TSC)

ANN, PSO, DT, NB, LC

—

undersampling

 based

Stacking: DT, LR,

NASA MDP: i.e., CM1, KC1,

10-fold CV, AUC, (TSC-RUS + S) is the best

KC3, MC2, MW1, PC1, PC2,

PC3, PC4

Nasa and PROMISE datasets:

10-fold CV ANN < DT

CM1, JM1, KC1, KC2, PC1,

KC1-LC

kNN, NB

TSC-RUS + S

—

A learning techniques MONB,

MOBNN

multi-objective

 naive Bayes

NB, LR, DT, MODT, MOLR,

—

Jureczko datasets obtained from

AUC, Wilcoxon rank test

CP MO NB (0.72) produces

the highest value

ACC, ent filters, ACC 90%

PROMISE repository

MONB

Hybrid filter approaches

—

KC1, KC2, JM1, PC1, PC2, PC3,

and PC4 datasets

FISHER, MR, ANNIGMA.

 cost of DP

 to find

least-squares

 twin

SVM, NB, RF, LR, KNN, BN,

—

cost-sensitive

 neural network

Classification

 GA to select suitable source code

LR, ELM, SVML, SVMR, SVMP

—

30 open-source from PROMISE repository from

DS1 to DS30 PROMISE repository: CM1, KC1,

10-fold CV, PR, recall,

F-score, G-mean

Wilcoxon signed rank test

PC1, PC3, PC4, MC2, KC2, KC3

 software projects

5-fold CV, F-score, ACC,

pairwise t-test

metrics

Classification

 Authors proposed a model that

finds

fault-proneness

NB, LR, LivSVM, MLP, SGD,

SMO, VP, LR Logit Boost, Decision Stamp, RT, REP Tree

Classification

 To show the attributes that

NB, NN, association rules, DT

 Weighted voting rule

of the four algorithms

RF

Camel1.6, Tomcat 6.0, Ant 1.7,

10-fold CV, AUC

AUC = 0.661

*Data Mining and Machine Learning for Software Engineering*

*DOI: http://dx.doi.org/10.5772/intechopen.91448*

jEdit4.3, Ivy 2.0, arc, e-learning,

berek, forrest 0.8, zuzel,

Intercafe, and

Nieruchomosci

predict the defective state of

software modules

Classification

 Defect DM algorithms

identification

 by applying

NB, J48, MLP

—

PROMISE, NASA MDP dataset:

CM1, JM1, KC1, KC3, MC1, MC2,

MW1, PC1, PC2, PC3

NASA datasets: CM1, JM1, KC1,

PR, recall, ACC, F-score

NB > NN > DT

KC2, PC1

 **Objective**

**Algorithms**

**Ensemble learning**

 **Dataset**

**Evaluation**

**results**

10-fold CV, ACC, PR,

FMLP is the best

 **metrics and**


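The within-project/cross-project distinction can be made concrete with a small evaluation sketch. The two toy "projects", their metric values, and the 1-nearest-neighbour classifier below are invented for illustration; they are not drawn from any of the studies cited above.

```python
import math

# Toy data: each project maps a module's metric vector (LOC, complexity)
# to a defect label (1 = defective). All values are invented.
PROJECTS = {
    "ant":   [((120, 4), 0), ((900, 22), 1), ((150, 5), 0), ((700, 18), 1)],
    "camel": [((100, 3), 0), ((850, 20), 1), ((130, 6), 0), ((760, 25), 1)],
}

def predict_1nn(train, features):
    """Predict with the label of the nearest training module (Euclidean)."""
    best = min(train, key=lambda ex: math.dist(ex[0], features))
    return best[1]

def accuracy(train, test):
    hits = sum(predict_1nn(train, f) == label for f, label in test)
    return hits / len(test)

# WPDP: the first half of a project trains, the second half tests.
wpdp = accuracy(PROJECTS["ant"][:2], PROJECTS["ant"][2:])

# CPDP: one project trains, a different project tests entirely.
cpdp = accuracy(PROJECTS["ant"], PROJECTS["camel"])
```

On real data the cross-project score usually drops relative to the within-project score because metric distributions differ between projects; the toy data here is too clean to show that effect.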
#### *3.1.1 Defect prediction using ensemble learning techniques*

Ensemble learning combines several base learning models to obtain better performance than individual models. These base learners can be acquired with:

i. Different learning algorithms

ii. Different parameters of the same algorithm

iii. Different training sets


The commonly used ensemble techniques bagging, boosting, and stacking are shown in **Figure 3** and briefly explained here. Bagging (which stands for bootstrap aggregating) is a parallel ensemble: multiple training sets are generated from the original dataset by random sampling with replacement, each model is built independently on one of these bootstrap samples, and the outputs of the ensemble members are combined by a voting mechanism; the aim is to decrease variance. Boosting is a sequential ensemble: initially the same weight is assigned to every data instance; after each round of training, the weights of wrongly predicted instances are increased, and this process is repeated until the ensemble reaches its intended size. The final prediction uses a weighted voting scheme, and in this way boosting aims to decrease bias. Stacking combines the predictions of multiple models through a meta-classifier.
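The bagging procedure described above can be sketched in plain Python. The decision-stump base learner, the one-feature toy dataset, and the ensemble size of 25 are illustrative assumptions, not a setup from any cited study.

```python
import random
from collections import Counter

def train_stump(sample):
    """Pick the threshold on the single feature that best fits the sample."""
    best = None
    for threshold, _ in sample:
        hits = sum((x > threshold) == bool(y) for x, y in sample)
        if best is None or hits > best[1]:
            best = (threshold, hits)
    return best[0]

def bagging(data, n_models=25, seed=0):
    """Bootstrap-resample the training set and fit one stump per sample."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])
            for _ in range(n_models)]

def predict(models, x):
    """Combine the members by majority vote, as in bootstrap aggregating."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]

# Toy one-feature dataset: small values are non-defective (0), large are defective (1).
data = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
models = bagging(data)
```

Boosting would instead train the stumps sequentially, reweighting the examples each stump gets wrong; stacking would feed the stumps' predictions into a second-level classifier rather than taking a plain vote.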

Some software defect prediction studies have compared ensemble techniques to determine the best performing one [10, 18, 21, 39, 40]. In a study conducted by Wang et al. [18], different ensemble techniques such as bagging, boosting, random tree, random forest, random subspace, stacking, and voting were compared with each other and with a single classifier (NB). According to the results, voting and random forest clearly exhibited better performance than the others. In a different study [39], ensemble methods were compared with more than one base learner (NB, BN, SMO, PART, J48, RF, random tree, IB1, VFI, DT, NB tree); for boosted SMO, bagged J48, and boosted and bagged random tree, the performance of the base classifiers was lower than that of the ensemble classifiers.

**Figure 3.** *Common ensemble learning methods: (a) Bagging, (b) boosting, (c) stacking.*



In study [21], a new method combining feature selection and ensemble learning was proposed for defect classification. Results showed that random forests and the proposed algorithm are not affected by poor features, and the proposed algorithm outperforms existing single and ensemble classifiers in terms of classification performance. Another comparative study [10] used seven composite algorithms (Ave, Max, Bagging C4.5, Bagging naive Bayes (NB), Boosting J48, Boosting naive Bayes, and RF) and one composite state-of-the-art method for cross-project defect prediction. The Max algorithm yielded the best classification performance in terms of F-score.

Bowes et al. [40] compared the RF, NB, Rpart, and SVM algorithms to determine whether these classifiers obtain the same results. The results demonstrated that a unique subset of defects can be discovered by specific classifiers; however, whereas some classifiers are steady in the predictions they make, others change their predictions. As a result, ensembles whose decision-making does not rely on majority voting can perform best.

One of the main problems of SDP is the imbalance between the defect and non-defect classes of the dataset. Generally, the number of defective instances is far smaller than the number of non-defective instances in the collected data, which causes machine learning algorithms to perform poorly. Wang and Yao [19] compared five class-imbalanced learning methods (RUS, RUS-bal, THM, BNC, SMB) with the NB and RF algorithms and proposed a dynamic version of AdaBoost.NC. They utilized balance, G-mean, and AUC measures for comparison. Results showed that AdaBoost.NC and naive Bayes were better than the other seven algorithms in terms of these measures, and dynamic AdaBoost.NC showed a better defect detection rate and overall performance than the original AdaBoost.NC. To handle the class imbalance problem, other studies [20] have compared different methods (sampling, cost-sensitive, hybrid, and ensemble) using evaluation metrics such as MCC and the receiver operating characteristic (ROC).
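Random undersampling (RUS), one of the class-imbalance methods mentioned above, can be sketched in a few lines; the dataset below is invented for illustration.

```python
import random

def random_undersample(data, seed=0):
    """Randomly drop majority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [ex for ex in data if ex[1] == 1]
    majority = [ex for ex in data if ex[1] == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    kept = rng.sample(majority, len(minority))  # discard the rest
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# Ten non-defective modules (label 0) vs. two defective ones (label 1).
dataset = [(i, 0) for i in range(10)] + [(100, 1), (101, 1)]
balanced = random_undersample(dataset)
```

The cost of RUS is that discarded majority examples carry information; variants such as RUS-bal, or oversampling methods, trade this off differently.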

As shown in **Table 1**, the datasets most commonly used in defect prediction studies [17–19, 39] are the NASA MDP dataset and the PROMISE repository datasets. In addition, some studies utilized open-source projects such as Bugzilla, Columba, and Eclipse JDT [26, 27], and others used Android application data [22, 23].

#### *3.1.2 Defect prediction studies without ensemble learning*

Although the use of ensemble learning techniques has increased dramatically in recent years, studies that do not use ensemble learning are still conducted successfully. For example, in study [32], prediction models were created from source code metrics, as in the ensemble studies, but using different feature selection techniques such as the genetic algorithm (GA).
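A GA-based feature selection of the kind used in [32] can be sketched as follows. The fitness function (which simply rewards two designated "useful" metrics and penalizes the rest) and all parameters are invented stand-ins for a real classifier-based fitness.

```python
import random

def ga_select(n_features, fitness, generations=30, pop_size=12, seed=0):
    """Evolve bit-masks over features: keep the fittest, crossover, mutate."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)    # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_features)         # point mutation
            child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness: metrics 0 and 3 are informative, the rest add noise.
USEFUL = {0, 3}
def fitness(mask):
    return sum(1 if i in USEFUL else -1
               for i, bit in enumerate(mask) if bit)

best = ga_select(6, fitness)
```

In a real study the fitness of a mask would be the cross-validated score of a classifier trained on the selected metrics, which makes the search far more expensive than this toy.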

To overcome the class imbalance problem, Tomar and Agarwal [33] proposed a prediction system that assigns a lower misclassification cost to non-defective data samples and a higher cost to defective samples to balance the data distribution. When there is not enough data within a project, the required data can be obtained from cross projects; however, this may itself cause class imbalance. To solve this problem, Ryu and Baik [34] proposed multi-objective naive Bayes learning for cross-project environments. To obtain significant software metrics in cloud computing environments, Ali et al. [35] used a combination of filter and wrapper approaches. Other studies compared different machine learning algorithms such as NB, DT, and MLP [29, 37, 38, 41].
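The cost-sensitive idea above can be illustrated by choosing a decision threshold that minimizes expected misclassification cost. The 5:1 cost ratio, the scores, and the labels are invented for illustration, not taken from [33].

```python
# Assumed costs: missing a defect (false negative) costs 5x a false alarm.
FN_COST, FP_COST = 5.0, 1.0

# Invented (score, true label) pairs from some hypothetical classifier.
scored = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.8, 1), (0.9, 1)]

def total_cost(threshold):
    cost = 0.0
    for score, label in scored:
        predicted = int(score >= threshold)
        if predicted == 0 and label == 1:
            cost += FN_COST   # missed defect
        elif predicted == 1 and label == 0:
            cost += FP_COST   # false alarm
    return cost

# Sweep the candidate thresholds and keep the cheapest one.
best_threshold = min((t for t, _ in scored), key=total_cost)
```

Because missed defects are expensive, the chosen threshold sits low enough to catch every defective module at the price of one false alarm.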

#### **3.2 Data mining in effort estimation**


Software effort estimation (SEE) is critical for a company because hiring more employees than required will cause loss of revenue, while hiring fewer employees than necessary will result in delays in software project delivery. The estimation analysis helps to predict the amount of effort (in person-hours) needed to develop a software product. The basic steps of software estimation can be itemized as follows:

• Determine project objectives and requirements.

• Estimate product size and complexity.

• Design the activities.

• Compare and repeat estimates.


SEE covers requirements and testing activities in addition to predicting development effort [42]. Many research and review studies have been conducted in the field of SEE. Recently, a survey [43] analyzed effort estimation studies that concentrated on ML techniques and compared them with studies focused on non-ML techniques. According to the survey, case-based reasoning (CBR) and artificial neural network (ANN) were the most widely used techniques. In 2014, Dave and Dutta [44] examined existing studies that focus only on neural networks.

The current effort estimation studies using DM and ML techniques are available in **Table 2**. This table summarizes the prominent studies in terms of aspects such as year, data mining task, aim, datasets, and metrics. **Table 2** indicates that neural network is the most widely used technique for the effort estimation task.

Several studies have compared ensemble learning methods with single learning algorithms [45, 46, 48, 49, 51, 60] and examined them on cross-company (CC) and within-company (WC) datasets [50]. The authors observed that ensemble methods obtained by a proper combination of estimation methods achieved better results than single methods. Various ML techniques such as neural network, support vector machine (SVM), and k-nearest neighbor are commonly used as base classifiers for ensemble methods such as bagging and boosting in software effort estimation. Moreover, their results indicate that CC data can increase performance over WC data for estimation techniques [50].

In addition to the abovementioned studies, researchers have conducted studies without using ensemble techniques. The general approach is to investigate which DM technique has the best effect on performance in software effort estimation. For instance, Subitsha and Rajan [54] compared five different algorithms—MLP, RBFNN, SVM, ELM, and PSO-SVM—and Nassif et al. [57] investigated four neural network algorithms—MLP, RBFNN, GRNN, and CCNN. Although neural networks are widely used in this field, missing values and outliers frequently encountered in the training set adversely affect neural network results and cause inaccurate estimations. To overcome this problem, Khatibi et al. [53] split software projects into several groups based on their similarities. In their studies, the C-means clustering algorithm was used to determine the most similar projects and to decrease the impact of unrelated projects, and then analogy-based estimation (ABE) and NN were applied. Another clustering study by Azzeh and Nassif [59] combined SVM and bisecting k-medoids clustering algorithms; an estimation model was then built using RBFNN. The proposed method was trained on historical use case points (UCP).
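The analogy-based estimation (ABE) step described above can be sketched as follows. The historical projects, their features, and the effort values are invented, and similarity is plain Euclidean distance rather than the weighted or clustered similarity measures used in the cited studies.

```python
import math

# Historical projects: (size in KLOC, team size) -> actual effort in person-months.
# All numbers are invented for illustration.
HISTORY = [
    ((10, 3), 24.0),
    ((12, 4), 30.0),
    ((50, 10), 160.0),
    ((55, 12), 175.0),
]

def estimate_by_analogy(features, k=2):
    """ABE: average the known effort of the k most similar past projects."""
    ranked = sorted(HISTORY, key=lambda p: math.dist(p[0], features))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / k

# A new project of 11 KLOC with a team of 3 resembles the two small projects.
estimate = estimate_by_analogy((11, 3))
```

Khatibi et al.'s clustering step amounts to restricting `HISTORY` to the cluster most similar to the new project before applying the analogy, which reduces the influence of unrelated projects.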


**Table 2.** *Data mining and machine learning studies on the subject "effort estimation."* (Columns: Ref., Year, Task, Objective, Algorithms, Ensemble learning, Dataset, Evaluation metrics and results.)

Zare et al. [58] and Maleki et al. [55] utilized optimization methods for accurate cost estimation. In the former study, a model was proposed based on a Bayesian network with genetic algorithm and particle swarm optimization (PSO). The latter study used GA to optimize the weights of the effective factors and then trained the model with ant colony optimization (ACO). Besides conventional effort estimation studies, researchers have applied machine learning techniques to web applications; since web-based software projects differ from traditional projects, the effort estimation process for them is more complex.

It is observed that PRED(25) and MMRE are the most popular evaluation metrics in effort estimation. MMRE stands for the mean magnitude of relative error, and PRED(25) measures prediction accuracy as the percentage of predictions that fall within 25% of the actual values.
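A minimal implementation of these two metrics, applied to invented actual and predicted effort values:

```python
def mmre(actual, predicted):
    """Mean magnitude of relative error: mean of |actual - predicted| / actual."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def pred(actual, predicted, level=0.25):
    """PRED(l): fraction of predictions within `level` of the actual value."""
    within = [abs(a - p) / a <= level for a, p in zip(actual, predicted)]
    return sum(within) / len(within)

# Invented efforts in person-hours; the middle prediction is off by 30%.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 140.0, 410.0]
```

Here the relative errors are 0.1, 0.3, and 0.025, so MMRE is their mean and PRED(25) is 2/3 because only the middle prediction misses the 25% band.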

#### **3.3 Data mining in vulnerability analysis**

Vulnerability analysis is becoming the focal point of system security, aiming to prevent weaknesses in a software system that can be exploited by an attacker. Software vulnerability is defined in many different resources in different ways [61]. The most popular and widely used definition appears in the Common Vulnerabilities and Exposures (CVE) 2017 report as follows:

Vulnerability is a weakness in the computational logic found in software and some hardware components that, when exploited, results in a negative impact to confidentiality, integrity or availability.

Vulnerability analysis may require many different operations to identify defects and vulnerabilities in a software system. Vulnerabilities, which are a special kind of defect, are more critical than other defects because attackers exploit them to perform unauthorized actions. A defect is an ordinary problem that is encountered frequently in the system, found easily by users or developers, and fixed promptly, whereas vulnerabilities are subtle mistakes in large code bases [62, 63]. Wijayasekara et al. claim that some bugs have been identified as vulnerabilities only after being publicly announced in bug databases [64]; these bugs are called "hidden impact vulnerabilities" or "hidden impact bugs." The authors therefore proposed a hidden impact vulnerability identification methodology that utilizes text mining to determine which bugs in bug databases are actually vulnerabilities. In the proposed method, a bug report is taken as input, a feature vector is produced by applying text mining, and a classifier then decides whether the report describes a plain bug or a vulnerability. The results in [64] demonstrate that a large proportion of discovered vulnerabilities were first described as hidden impact bugs in public bug databases. While bug reports were taken as input in that study, in many other studies source code is taken as input; text mining is a highly preferred technique for obtaining features directly from source code, as in studies [65–69]. Several studies [63, 70] have compared text mining-based models with software metrics-based models.
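The report-to-feature-vector-to-classifier pipeline can be sketched as follows. The hard-coded vocabulary, the threshold, and the keyword-count "classifier" are crude stand-ins for the learned models used in [64], where both the vocabulary and the weights come from labeled bug databases.

```python
import re
from collections import Counter

# Assumed security-related vocabulary; a real model would learn this.
VULN_TERMS = {"overflow", "injection", "bypass", "privilege", "crash"}

def to_feature_vector(report):
    """Bag-of-words over the lowercased word tokens of a bug report."""
    return Counter(re.findall(r"[a-z]+", report.lower()))

def looks_like_vulnerability(report, threshold=2):
    """Flag the report when enough security-related terms occur."""
    vector = to_feature_vector(report)
    score = sum(count for term, count in vector.items() if term in VULN_TERMS)
    return score >= threshold

report = "Buffer overflow in parser allows privilege escalation and crash."
```

A report mentioning several such terms is flagged for review, while an ordinary UI bug report is not; the interesting cases in [64] are precisely the vulnerability reports whose wording looks ordinary.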

In the security area of software systems, several studies have been conducted related to DM and ML. Some of these studies are compared in **Table 3**, which shows the data mining task and explanation of the studies, the year they were performed, the algorithms that were used, the type of vulnerability analysis, evaluation metrics, and results. In this table, the best performing algorithms according to the evaluation criteria are shown in bold.

Vulnerability analysis can be categorized into three types: static vulnerability analysis, dynamic vulnerability analysis, and hybrid analysis [61, 80]. Many studies have applied the static analysis approach, which detects vulnerabilities from source code without executing the software, since it is cost-effective.
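As a minimal flavor of the static approach, the illustrative check below inspects Python source without executing it, flagging calls to `eval` or `exec` via the standard `ast` module. This is a toy assumption of what such a checker might look like; real static analyzers model data flow, taint propagation, and far more.

```python
import ast

# Calls considered dangerous in this toy example.
DANGEROUS = {"eval", "exec"}

def flag_dangerous_calls(source):
    """Return (line, name) pairs for dangerous calls found in `source`,
    without ever running the analyzed code (static analysis)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS):
            findings.append((node.lineno, node.func.id))
    return findings

code = "x = 1\ny = eval(input())\n"
print(flag_dangerous_calls(code))  # [(2, 'eval')]
```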

| Ref | Year | Task | Objective | Algorithms | Type | Dataset description | Evaluation metrics and results |
|------|------|------|-----------|------------|------|---------------------|-------------------------------|
| [71] | 2011 | Clustering | Obtaining software vulnerabilities based on RDBC | RDBC | Static | Database is built by RD-Entropy | FNR, FPR |
| [42] | 2011 | Classification/regression | To predict the time to next vulnerability | LR, LMS, MLP, RBF, SMO | Static | NVD, CPE, CVSS | CC, RMSE, RRSE |
| [65] | 2012 | Text mining | Analysis of source code as text | RBF, SVM | Static | K9 email client for the Android platform | 10-fold CV, TPD, ACC, PR, KAPPA; ACC = 90.8%, PR = 92%, KAPPA = 81% |
| [64] | 2012 | Classification/text mining | To identify vulnerabilities in bug databases | — | Static | Linux kernel MITRE CVE and MySQL bug databases | BDR, TPR, FPR; 32% (Linux) and 62% (MySQL) of vulnerabilities |
| [72] | 2014 | Classification/regression | Combine taint analysis and data mining to obtain vulnerabilities | ID3, C4.5/J48, RF, RT, KNN, NB, Bayes Net, MLP, SVM, LR | Hybrid | A version of WAP to collect the data | 10-fold CV; metrics: ER-BCE, ER-BPP, ER-AVG |
| [73] | 2014 | Clustering | Identify vulnerabilities from source codes using CPG | — | Static | Neo4J and InfiniteGraph databases | — |
| [63] | 2014 | Classification | Comparison of text mining and software metrics models | RF | Static | Vulnerabilities from open-source web apps (Drupal, Moodle, PHPMyAdmin) | 3-fold CV, recall, IR, PR, FPR, ACC; text mining provides benefits overall |
| [69] | 2014 | Classification | To create a model in the form of a binary classifier using text mining | NB, RF | Static | Android applications from the F-Droid repository | 10-fold CV, PR, recall; PR and recall ≥ 80% |
| [74] | 2015 | Classification | A new approach to obtain potentially dangerous codes (VCCFinder) | SVM-based model | — | The database contains 66 GitHub projects | k-fold CV; false alarms < 99% at the same level of recall |
| [70] | 2015 | Ranking/classification | Comparison of software metrics with text mining | — | Static | Vulnerabilities from open-source web apps (Drupal, Moodle, PHPMyAdmin) | ACC, PR, recall; ACC = 0.87, PR = 0.85, recall = 0.88 |
| [75] | 2015 | Clustering | Search patterns for taint-style vulnerabilities in C code | Hierarchical clustering (complete-linkage) | Static | 5 open-source projects: Linux, OpenSSL, Pidgin, VLC, Poppler (Xpdf) | Correct source, correct sanitization, number of traversals, generation time, execution time; amount of code to review reduced < 95% |


#### *Data Mining and Machine Learning for Software Engineering DOI: http://dx.doi.org/10.5772/intechopen.91448*

Zare et al. [58] and Maleki et al. [55] utilized optimization methods for accurate cost estimation. In the former study, a model was proposed based on a Bayesian network combined with a genetic algorithm and particle swarm optimization (PSO). The latter study used a GA to optimize the weights of the effective factors, which were then trained with ant colony optimization (ACO). Besides conventional effort estimation studies, researchers have utilized machine learning techniques for web applications. Since web-based software projects are different from traditional projects, the effort estimation process for these studies is more complex.

It is observed that PRED(25) and MMRE are the most popular evaluation metrics in effort estimation. MMRE stands for the mean magnitude of relative error, and PRED(25) measures prediction accuracy as the percentage of predictions that fall within 25% of actual values.
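Both metrics can be computed directly from paired actual and predicted effort values; the numbers below are made up for illustration.

```python
def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean of |actual - predicted| / actual."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def pred(actual, predicted, level=0.25):
    """PRED(25): fraction of predictions within 25% of the actual value."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= level)
    return hits / len(actual)

# Made-up effort values (e.g., person-hours) for four projects.
actual = [100.0, 80.0, 120.0, 60.0]
predicted = [110.0, 60.0, 126.0, 30.0]
print(round(mmre(actual, predicted), 3))  # 0.225
print(pred(actual, predicted))            # 0.75
```

Lower MMRE and higher PRED(25) indicate a better effort estimator; here one of the four predictions (30 vs. 60) misses the 25% band.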

*Data Mining - Methods, Applications and Systems*

#### **3.3 Data mining in vulnerability analysis**


#### **Table 3.**

*Data mining and machine learning studies on the subject "vulnerability analysis."*

1. Creational patterns provide an object creation mechanism to create the necessary objects based on predetermined conditions. They allow the system to call the appropriate object and add flexibility to the system when objects are created. Some creational design patterns are factory method, abstract factory, builder, and singleton.

2. Structural patterns focus on the composition of classes and objects to allow the establishment of larger software groups. Some of the structural design patterns are adapter, bridge, composite, and decorator.

Few studies have performed the dynamic analysis approach, in which one must execute the software and check program behavior. The hybrid analysis approach [72, 76] combines these two approaches.

As revealed in **Table 3**, in addition to classification and text mining, clustering techniques are also frequently seen in software vulnerability analysis studies. To detect vulnerabilities in an unknown software data repository, entropy-based density clustering [71] and complete-linkage clustering [75] were proposed. Yamaguchi et al. [73] introduced a model to represent a large number of source codes as a graph called the code property graph (CPG), a combination of the abstract syntax tree, CFG, and program dependency graph (PDG). This model enabled the discovery of previously unknown (zero-day) vulnerabilities.

To learn the time to next vulnerability, a prediction model was proposed in the study [42]. The result could be a number that refers to days or a bin representing values in a range. The authors used regression and classification techniques for the former and latter cases, respectively.

In vulnerability studies, issue tracking systems like Bugzilla, code repositories like Github, and vulnerability databases such as NVD, CVE, and CWE have been utilized [79]. In addition to these datasets, some studies have used Android [65, 68, 69] or web [63, 70, 72] (PHP source code) datasets. In recent years, researchers have concentrated on deep learning for building binary classifiers [77], obtaining vulnerability patterns [78], and learning long-term dependencies in sequential data [68] and features directly from the source code [81].

Li et al. [78] note two difficulties of vulnerability studies: demanding, intense manual labor and high false-negative rates. Thus, the widely used evaluation metrics in vulnerability analysis are false-positive rate and false-negative rate.
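Both rates fall out of the binary confusion matrix; a small illustrative computation follows (the label vectors are invented).

```python
def rates(y_true, y_pred):
    """False-positive and false-negative rates for binary labels
    (1 = vulnerable, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # clean code wrongly flagged
    fnr = fn / (fn + tp) if fn + tp else 0.0  # vulnerabilities missed
    return fpr, fnr

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
print(rates(y_true, y_pred))  # (0.2, 0.3333333333333333)
```

A high FNR is the costlier failure mode here, since each false negative is a vulnerability that ships undetected.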

#### **3.4 Data mining in design pattern mining**

During the past years, software developers have used design patterns to create complex software systems. Thus, researchers have investigated the field of design patterns in many ways [82, 83]. Fowler defines a pattern as follows:

"*A pattern is an idea that has been useful in one practical context and will probably be useful in others.*" [84]

Patterns display relationships and interactions between classes or objects. Well-designed object-oriented systems have various design patterns integrated into them. Design patterns can be highly useful for developers when they are used in the right manner and place; thus, developers avoid recreating methods previously refined by others. The pattern approach was initially presented in 1994 by four authors, Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, known as the Gang of Four (GoF) [85]. According to the authors, there are three types of design patterns:


**Table 3** *(continued).*

| Ref | Year | Task | Objective | Algorithms | Type | Dataset description | Evaluation metrics and results |
|------|------|------|-----------|------------|------|---------------------|-------------------------------|
| [76] | 2016 | Classification | Static and dynamic features for classification | LR, MLP, RF | Hybrid | Dataset was created by analyzing 1039 test cases from the Debian Bug Tracker | FPR, FNR; detects 55% of vulnerable programs |
| [77] | 2017 | Classification | 1. Employ a deep neural network; 2. Combine N-gram analysis and feature selection to analyze software source files | Deep neural network | — | Feature extraction from 4 applications (BoardGameGeek, CoolReader, Connectbot, AnkiDroid) | 10 times 5-fold CV; ACC = 92.87%, PR = 94.71%, recall = 90.17% |
| [67] | 2017 | Text mining | Obtaining characteristics of vulnerabilities | — | — | CVE, CWE, NVD databases | PR, recall, F-score; LSI > SVM |
| [68] | 2017 | Text mining | Deep learning (LSTM) is used to learn semantic and syntactic features in code | RNN, LSTM, DBN (Deep Belief Network) | — | Experiments on 18 Java applications from the Android OS platform | 10-fold CV, PR, recall, F-score; PR, recall, and F-score > 80% |
| [66] | 2018 | Classification | Identify bugs by extracting text features from C source code | NB, KNN, K-means, NN, SVM, DT, RF | Static | NVD; Cat, Cp, Du, Echo, Head, Kill, Mkdir, Nl, Paste, Rm, Seq, Shuf, Sleep, Sort, Tail, Touch, Tr, Uniq, Wc, Whoami | 5-fold CV, ACC, TP, TN; ACC = 74% |
| [78] | 2018 | Regression | A deep learning-based vulnerability detection system (VulDeePecker) | BLSTM NN | Static | NIST: NVD and SAR project | 10-fold CV, PR, recall, F-score; F-score = 80.8% |
| [79] | 2018 | Classification | A mapping between existing requirements and vulnerabilities | LR, SVM, NB | — | Data gathered from Apache Tomcat; requirements from Bugzilla, vulnerabilities from CVE, source code collected from Github | PR = 70%, recall = 60% |


3. Behavioral patterns determine common communication patterns between objects and how multiple classes behave when performing a task. Some behavioral design patterns are command, interpreter, iterator, observer, and visitor.
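As a concrete taste of a behavioral pattern, here is a minimal observer sketch in Python; the build-notification scenario is invented for illustration.

```python
class Subject:
    """Minimal observer pattern: a subject notifies registered observers."""
    def __init__(self):
        self._observers = []

    def attach(self, callback):
        self._observers.append(callback)

    def notify(self, event):
        # Each observer reacts to the event without the subject knowing
        # anything about the observers' internals.
        for callback in self._observers:
            callback(event)

log = []
build = Subject()
build.attach(lambda e: log.append(f"mail: {e}"))
build.attach(lambda e: log.append(f"dashboard: {e}"))
build.notify("build failed")
print(log)  # ['mail: build failed', 'dashboard: build failed']
```

The subject stays decoupled from its observers, which is exactly the communication structure the pattern prescribes.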

Many design pattern studies exist in the literature. **Table 4** shows some design pattern mining studies related to machine learning and data mining. This table contains the aim of the study, mining task, year, and design patterns selected by the study, input data, dataset, and results of the studies.

In design pattern mining, detecting the design pattern is a frequent study objective. To do so, studies have used machine learning algorithms [87, 89–91], ensemble learning [95], deep learning [97], graph theory [94], and text mining [86, 95].

In study [91], the training dataset consists of 67 object-oriented (OO) metrics extracted by using the JBuilder tool. The authors used LRNN and decision tree techniques for pattern detection. Alhusain et al. [87] generated training datasets from existing pattern detection tools. The ANN algorithm was selected for pattern instances. Chihada et al. [90] created training data from pattern instances using 45 OO metrics. The authors utilized SVM for classifying patterns accurately. Another metrics-oriented dataset was developed by Dwivedi et al. [93]. To evaluate the results, the authors benefited from three open-source software systems (JHotDraw, QuickUML, and JUnit) and applied three classifiers, SVM, ANN, and RF. The advantage of using random forest is that it does not require linear features and can manage high-dimensional spaces.
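The metrics-to-classifier setup used in these studies can be illustrated schematically. The metric triples, labels, and nearest-centroid rule below are stand-ins invented for the sketch; the cited works use 45 to 67 OO metrics and SVM/ANN/RF classifiers.

```python
import math

# Each class is described by a few invented OO metrics
# (methods, couplings, subclasses) and labeled with a pattern role.
TRAIN = [
    ((1, 0, 5), "abstract_factory"),
    ((2, 1, 4), "abstract_factory"),
    ((3, 6, 0), "adapter"),
    ((2, 5, 1), "adapter"),
]

def centroids(samples):
    """Average the metric vectors per label."""
    sums = {}
    for vec, label in samples:
        acc, n = sums.get(label, ([0.0] * len(vec), 0))
        sums[label] = ([a + v for a, v in zip(acc, vec)], n + 1)
    return {lbl: tuple(a / n for a in acc) for lbl, (acc, n) in sums.items()}

def predict(vec, cents):
    """Assign the label whose centroid is closest in metric space."""
    return min(cents, key=lambda lbl: math.dist(vec, cents[lbl]))

cents = centroids(TRAIN)
print(predict((2, 6, 0), cents))  # adapter-like metric profile
```

The point is only the data shape: rows of metric values with a pattern label, fed to any standard classifier.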

To evaluate methods and to find patterns, open-source software projects such as JHotDraw, Junit, and MapperXML have been generally preferred by researchers. For example, Zanoni et al. [89] developed a tool called MARPLE-DPD by combining graph matching and machine learning techniques. Then, to obtain five design patterns, instances were collected from 10 open-source software projects, as shown in **Table 4**.

Design patterns and code smells are related issues: a code smell refers to symptoms in code, and if a software system contains code smells, its design patterns are not well constructed. Therefore, Kaur and Singh [96] checked whether design pattern and smell pairs appear together in code by using the J48 decision tree. Their results showed that the singleton pattern had no presence of bad smells.

According to the studies summarized in the table, the most frequently used patterns are abstract factory and adapter. It has recently been observed that studies on ensemble learning in this field are increasing.

#### **3.5 Data mining in refactoring**

One of the SE tasks most often used to improve the quality of a software system is refactoring, which Martin Fowler has described as "a technique for restructuring an existing body of code, altering its internal structure without changing its external behavior" [98]. It improves readability and maintainability of the source code and decreases complexity of a software system. Some of the refactoring types are: Add Parameter, Replace Parameter, Extract method, and Inline method [99].
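Fowler's Extract Method can be shown in miniature; the invoice example is invented, and the assertion checks that external behavior is unchanged, which is the defining property of a refactoring.

```python
# Before: one function mixes computing and formatting an invoice total.
def invoice_before(items):
    total = sum(qty * price for qty, price in items)
    return f"TOTAL: {total:.2f}"

# After Extract Method: the total computation is pulled into its own
# function; the observable behavior is identical.
def compute_total(items):
    return sum(qty * price for qty, price in items)

def invoice_after(items):
    return f"TOTAL: {compute_total(items):.2f}"

items = [(2, 3.50), (1, 4.25)]
assert invoice_before(items) == invoice_after(items)  # behavior preserved
print(invoice_after(items))  # TOTAL: 11.25
```

The internal structure improves (the total is now reusable and testable on its own) while callers see no difference.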

Code smell and refactoring are closely related to each other: code smells represent problems due to bad design and can be fixed during refactoring. The main challenge is to determine which part of the code needs refactoring.

Some data mining studies related to software refactoring are presented in **Table 5**. Some studies focus on historical data to predict refactoring [100] or to obtain both refactoring and software defects [101] using different data mining algorithms such as LMT, Rip, and J48.

| Ref | Year | Task | Objective | Algorithms | EL | Selected design patterns | Input data | Dataset | Evaluation metrics and results |
|------|------|------|-----------|------------|----|--------------------------|------------|---------|-------------------------------|
| [86] | 2012 | Text classification | Two-phase method: (1) text categorization, (2) learning to classify design patterns | NB, KNN, DT, SVM | — | Security, Douglass, GoF | Documents | 46 security patterns, 34 Douglass patterns, 23 GoF patterns | PR, recall, EWM; PR = 0.62, recall = 0.75 |
| [87] | 2013 | Regression | An approach to find whether a candidate class is a valid instance of a DP or not | ANN | — | Adapter, command, composite, decorator, observer, and proxy | Set of candidate classes | Open-source projects: YARI, Zest, JUnit, JFreeChart, ArgoUML | 10-fold CV, PR, recall |
| [88] | 2014 | Graph mining | Sub-graph mining-based approach | CloseGraph | — | — | — | — | No empirical comparison |
| [89] | 2015 | Classification/clustering | MARPLE-DPD is developed to classify whether an instance is a bad or good instance | SVM, DT, RF, K-means, ZeroR, OneR, NB, JRip, CLOPE | — | Singleton and adapter (classification); composite, decorator, and factory method (clustering) | — | 10 open-source software systems: DPExample, QuickUML 2001, Lexi v0.1.1 alpha, JRefactory v2.6.24, Netbeans v1.0.x, JUnit v3.7, JHotDraw v5.1, MapperXML v1.9.7, Nutch v0.4, PMD v1.8 | 10-fold CV, ACC, F-score, AUC; ACC ≥ 85% |
| [90] | 2015 | Regression | A new method (SVM-PHGS) is proposed | Simple Logistic, C4.5, KNN, SVM, SVM-PHGS | — | Adapter, builder, composite, factory method, iterator, observer | Source code | P-Mart repository | PR, recall, F-score, FP; PR = 0.81, recall = 0.81, F-score = 0.81, FP = 0.038 |
| [91] | 2016 | Classification | Design pattern recognition using ML algorithms | LRNN, DT | — | Abstract factory, adapter | Source code | Dataset with 67 OO metrics, extracted by JBuilder tool | 5-fold CV, ACC, PR, recall, F-score; ACC = 100% by LRNN |


**Table 4.** *Data mining and machine learning studies on the subject "design pattern mining."*

| Ref | Year | Task | Objective | Algorithms | EL | Dataset | Evaluation metrics and results |
|------|------|------|-----------|------------|----|---------|-------------------------------|
| [100] | 2007 | Regression | Stages: (1) data understanding, (2) preprocessing, (3) ML, (4) post-processing, (5) analysis of the results | J48, LMT, Rip, NNge | — | ArgoUML, Spring Framework | 10-fold CV, PR, recall, F-score; PR and recall are 0.8 for ArgoUML |
| [101] | 2008 | Classification | Finding the relationship between refactoring and defects | C4.5, LMT, Rip, NNge | — | AntApache, ArgoUML, JBoss Cache, Liferay Portal, Spring Framework, XDoclet | PR, recall, F-score |
| [102] | 2014 | Regression | A technique to predict refactoring at class level | LS-SVM with RBF kernel; PCA, SMOTE | — | Seven open-source software systems from the tera-PROMISE Repository | 10-fold CV, AUC, ROC curves; RBF kernel outperforms linear and polynomial kernels; mean AUC for LS-SVM RBF kernel is 0.96 |
| [103] | 2015 | Regression | Removing defects with time series in a multi-objective approach | Multi-objective algorithm based on NSGA-II, ARIMA | — | Xerces-J, JFreeChart, GanttProject, JHotDraw, and Rhino | Wilcoxon test with a 99% confidence level (α = 0.01) |
| [104] | 2016 | Web mining/clustering | Unsupervised learning approach to detect refactoring opportunities in service-oriented applications | PAM, K-means, COBWEB, X-Means | — | Two datasets of WSDL documents | COBWEB and K-means max. 83.33% and 0% inter-cluster; COBWEB and K-means min. 33.33% and 66.66% intra-cluster |
| [105] | 2017 | Clustering | A novel algorithm (HASP) for software Modularization Quality and refactoring at the package level | Hierarchical clustering algorithm | — | Three open-source case studies | Evaluation Metric Function |
| [99] | 2017 | Classification | Propose GA-based learning for software refactoring based on ANN | GA, ANN | — | Hibernate, FindBugs, JFreeChart, Pixelitor, and JDI-Ford | Wilcoxon rank sum test with a 99% confidence level (α < 1%) |

**Table 4** *(continued).*

| Ref | Year | Task | Objective | Algorithms | EL | Selected design patterns | Input data | Dataset | Evaluation metrics and results |
|------|------|------|-----------|------------|----|--------------------------|------------|---------|-------------------------------|
| [92] | 2016 | Classification | Three aspects: design patterns, software metrics, and supervised learning methods | Layer Recurrent Neural Network (LRNN) | RF | Abstract factory, adapter, bridge, singleton, and template method | Source code | Dataset with 67 OO metrics, extracted by JBuilder tool | PR, recall, F-score; F-score = 100% by LRNN and RF; ACC = 100% by RF |
| [93] | 2017 | Classification | (1) Creation of a metrics-oriented dataset; (2) detection of software design patterns | ANN, SVM | RF | Abstract factory, adapter, bridge, composite, and template | Source code | Metrics extracted from source codes (JHotDraw, QuickUML, and JUnit) | 5-fold and 10-fold CV, PR, recall, F-score; ANN, SVM, and RF yielded 100% PR for JHotDraw |
| [94] | 2017 | Classification | Detection of design motifs based on a set of directed semantic graphs | Strong graph simulation, graph matching | — | All three groups: creational, structural, behavioral | UML class diagrams | — | PR, recall; high accuracy by the proposed method |
| [95] | 2017 | Text categorization | Selection of more appropriate design patterns | Fuzzy c-means | Ensemble-IG | Various design patterns | Problem definitions of design patterns | DP, GoF, Douglass, Security; Eclipse plugin Web of Patterns | F-score |
| [96] | 2018 | Classification | Finding design pattern and smell pairs which coexist in the code | J48 | — | Adapter, bridge, template, singleton | Source code | The tool selected for code smell detection is iPlasma | PR, recall, F-score, PRC, ROC; singleton pattern shows no presence of bad smells |



increases, the number of software defects decreases, and thus refactoring has a

While automated refactoring does not always give the desired result, manual refactoring is time-consuming. Therefore, one study [109] proposed a clusteringbased recommendation tool by combining multi-objective search and unsupervised learning algorithm to reduce the number of refactoring options. At the same time, the number of refactoring that should be selected is decreasing with the help of the

Since many SE studies that apply data mining approaches exist in the literature,

") for defect prediction, (

**Figure 5** shows the publications studied in classification, clustering, text mining, and association rule mining as a percentage of the total number of papers obtained by a Scopus query for each SE task. For example, in defect prediction, the number of studies is 339 in the field of classification, 64 in clustering, 8 in text mining, and 25 in the field of association rule mining. As can be seen from the pie charts, while clustering is a popular DM technique in refactoring, no study related to text mining is found in this field. In other SE tasks, the preferred technique is classification,

Defect prediction generally compares learning algorithms in terms of whether they find defects correctly using classification algorithms. Besides this approach, in some studies, clustering algorithms were used to select futures [110] or to compare supervised and unsupervised methods [27]. In the text mining area, to extract features from scripts, TF-IDF techniques were generally used [111, 112]. Although many different algorithms have been used in defect prediction, the most popular

*Number of publications of data mining studies for SE tasks from Scopus search by their years.*

" OR

") for effort estimation, (

") for refactoring. As seen in the figure, the number of studies

") for vulnerability analysis, and (

"defect detection

"effort estimation

" OR

"vulnerab\*

" OR

"software

"bug

" AND

"

this article presents only a few of them. However, **Figure 4** shows the current number of papers obtained from the Scopus search engine for each year from 2010 to 2019 by using queries in the title/abstract/keywords field. We extracted publications in 2020 since this year has not completed yet. Queries included ("data mining"

"defect prediction

using data mining in SE tasks, especially defect prediction and vulnerability analysis, has increased rapidly. The most stable area in the studies is design

positive effect on software quality.

*Data Mining and Machine Learning for Software Engineering*
*DOI: http://dx.doi.org/10.5772/intechopen.91448*

increases, the number of software defects decreases, and thus refactoring has a positive effect on software quality.

While automated refactoring does not always give the desired result, manual refactoring is time-consuming. Therefore, one study [109] proposed a clustering-based recommendation tool that combines multi-objective search with an unsupervised learning algorithm to reduce the number of refactoring options. At the same time, the number of refactorings that should be selected decreases with the help of the developer's feedback.
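As a hedged sketch of such a clustering-based reduction (the GMM/EM setting of [109]), the snippet below groups hypothetical refactoring options and surfaces one representative per cluster; the feature encoding and data are illustrative assumptions, not the authors' actual tool.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical feature vectors for 60 candidate refactoring options, e.g.
# (quality gain, estimated effort) scores from a multi-objective search.
options = rng.normal(size=(60, 2)) + rng.choice([0.0, 4.0, 8.0], size=(60, 1))

# Group similar options with a Gaussian mixture fitted by EM, as in [109].
gmm = GaussianMixture(n_components=3, random_state=0).fit(options)
labels = gmm.predict(options)

# Recommend one representative per cluster: the option closest to its
# cluster mean, so the developer reviews a handful instead of all 60.
representatives = {}
for k in np.unique(labels):
    members = np.flatnonzero(labels == k)
    dists = np.linalg.norm(options[members] - gmm.means_[k], axis=1)
    representatives[k] = int(members[np.argmin(dists)])
print(sorted(representatives.values()))
```

Feeding the developer's accept/reject feedback back into the feature weights, as the study describes, would further shrink this shortlist over time.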

#### **4. Discussion**

Since many SE studies that apply data mining approaches exist in the literature, this article presents only a few of them. However, **Figure 4** shows the number of papers obtained from the Scopus search engine for each year from 2010 to 2019, using queries on the title/abstract/keywords field. We excluded publications from 2020 since that year was not yet complete. Queries combined ("data mining" OR "machine learning") with ("defect prediction" OR "defect detection" OR "bug prediction" OR "bug detection") for defect prediction, ("effort estimation" OR "effort prediction" OR "cost estimation") for effort estimation, ("vulnerab\*" AND "software" OR "vulnerability analysis") for vulnerability analysis, and ("software" AND "refactoring") for refactoring. As seen in the figure, the number of studies using data mining in SE tasks, especially defect prediction and vulnerability analysis, has increased rapidly. The most stable area in the studies is design pattern mining.
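The four queries above can be reassembled as plain strings; wrapping them in TITLE-ABS-KEY(...) mirrors a title/abstract/keywords search, though the exact Scopus field syntax used is an assumption here.

```python
# Base data-mining clause shared by all four task queries.
dm = '("data mining" OR "machine learning")'

# Task-specific clauses, copied verbatim from the survey's description.
task_terms = {
    "defect prediction": '("defect prediction" OR "defect detection" OR '
                         '"bug prediction" OR "bug detection")',
    "effort estimation": '("effort estimation" OR "effort prediction" OR '
                         '"cost estimation")',
    "vulnerability analysis": '("vulnerab*" AND "software" OR '
                              '"vulnerability analysis")',
    "refactoring": '("software" AND "refactoring")',
}

# Assemble one title/abstract/keywords query per SE task.
scopus_queries = {task: f"TITLE-ABS-KEY({dm} AND {terms})"
                  for task, terms in task_terms.items()}
for q in scopus_queries.values():
    print(q)
```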

**Figure 5** shows the publications on classification, clustering, text mining, and association rule mining as a percentage of the total number of papers obtained by a Scopus query for each SE task. For example, in defect prediction, the number of studies is 339 in classification, 64 in clustering, 8 in text mining, and 25 in association rule mining. As can be seen from the pie charts, while clustering is a popular DM technique in refactoring, no study related to text mining was found in this field. In the other SE tasks, the most preferred technique is classification, and the second is clustering.
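The defect-prediction slice of Figure 5 can be reproduced directly from the counts quoted above:

```python
# Paper counts per data-mining technique for defect prediction (from text).
counts = {"classification": 339, "clustering": 64,
          "text mining": 8, "association rule mining": 25}

total = sum(counts.values())
# Convert raw counts into pie-chart percentage shares.
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(total, shares)  # classification dominates the pie
```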

Defect prediction studies generally compare learning algorithms in terms of whether they find defects correctly using classification algorithms. Besides this approach, in some studies, clustering algorithms were used to select features [110] or to compare supervised and unsupervised methods [27]. In the text mining area, TF-IDF techniques were generally used to extract features from scripts [111, 112]. Although many different algorithms have been used in defect prediction, the most popular ones are NB, MLP, and RBF.
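A minimal sketch of that TF-IDF feature-extraction step, with a toy corpus of scripts standing in for the real data of [111, 112]; the token pattern is an illustrative assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "scripts": in the cited studies these would be real source files.
scripts = [
    "open file read buffer close",
    "open socket send buffer close",
    "parse config read key value",
]

# Turn each script into a TF-IDF weighted term vector; rare terms such as
# "socket" get higher weight than terms shared by many scripts.
vec = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
X = vec.fit_transform(scripts)
print(X.shape)  # one row per script, one column per vocabulary term
```

The resulting matrix `X` is what a downstream defect classifier would be trained on.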

**Figure 4.**

*Number of publications of data mining studies for SE tasks from Scopus search by their years.*

**Table 5.**
*Data mining and machine learning studies on the subject "refactoring."*

| **Ref.** | **Year** | **Task** | **Objective** | **Algorithms** | **EL** | **Dataset** | **Evaluation metrics and results** |
|---|---|---|---|---|---|---|---|
| [106] | 2017 | Classification | Exploring the impact of clone refactoring (CR) on the test code size | LR, KNN, NB | RF | Data collected from an open-source Java software system (ANT) | PR, recall, accuracy, F-score; kNN and RF outperform; ACC (fitting (98%), LOOCV (95%), and 10 FCV (95%)) |
| [107] | 2017 | Classification | Finding refactoring opportunities in source code | J48, BayesNet, SVM, LR | RF | Ant, ArgoUML, jEdit, jFreeChart, Mylyn | 10-fold CV, PR, recall; 86–97% PR and 71–98% recall for proposed tech |
| [108] | 2018 | Classification | A learning-based approach (CREC) to extract refactored and non-refactored clone groups from repositories | C4.5, SMO, NB, RF | Adaboost | Axis2, Eclipse.jdt.core, Elastic Search, JFreeChart, JRuby, and Lucene | PR, recall, F-score; F-score = 83% in the within-project and 76% in the cross-project setting |
| [109] | 2018 | Clustering | Combination of multi-objective search and unsupervised learning to decrease developer's effort | GMM, EM | — | ArgoUML, JHotDraw, GanttProject, Azureus, UTest, Apache Ant | One-way ANOVA with a 95% confidence level (α = 5%) |

*Data Mining - Methods, Applications and Systems*
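The evaluation style reported in Table 5, for instance the 10-fold cross-validated precision and recall of [107], can be sketched as follows; the synthetic dataset and the random-forest choice stand in for the studies' real code-metric data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for a refactoring-opportunity dataset:
# rows are code entities, the label marks "should be refactored".
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 10-fold cross-validation reporting precision (PR) and recall,
# matching the metrics listed in the table.
scores = cross_validate(RandomForestClassifier(random_state=0), X, y,
                        cv=10, scoring=("precision", "recall"))
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```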

**Figure 6** shows the number of document types (conference paper, book chapter, article, book) published between 2010 and 2019. It is clearly seen that conference papers and articles are the most preferred publication types, and that there is no review article about data mining studies in design pattern mining.


#### **Abbreviations**

LMT logistic model trees
MAE mean absolute error
RBF radial basis function
RUS random undersampling
GMM Gaussian mixture model
EM expectation maximization
LR logistic regression
SMB SMOTEBoost
THM threshold-moving
BNC AdaBoost.NC
RF random forest
CC correlation coefficient
BayesNet Bayesian network
Rip repeated incremental pruning
NNge nearest neighbor generalization
PCA principal component analysis
PAM partitioning around medoids
LS-SVM least-squares support vector machines
RUS-bal balanced version of random undersampling
SMOTE synthetic minority over-sampling technique
SMO sequential minimal optimization
ROC receiver operating characteristic


**Table 6** shows popular repositories that contain various datasets and their descriptions, which tasks they are used for, and hyperlinks to download. For example, the PMART repository includes source files of Java projects, and the PROMISE repository has different datasets with software metrics such as cyclomatic complexity, design complexity, and lines of code. Since these repositories contain many datasets, no detailed information about them has been provided in this article.

**Figure 5.**
*Number of publications of data mining studies for SE tasks from Scopus search by their topics.*

**Figure 6.**
*The number of publications in terms of document type between 2010 and 2019.*


**Table 6.**
*Description of popular repositories used in studies.*


Refactoring can be applied at different levels; one study [105] predicted refactoring at the package level using hierarchical clustering, while another study [99] performed class-level refactoring prediction using LS-SVM as the learning algorithm, SMOTE for handling imbalanced data, and PCA for feature extraction.
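The oversample-then-extract-then-classify pipeline of [99] can be sketched as follows; a plain SVC stands in for LS-SVM, the oversampler is a minimal hand-rolled SMOTE-style interpolation rather than a library implementation, and all data and parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def smote_like(X_min, n_new):
    """Create n_new synthetic minority samples by interpolating between
    random pairs of existing minority samples (the core SMOTE idea)."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    gap = rng.random((n_new, 1))
    return X_min[i] + gap * (X_min[j] - X_min[i])

# Imbalanced toy data: 90 non-refactored (0) vs 10 refactored (1) classes.
X_maj = rng.normal(0.0, 1.0, size=(90, 5))
X_min = rng.normal(2.0, 1.0, size=(10, 5))
X_syn = smote_like(X_min, 80)  # synthesize 80 samples to balance classes

X = np.vstack([X_maj, X_min, X_syn])
y = np.array([0] * 90 + [1] * 90)

# PCA for feature extraction, then the classifier (SVC as LS-SVM stand-in).
Z = PCA(n_components=2).fit_transform(X)
clf = SVC().fit(Z, y)
print(clf.score(Z, y))
```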

#### **5. Conclusion and future work**

Data mining techniques have been applied successfully in many different domains. In software engineering, to improve the quality of a product, it is highly critical to find existing deficits such as bugs, defects, code smells, and vulnerabilities in the early phases of the SDLC. Therefore, many data mining studies in the past decade have aimed to deal with such problems. The present paper provides information about previous studies in the field of software engineering. This survey shows how classification, clustering, text mining, and association rule mining can be applied in five SE tasks: defect prediction, effort estimation, vulnerability analysis, design pattern mining, and refactoring. It clearly shows that classification is the most used DM technique. Therefore, future studies can focus on applying clustering to SE tasks.


| **Repository** | **Topic** | **Description** | **Web link** |
|---|---|---|---|
|  | Defect Pred., cost estimation | It includes 20 datasets for defect prediction and cost estimation |  |
|  | Defect Pred. | It includes software metrics, validated … (Eclipse PDE: …) |  |
| Nasa MDP | Defect Pred. | NASA's Metrics Data Program | https://github.com/opensciences/ope |
| Android Git | Defect Pred. | Android version bug reports | https://android.googlesource.com/ |