Analytical Statistics Techniques of Classification and Regression in Machine Learning

*Pramod Kumar, Sameer Ambekar, Manish Kumar and Subarna Roy*

#### **Abstract**

This chapter aims to introduce the common methods and practices of statistical machine learning techniques. It covers the development of algorithms, their applications, and the ways in which they learn from observed data by building models, which in turn can be used to make predictions. Although one might assume that machine learning and statistics are not closely related, it is evident that they go hand in hand. We observe how methods used in statistics, such as linear regression and classification, are made use of in machine learning. We also take a look at how classification and regression techniques are implemented. Although machine learning libraries provide standard implementations of a great many algorithms, we look at how to tune an algorithm and at which of its parameters or features affect its performance, based on statistical methods.

**Keywords:** machine learning, statistics, classification, regression, algorithms

#### **1. Introduction**

Stating that statistical methods are useful in machine learning is analogous to saying that woodworking methods are helpful for a carpenter. Statistics is the foundation of machine learning, although not all machine learning methods can be said to have been derived from statistics. To begin with, let us take a look at what statistics and machine learning mean.

Statistics is extensively used in areas of science and finance and in industry. Statistics is known to be a mathematical science and not just mathematics; it is said to have originated in the seventeenth century. It consists of data collection, organization of the data, analysis of the data, and interpretation and presentation of the data. Statistical methods have long been used in various fields to understand data efficiently and to gain an in-depth analysis of the data [1].

Machine learning, on the other hand, is a branch of computer science which uses statistical abilities to learn from a particular dataset [2]. The term was coined in the year 1959. A machine learning system learns using an algorithm and then has the ability to predict, based on the data it has been fed. Machine learning gives out more detailed information than statistics [3].

Most of the techniques of machine learning derive their behavior from statistics. However, not many people are familiar with this, since each field has its own jargon. For instance, what machine learning calls learning is called fitting in statistics, and supervised learning in machine learning corresponds to regression and classification in statistics. Machine learning is a subfield of computer science and artificial intelligence. It makes fewer assumptions than statistics does. Unlike statistics, machine learning deals with large amounts of data, and it requires minimal human effort, since most of the computation is done by the machine or the computer itself. Machine learning also has stronger predictive power than statistics. Depending on the type of data, machine learning can be categorized into supervised learning, unsupervised learning and reinforcement learning [4].

Statistical methods are not just useful for training the machine learning model; they are also helpful in many other stages of machine learning, such as:

• Data preparation—statistics is used to preprocess the data before it is sent to the model. For instance, when there are missing values in the dataset, we compute the statistical mean or median and fill the empty cells with it, since a machine learning model should never be fed a dataset with empty cells. Statistics is also used in the preprocessing stage to scale the data, so that the values fall within a particular range and the mathematical computation becomes easier during training of the machine learning model (see the sketch after this list).


• Model evaluation—no model predicts perfectly when it is built for the first time; simply building the model is not enough. It is vital to check how well it performs and, if it falls short, how close it is to being accurate enough. Hence, we evaluate the model with statistical methods, which tell us how accurate the result is and reveal a lot more about the end result obtained. We make use of metrics such as the confusion matrix, Kolmogorov-Smirnov chart, AUC-ROC and root mean squared error, among many others, to enhance our model.

• Model selection—we train the model with many algorithms, and only the one which gives more accurate results than the others is to be selected. The process of selecting the right solution is called model selection. Two statistical methods can be used to select the appropriate model: statistical hypothesis testing and estimation statistics [5].

• Data selection—some datasets carry a lot of features with them. Of the many features, it may happen that only some contribute significantly to estimating the result, and considering all of the features becomes computationally expensive as well as time consuming. By making use of statistical concepts we can eliminate the features which do not contribute significantly to producing the result; that is, statistics helps in finding out the variables or features on which any result depends. It is important to note that this method requires a careful and skilled approach, without which it may lead to wrong results.
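
As a minimal sketch of the data-preparation step above, the snippet below fills missing cells with the column mean and then scales the values to a fixed range; the small DataFrame, SimpleImputer and MinMaxScaler are illustrative assumptions rather than the chapter's prescribed tools.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    # Hypothetical dataset with missing cells
    df = pd.DataFrame({"height": [150, None, 180], "weight": [50, 65, None]})

    # Fill empty cells with the statistical mean (or median) of each column
    imputer = SimpleImputer(strategy="mean")
    filled = imputer.fit_transform(df)

    # Scale every value into the range [0, 1] to ease later computation
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(filled)
    print(scaled)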



There is an analogy between machine learning and statistics. **Figure 1**, taken from a textbook, shows how statistics and machine learning each visualize a model, and **Table 1** shows how terms from statistics have been coined anew in machine learning.

To understand how machine learning and statistics arrive at their results, let us look at **Figure 1**. In the statistical modeling on the left half of the image, linear regression with two variables fits the best plane with the fewest errors. In machine learning, on the right half of the image, the independent variables have been converted into squared error terms in order to fit the model in the best possible way. That is, machine learning strives to get a better fit than the statistical model; in doing so, it minimizes the errors and increases the prediction rates.



| Machine learning | Statistics |
|---|---|
| Network, graphs | Model |
| Weights | Parameters |
| Learning | Fitting |
| Generalization | Test set performance |
| Supervised learning | Regression/classification |
| Unsupervised learning | Density estimation, clustering |

**Table 1.**
*Machine learning jargons and corresponding statistics jargons.*

**Figure 1.**
*Statistical and machine learning method.*



In this chapter we take a look at how statistical methods such as regression and classification are used in machine learning, with their own merits and demerits.

#### **2. Regression**


Regression is a statistical measure used in finance, investing and many other areas; it aims to determine the relationship between the dependent variable and 'n' independent variables. Regression consists of two types:

• Linear regression—where one independent variable is used to explain or predict the outcome of the dependent variable.

• Multiple regression—where two or more independent variables are used to explain or predict the outcome of the dependent variable.

In statistical modeling, regression analysis consists of a set of statistical methods to estimate how the variables are related to each other.

Linear and logistic regression are the types most widely used in predictive modeling [6].

Linear regression assumes that the relationship between the variables is linear, that is, that they are linearly dependent. The input variables consist of X1, X2, …, Xn (where n is a natural number).

Linear models were developed a long time ago, but they are still able to produce significant results; even in the era of modern computers they hold up well. They are widely used because they are not complex in nature, and in prediction they can even outperform complex nonlinear models.

There are a number of regression techniques that can be performed. We look at the five most widely used types of regression:

• Linear regression

• Logistic regression

• Polynomial regression

• Stepwise regression

• Ridge regression
Any regression method would involve the following:

• The dependent variable, also known as the output variable

• The independent variables, also known as the input variables

• The unknown parameters, denoted by beta (β)
It is denoted in the form of a function as:

$$\mathbf{Y} \approx \mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) \tag{1}$$


#### **2.1 Linear regression**

It is the most widely used regression type by far. Linear regression establishes a relationship between the input variables (independent variables) and the output variable (dependent variable).

That is,

$$\mathbf{Y} = \mathbf{X}_1 + \mathbf{X}_2 + \dots + \mathbf{X}_n$$

It assumes that the output variable is a combination of the input variables. A linear regression line is represented by Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is 'b', and 'a' is the intercept (the value of y when x = 0).

A linear regression is represented by the equation:

$$\mathbf{Y} = \mathbf{a} + \mathbf{b}\mathbf{X}$$

where X indicates independent variables and 'Y' is the dependent variable [7]. This equation when plotted on a graph is a line as shown below in **Figure 2**.


**Figure 2.** *Linear regression on a dataset.*
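
As a bridge between the statistical view and the scikit-learn usage shown later in this section, the sketch below (with a small assumed toy dataset of heights and weights) computes the slope b and intercept a of Y = a + bX using the classic least-squares formulas b = cov(X, Y)/var(X) and a = mean(Y) − b·mean(X).

    import numpy as np

    # Toy data assumed purely for illustration: x = heights, y = weights
    x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
    y = np.array([50.0, 56.0, 63.0, 71.0, 80.0])

    # Classic least-squares estimates of slope b and intercept a
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    print("slope=", b, "intercept=", a)

    # Predicted values on the fitted line Y = a + bX
    y_hat = a + b * x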


However, linear regression makes the following assumptions:

• There is a linear relationship between the independent and dependent variables

• There exists multivariate normality

• There is homoscedasticity, that is, the variance of the errors is constant

• There exists no auto-correlation between the variables

• There exists no multicollinearity, or only little multicollinearity, among the variables
It is fast and easy to model, and it is usually used when the relationship to be modeled is not complex. It is easy to understand. However, linear regression is sensitive to outliers.

Note: in all of the usages stated in this chapter, we have assumed the following: the dataset has been divided into a training set (denoted by X) and a test set (denoted by y_test).

The regression object "reg" has been created and exists.

We have used the following libraries:

• Scipy and Numpy for numerical calculations

• Pandas for dataset handling

• Scikit-learn to implement the algorithms, to split the dataset and for various other purposes.

Usage of linear regression in Python:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model

    # Declare the linear regression object
    reg = linear_model.LinearRegression()

    # Call the fit method on the training data
    # (e.g., X holds the heights and y the corresponding weights)
    reg.fit(X, y)

    # Check slope and intercept
    m = reg.coef_[0]
    b = reg.intercept_
    print("slope=", m, "intercept=", b)

    # Check the accuracy (R^2 score) on the training set
    reg.score(X, y)
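
Since the note above assumes an existing train/test split, the following sketch shows one common way to obtain such a split and to score the fitted line on data it has not seen; train_test_split and the 80/20 ratio are illustrative assumptions, not prescribed by the chapter.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Split the full dataset into a training part and a held-out test part
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    reg = LinearRegression()
    reg.fit(X_train, y_train)

    # R^2 on unseen data is a fairer estimate of predictive power
    print("train score:", reg.score(X_train, y_train))
    print("test score:", reg.score(X_test, y_test))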

#### **2.2 Logistic regression**

Logistic regression is used when the dependent variable is binary (True/False) in nature. The predicted value of y ranges from 0 to 1 (**Figure 3**), and the model is represented by the equation:


$$\text{odds} = \frac{p}{1-p} = \frac{\text{probability that the event will occur}}{\text{probability that the event will not occur}}$$

$$\ln(\text{odds}) = \ln\left(\frac{p}{1-p}\right) \tag{2}$$

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + \dots + b_kX_k$$

Logistic regression is used in classification problems, for example to classify emails as spam or not, or to predict whether a tumor is malignant or not. It is not mandatory that the input variables have a linear relationship to the output variable [8], the reason being that logistic regression applies a nonlinear log transformation to the predicted odds. It is advised to make use of only the variables which are powerful predictors, to increase the algorithm's performance.

However, it is important to note the following while making use of logistic regression:

• It doesn't handle a large number of categorical features well.

• The non-linear features should be transformed before using them.

Usage of logistic regression in Python:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Instantiate a logistic regression model, and fit with X and y
    reg = LogisticRegression()
    reg.fit(X, y)

    # Check the accuracy on the training set
    reg.score(X, y)

**Figure 3.** *Standard logistic function.*
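
To connect the fitted model back to Eq. (2), the following sketch (an illustrative addition that assumes the reg object and data X from the usage above) recovers the predicted probability, the odds and the log-odds for one sample; predict_proba and decision_function are standard scikit-learn methods.

    import numpy as np

    # Predicted probability that the first sample belongs to the positive class
    p = reg.predict_proba(X[:1])[0, 1]

    # Odds and log-odds (logit), as defined in Eq. (2)
    odds = p / (1 - p)
    logit = np.log(odds)

    # decision_function returns b0 + b1X1 + ... + bkXk, which equals the logit
    print("logit:", logit, "linear part:", reg.decision_function(X[:1])[0])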


#### **2.3 Polynomial regression**

It is a type of regression where the power of the independent variable is greater than 1, that is, the equation contains terms of a higher degree:

$$Y = a + b(X)^2 + X^3 + \dots + X^n \tag{3}$$

The plotted graph is usually a curve in nature, as shown in **Figure 4**.

**Figure 4.**
*Plotted graph looks like a curve in nature.*

If the degree of the equation is 2 it is called quadratic, if 3 it is called cubic, and if 4 it is called quartic. Polynomial regressions are fit with the method of least squares, since least squares minimizes the variance of the unbiased estimators of all the coefficients under the conditions of the Gauss-Markov theorem. Although we may be tempted to fit a higher-degree polynomial so that we get a lower error, doing so may cause over-fitting [9]. Some guidelines which are to be followed are:

• The model is more accurate when it is fed with a large number of observations.

• It is not a good idea to extrapolate beyond the limits of the observed values.

• Values for the predictor shouldn't be too large, or else they will cause overflow with a higher degree.

Example usage of polynomial regression in Python:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # Makes use of a pre-processor parameter called degree for the function
    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    # Fit an ordinary linear model on the expanded polynomial features
    reg = LinearRegression().fit(X_poly, y)

    # Check the accuracy on the training set
    reg.score(X_poly, y)

#### **2.4 Step-wise regression**

This type of regression is used when we have multiple independent variables, and the independent variables to keep are selected by an automatic process. If used in the right way it gives us more power and presents us with a ton of information, and it can be used when the number of variables is too large. However, if it is used haphazardly it may affect the model's performance.
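
Scikit-learn does not ship a classical forward/backward stepwise procedure, but as a rough analogue of the automatic selection described above one can use recursive feature elimination; this sketch is only an illustrative assumption (the data X, y and the choice of keeping 5 features are placeholders), not the chapter's own method.

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    # Recursively drop the weakest features until only 5 remain
    selector = RFE(LinearRegression(), n_features_to_select=5)
    selector.fit(X, y)

    # Boolean mask and ranking of the features the automatic process kept
    print(selector.support_)
    print(selector.ranking_)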

