**3.1 Steps involved in the PCA algorithm**

Carrying out PCA involves the following five significant steps [3]:


### *3.1.1 Standardization*

This step standardizes the range of the continuous initial variables so that each contributes equally to the analysis.

Standardization is essential before PCA because PCA is sensitive to the variances of the initial variables. If the ranges of the initial variables differ significantly, the variables with larger ranges will dominate those with smaller ranges (for example, a variable ranging from 0 to 100 will dominate a variable ranging from 0 to 1), producing biased results. Transforming the data to comparable scales prevents this problem. Mathematically, standardization is done by subtracting the mean and dividing by the standard deviation for each value of each variable:

$$Z = \frac{\text{Value} - \text{mean}}{\text{Standard deviation}} \tag{3}$$

Once the standardization is complete, all variables are on the same scale.
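As a minimal sketch of Eq. (3) in NumPy (the matrix `X` and its values are hypothetical placeholders):

```python
import numpy as np

# Hypothetical data matrix: five observations of two variables on very different scales
X = np.array([[90.0, 0.2],
              [75.0, 0.9],
              [60.0, 0.5],
              [85.0, 0.1],
              [70.0, 0.7]])

# Eq. (3): subtract each variable's mean and divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # approximately 0 for every variable
print(Z.std(axis=0))   # exactly 1 for every variable
```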

### *3.1.2 Covariance matrix computation*

The goal of this step is to understand how the variables of the input dataset vary from the mean with respect to one another, that is, whether there is any relationship between them. Variables can be so highly correlated that they contain redundant information, and the covariance matrix is computed to identify these correlations.

The covariance matrix, a symmetric *p* × *p* matrix (where *p* is the number of dimensions), contains the covariances of all possible pairs of the initial variables. The covariance matrix for a three-dimensional dataset with three variables *x*, *y*, and *z* is a 3 × 3 matrix of the form:

$$\begin{bmatrix}\mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z)\\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z)\\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z)\end{bmatrix} \tag{4}$$


Since the covariance of a variable with itself equals its variance (Cov(*a*, *a*) = Var(*a*)), the main diagonal (top left to bottom right) holds the variances of the initial variables. And since covariance is commutative (Cov(*a*, *b*) = Cov(*b*, *a*)), the entries are symmetric about the main diagonal, so the upper and lower triangular parts of the matrix are equal.

The sign of the covariance is what relates it to correlation: if it is positive, the two variables increase or decrease together (they are correlated); if it is negative, one increases when the other decreases (they are inversely correlated).
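Continuing the hypothetical sketch from Section 3.1.1, the covariance matrix of Eq. (4) can be computed with NumPy:

```python
import numpy as np

# Standardized data from the previous sketch (rows = observations, columns = variables)
X = np.array([[90.0, 0.2], [75.0, 0.9], [60.0, 0.5], [85.0, 0.1], [70.0, 0.7]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(Z, rowvar=False)      # p x p covariance matrix, as in Eq. (4)

print(np.allclose(cov, cov.T))     # True: Cov(a, b) = Cov(b, a)
print(np.diag(cov))                # variances of the variables on the main diagonal
```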


### *3.1.3 Computation of eigenvectors and eigenvalues of the covariance matrix to identify the principal components*

Eigenvectors and eigenvalues are the linear algebra concepts that must be computed from the covariance matrix in order to uncover the principal components of the data. Before going into the details, let us first establish what "principal components" are.

Principal components are new variables constructed as linear combinations of the initial variables. These combinations are made so that the new variables (i.e., the principal components) are uncorrelated and most of the information in the initial variables is squeezed or compressed into the first components. So, 10-dimensional data yields ten principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until the result looks like **Figure 1** below.

Organizing information this way lets one reduce dimensionality without losing much information: the components carrying little information are discarded, and the remaining components are treated as the new variables. Because the principal components are constructed as linear combinations of the initial variables, they are less interpretable and carry no intrinsic meaning.

Geometrically speaking, principal components are the directions of the data that explain a maximal amount of variance, that is, the lines that capture the most information in the data. The larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it holds. Put simply, the principal components are new axes that provide the best angle from which to see and evaluate the data, making the differences between observations easier to spot.

**Figure 1.** *Principal components vs. percentage of explained variances.*
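To illustrate the decreasing pattern of Figure 1, here is a small sketch using scikit-learn's PCA on synthetic 10-dimensional data (the data-generating choices are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))                  # one dominant hidden factor
X = latent @ rng.normal(size=(1, 10)) + 0.1 * rng.normal(size=(100, 10))

pca = PCA().fit(X)                                  # ten components for ten variables
print(pca.explained_variance_ratio_)                # decreasing proportions, as in Figure 1
```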

#### *3.1.3.1 Constructing principal components with PCA*

There are as many principal components as there are variables in the data, and the first principal component is constructed so that it accounts for the largest possible variance in the dataset.

The second principal component is determined in the same fashion as the first, except that it must be uncorrelated with (i.e., perpendicular to) the first and account for the next largest variance. This process is repeated until the number of principal components equals the number of variables.

#### *3.1.3.2 Finding eigenvalues and eigenvectors*

Having established what the principal components are, let us return to eigenvalues and eigenvectors. Remember that they always come in pairs, one eigenvalue per eigenvector, and that their number equals the number of data dimensions: a three-dimensional dataset has three variables, hence three eigenvectors with three corresponding eigenvalues. The eigenvectors of the covariance matrix are the directions of the axes with the most variance, that is, the principal components. The eigenvalues are simply the coefficients attached to the eigenvectors and give the amount of variance carried by each principal component. Ranking the eigenvectors in order of their eigenvalues, from highest to lowest, gives the principal components in order of significance.

Assume the dataset is two-dimensional, with two variables *x* and *y*, and that the eigenvectors and eigenvalues of the covariance matrix are:

$$v_1 = \begin{bmatrix} \mathbf{0.6778736} \\ \mathbf{0.7351785} \end{bmatrix} \quad \lambda_1 = \mathbf{1.284028} \tag{5}$$

$$v_2 = \begin{bmatrix} -\mathbf{0.7351785} \\ \mathbf{0.6778736} \end{bmatrix} \quad \lambda_2 = \mathbf{0.04908323} \tag{6}$$

Sorting the eigenvalues in descending order gives *λ*<sub>1</sub> > *λ*<sub>2</sub>, which means that the eigenvector of the first principal component is v1 and the eigenvector of the second is v2. To find the proportion of variance (information) that each component accounts for, divide its eigenvalue by the sum of the eigenvalues. In the example above, PC1 and PC2 therefore carry 96% and 4% of the variance of the data, respectively.
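The 96%/4% split can be verified directly from the eigenvalues of Eqs. (5) and (6); the second half of this sketch shows the general sort-by-eigenvalue step on a hypothetical covariance matrix:

```python
import numpy as np

# Check the proportions using the eigenvalues from Eqs. (5) and (6)
eigvals = np.array([1.284028, 0.04908323])
print(eigvals / eigvals.sum())             # [0.963... 0.036...] -> roughly 96% and 4%

# In general: eigendecompose a (hypothetical) symmetric covariance matrix
# and sort the eigenpairs so the largest eigenvalue comes first
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
vals, vecs = np.linalg.eigh(cov)           # eigh is appropriate for symmetric matrices
order = np.argsort(vals)[::-1]             # indices for descending eigenvalue order
vals, vecs = vals[order], vecs[:, order]   # columns of vecs are the eigenvectors
```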

### *3.1.4 Feature vector creation*

The principal components, in order of significance, are identified by computing the eigenvectors and sorting them by their eigenvalues in descending order. At this phase, one must decide whether to keep all of these components or to discard those with low eigenvalues, and then form a matrix from the remaining eigenvectors. This matrix, the feature vector, is simply a matrix whose columns are the eigenvectors of the components one decides to keep. If only *p* eigenvectors (components) are kept out of *n*, the final dataset will have only *p* dimensions.

Continuing the example above, one can form a feature vector from both of the eigenvectors v1 and v2:

$$
\begin{bmatrix}
\mathbf{0.6778736} & -\mathbf{0.7351785} \\
\mathbf{0.7351785} & \mathbf{0.6778736}
\end{bmatrix}
\tag{7}
$$

Alternatively, one can omit the less relevant eigenvector v2 and solely utilize v1 to generate a feature vector:

$$
\begin{bmatrix}
\mathbf{0.6778736} \\
\mathbf{0.7351785}
\end{bmatrix}
\tag{8}
$$

Discarding the eigenvector v2 reduces the dimensionality of the final dataset by one and causes a loss of information. The loss will be minimal, however, because v2 carried only 4% of the information, while v1 retains 96%.

As in the previous scenario, it is up to the analyst to decide whether to keep all the components or discard the less significant ones. Discarding components is unnecessary if all one wants is to describe the data in terms of new uncorrelated variables (the principal components) without attempting to reduce dimensionality.
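In code, forming the feature vectors of Eqs. (7) and (8) amounts to stacking the chosen eigenvectors as columns:

```python
import numpy as np

# Eigenvectors from Eqs. (5) and (6)
v1 = np.array([0.6778736, 0.7351785])
v2 = np.array([-0.7351785, 0.6778736])

feature_vector_full = np.column_stack([v1, v2])  # Eq. (7): keep both components
feature_vector_pc1 = v1.reshape(-1, 1)           # Eq. (8): keep only the first component
print(feature_vector_pc1.shape)                  # (2, 1): two variables, one component
```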

### *3.1.5 Recast the data along the axes of the principal components*

In the previous steps, apart from standardization, the data are not modified; one merely selects the principal components and forms the feature vector, while the input dataset always remains in terms of the original axes (i.e., the initial variables). In this final step, the feature vector formed from the eigenvectors of the covariance matrix is used to reorient the data from the original axes to the axes represented by the principal components (hence the name principal component analysis). This is done by multiplying the transpose of the feature vector by the transpose of the standardized original dataset. Therefore,

$$\text{Final dataset} = \text{Feature vector}^{T} \times \text{Standardized original dataset}^{T} \tag{9}$$
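As a final sketch of Eq. (9), assuming a hypothetical standardized dataset `Z` and the single-component feature vector of Eq. (8):

```python
import numpy as np

# Hypothetical standardized dataset: rows = observations, columns = variables x and y
Z = np.array([[ 1.2,  1.1],
              [-0.9, -1.0],
              [ 0.3,  0.4],
              [-0.6, -0.5]])

# Feature vector of Eq. (8): keep only the first principal component
W = np.array([[0.6778736],
              [0.7351785]])

# Eq. (9): final dataset = feature vector^T x standardized original dataset^T
final = W.T @ Z.T        # shape (1, n): one PC1 coordinate per observation
print(final.T)           # each row is an observation recast along PC1
```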
