**1. Introduction**

The established financial reporting system within an entity is the basic source of information on its financial position and results. The economic and financial globalization of the world market has emphasized the importance of high quality financial reporting. For the business decision-making process, financial and audit reports are the main source of information, as they contain information on financial position, business results, changes in equity, cash flows and other reliable information [1]. The development of the capital market and the growing number of interested parties (investors) have created an even higher demand for reliable, timely and fair financial statements as the main output of financial reporting. Relationships between the state and society, owners of capital and management, and various other stakeholders have been further improved by high quality financial reporting and audit processes. However, in order to fulfill their main purpose for all interested parties, financial statements must provide information that is true, objective, comprehensible, comparable and uniform [2].

In the first place, financial statements have to be publicly available, which is usually regulated by law. For example, the Law on Accounting of the Republic of Serbia prescribes that all business entities must submit their financial reports to the competent institution, which subsequently publishes them on its official website [3]. Information contained in financial statements can be used for numerous purposes. Other business entities can use it in making business, financial, investment and other decisions, while banks and financial institutions can use it to approve loans or assess investment risks related to a particular business entity. However, the financial information contained in financial statements is unprocessed: it is raw data that should be analyzed in order to assess the performance of a business entity. Aside from the Notes to financial statements, the one qualitative statement that business entities prepare and report, all other statements are quantitative in nature and offer hundreds of pieces of data. It is therefore of great importance to analyze the collected data in order to gain a solid basis for the business decision-making process.

Analysis of financial statements is one of the most common methods of assessing business performance. Its main goal is to obtain information on the performance of the observed company, i.e. its liquidity, profitability and solvency. Measuring financial performance using compiled and disclosed financial statements is a quantitative analysis of the position of the observed company, including the way the company uses the capital invested in its business. A high quality analysis of an entity's performance provides a comprehensive picture of the business and meets the information needs of stakeholders. The authors of [4] point out that the analysis of financial performance is crucial in determining how efficiently available resources are used. Likewise, an entity's owners can assess management skills and the decisions made in previous and current reporting periods, analyze the entity's strengths and weaknesses, and thereby improve overall performance [5–7].

Some pieces of data disclosed in financial statements, such as Total assets, Sales revenue, or Net result, have informational power on their own. However, the informational power of data increases when it is put into relation with other pieces of data. Financial statement analysis using ratios has therefore been one of the most commonly used methods of assessing business performance. A financial ratio is a relative magnitude of two (or more) selected numerical values taken from financial statements. For example, the relation between Net result and Equity shows how many dollars of profit an entity earns for each dollar invested in equity. Results of financial statement analysis can be used to compare the performance of an entity over a period of time, or to compare it with other entities within the industry. However, since financial statement analysis takes time and there are numerous financial ratios that analysts could use (most of which are correlated), the number of ratios being calculated and assessed should be reduced so that an analyst can focus on several of them without losing data relevant to the analysis [8]. One method that can be used is Principal Component Analysis (PCA), which reduces the number of observed variables for any further regression or other type of analysis [9]. PCA has found numerous applications in different industries, for example in image compression [9–11], as well as in biometrics or "bioimaging", where a person's physical characteristics are used for identification, with applications in communication devices and security systems.
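As a simple illustration, the following Python sketch computes a few common profitability ratios from hypothetical statement figures (all numbers are invented for illustration) and shows how correlated such ratios tend to be:

```python
import pandas as pd

# Hypothetical financial-statement data for five entities (illustrative only).
data = pd.DataFrame({
    "total_assets": [1200.0, 850.0, 2300.0, 640.0, 1500.0],
    "equity":       [480.0,  300.0, 1100.0, 210.0,  620.0],
    "sales":        [950.0,  700.0, 1900.0, 520.0, 1250.0],
    "net_result":   [60.0,   35.0,  150.0,  18.0,   80.0],
})

ratios = pd.DataFrame({
    # Return on equity: profit earned per dollar invested in equity.
    "roe": data["net_result"] / data["equity"],
    # Return on assets and net margin, two further profitability ratios.
    "roa": data["net_result"] / data["total_assets"],
    "net_margin": data["net_result"] / data["sales"],
})

# Many ratios are strongly correlated, which motivates reducing their
# number with PCA before further analysis.
print(ratios.corr().round(2))
```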

The significance of PCA results lies in the fact that they enable more effective and efficient performance analysis, whether for a single entity, for all business entities within an industry, or, when the analyzed financial data covers the whole economy, for all entities within it. The main advantages of PCA are the precision of its results, the reduction of the time needed for analysis and evaluation, and the reduction of the analyst's related costs and effort.

With the development of technology, we have gained the ability to generate massive amounts of data, and the use of correct methodologies for data analysis has become essential when dealing with complex financial challenges. In this paper, we discuss the theory underlying PCA, one of the most widely used statistical tools in the field of financial data analysis. To ensure that the proper method is used for the analysis, theoretical knowledge and an understanding of statistical methods are essential.

## **1.1 General postulates of PCA**

PCA is primarily designed as a statistical technique that reduces the dimensionality of complex data sets while preserving maximum variance. Since research in the financial sector involves both a large amount of data and a large number of variables at once, analyzing such data directly is difficult.

Visualization techniques are only useful in two- or three-dimensional spaces, and single-variable analysis does not provide precise results because of overlapping variance. To achieve dimensionality reduction, it is necessary to generate principal components, i.e., a new set of variables, each a linear combination of the original variables. PCA can be used for a variety of tasks. A small number of components is usually sufficient to capture most of the variability of a data set. Since the number of variables is reduced by using principal components, the complexity of the analysis itself is also reduced, as a large number of output variables no longer has to be analyzed.

The standard PCA procedure takes as its starting point a data set in which $m$ numerical variables are observed for each of $n$ individuals. These data are defined by the vectors $x_1, \dots, x_m$, or by the $n \times m$ data matrix $X$ whose $j^{th}$ column is the vector $x_j$ of observations on the $j^{th}$ variable. Linear combinations of the columns of $X$ with maximum variance have the form $\sum_{j=1}^{m} c_j x_j = Xc$, where $c$ stands for the vector of constants $c_1, c_2, \dots, c_m$. The variance of such a linear combination is obtained as $var(Xc) = c'Mc$, where $M$ stands for the sample covariance matrix. Finding a linear combination with maximum variance is the same as finding an $m$-dimensional vector $c$ that maximizes the quadratic form $c'Mc$. For this to be well defined, it is necessary to impose a constraint, usually that $c$ has unit norm, i.e. $c'c = 1$. The problem is then equivalent to maximizing $c'Mc - \lambda(c'c - 1)$, where $\lambda$ represents the Lagrange multiplier. Differentiating with respect to $c$ and equating the result to the zero vector gives the following equation:

$$
\mathbf{M}\mathbf{c} - \lambda\mathbf{c} = \mathbf{0} \\
\Leftrightarrow \mathbf{M}\mathbf{c} = \lambda\mathbf{c} \tag{1}
$$

This equation remains valid when the eigenvectors are multiplied by $-1$. Here, $c$ is an eigenvector of the covariance matrix $M$ and $\lambda$ is the corresponding eigenvalue. We need the largest eigenvalue $\lambda_1$ and its corresponding eigenvector $c_1$, since the variance of a linear combination is given by the eigenvalue associated with its eigenvector $c$: $var(Xc) = c'Mc = \lambda c'c = \lambda$. The covariance matrix $M$ is a symmetric $m \times m$ matrix and has exactly $m$ real eigenvalues $\lambda_k \ (k = 1, \dots, m)$, whose corresponding eigenvectors can be chosen to form an orthonormal set, i.e. $c_k'c_{k'} = 1$ if $k = k'$ and $0$ otherwise. The eigenvectors of $M$ are used to obtain up to $m$ linear combinations $Xc_k = \sum_{j=1}^{m} c_{jk} x_j$ that successively maximize variance. The fact that the covariance between two linear combinations $Xc_k$ and $Xc_{k'}$ is given by $c_k' M c_{k'} = \lambda_{k'} c_k' c_{k'} = 0$ if $k' \neq k$ means that the combinations are uncorrelated [12]. The linear combinations $Xc_k$ represent the principal components of the data set. Several PCA terms are used for specific values: the elements of the linear combinations $Xc_k$ are called principal component scores, and the eigenvectors $c_k$ are called principal component loadings. It is standard to work with centered variables, with generic element $x_{ij}^{*} = x_{ij} - \bar{x}_j$, where $\bar{x}_j$ is the mean of the observed values for variable $j$.

The $n \times m$ matrix labeled $X^{*}$ contains the centered variables $x_j^{*}$ as columns, resulting in the following equation:

$$(n-1)M = X^{*\prime}X^{*} \tag{2}$$
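As a minimal sketch of the procedure above, assuming a small synthetic data matrix, PCA can be computed directly from Eqs. (1) and (2) with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 5
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))  # correlated columns

# Center the columns: x*_ij = x_ij - mean_j, giving X* (Eq. (2)).
X_star = X - X.mean(axis=0)

# Sample covariance matrix M = X*'X* / (n - 1).
M = (X_star.T @ X_star) / (n - 1)

# Eigendecomposition Mc = lambda * c (Eq. (1)); eigh suits symmetric M.
eigvals, eigvecs = np.linalg.eigh(M)

# Sort eigenpairs from largest to smallest eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_star @ eigvecs         # principal component scores Xc_k
print(eigvals)                    # variances of the components
print(np.cov(scores.T).round(8))  # off-diagonals ~0: components uncorrelated
```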

## **1.2 Premises of PCA**

For the final outcome of the PCA assessment to be successful and significant, several conditions must be met. First, the data must be continuous, with variables measured on an interval or ratio scale. This condition must be met because PCA relies on the correlation patterns among these variables.

Another crucial requirement is that the relationships between individual pairs of variables are linear. If nonlinear relationships exist between pairs of variables, appropriate data transformation techniques, such as logarithmic transformations, should be considered. Preprocessing premises for PCA include imputing missing values, handling outliers, and normalization (scaling). Outliers should be filtered out prior to the analysis, as they can bias the results by affecting the magnitude of correlations.

To obtain more accurate estimates of the population correlation parameters, a large sample size is required, and the relationships in the data must be linear. The basic principle of PCA is that directions of high variance carry the signal, while variables with low variance can be considered noise and left out of account. All variables must be measured at the same level of measurement.
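A minimal preprocessing sketch reflecting these premises (median imputation, a simple z-score outlier filter and standardization; the cutoff and the imputation rule are illustrative choices, not fixed rules):

```python
import numpy as np
import pandas as pd

def prepare_for_pca(df: pd.DataFrame, z_cutoff: float = 3.0) -> pd.DataFrame:
    """Impute missing values, drop gross outliers, and standardize."""
    # Fill missing values with the column median (one common choice).
    df = df.fillna(df.median(numeric_only=True))
    # Drop rows containing any value with |z| above the cutoff.
    z = (df - df.mean()) / df.std(ddof=1)
    df = df[(z.abs() <= z_cutoff).all(axis=1)]
    # Standardize the remaining rows so all variables share the same scale.
    return (df - df.mean()) / df.std(ddof=1)

# Hypothetical usage: a small ratio table with one gap and one outlier row.
raw = pd.DataFrame({
    "roe": [0.10, 0.12, np.nan, 0.11, 0.09, 0.10, 0.13, 0.08, 0.11, 2.50],
    "roa": [0.05, 0.06, 0.05, 0.07, 0.04, 0.05, 0.06, 0.05, 0.06, 0.90],
})
print(prepare_for_pca(raw, z_cutoff=2.5))  # the last row is filtered out
```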

## **1.3 Feature extraction in PCA**

Eq. (2) connects the eigenvalue decomposition of the covariance matrix $M$ with the singular value decomposition (SVD) of the column-centered matrix $X^{*}$. For a matrix $Y$ of dimension $n \times m$ and rank $r$, where necessarily $r \le \min\{n, m\}$, the SVD is written as follows:

$$Y = ULA' \tag{3}$$

where $U$ and $A$ are $n \times r$ and $m \times r$ matrices with orthonormal columns, so that $U'U = I_r = A'A$, with $I_r$ the $r \times r$ identity matrix, and $L$ is an $r \times r$ diagonal matrix. The columns of $A$ are called right singular vectors and are the eigenvectors of the $m \times m$ matrix $Y'Y$ associated with its non-zero eigenvalues. The columns of $U$ are called left singular vectors and are the eigenvectors of the $n \times n$ matrix $YY'$ associated with its non-zero eigenvalues. The singular values of $Y$ are the diagonal elements of $L$: the non-negative square roots of the (common) non-zero eigenvalues of both $Y'Y$ and $YY'$. The diagonal elements are taken to be sorted from largest to smallest, which determines the order of the columns of $U$ and $A$ in all cases except when some singular values are equal [12]. If we take $Y = X^{*}$, then the right singular vectors of $X^{*}$ are the vectors $c_k$ of principal component loadings. Because of the orthogonality of the columns of $A$, the columns of $X^{*}A = ULA'A = UL$ are the principal components of $X^{*}$. The variances of these principal components are obtained by squaring the singular values of $X^{*}$ and dividing by $n - 1$. This results in the following equation:

$$(n-1)M = X^{*\prime}X^{*} = (ULA')'(ULA') = ALU'ULA' = AL^{2}A' \tag{4}$$

Here $L^{2}$ stands for the diagonal matrix holding the squares of the singular values. This equation is the eigenvalue decomposition of the matrix $(n-1)M$, which shows that the singular value decomposition of the column-centered matrix $X^{*}$ is equivalent to PCA. Given the rank-$r$ matrix $Y$ of dimension $n \times m$, the matrix $Y_q$ of the same dimension but of lower rank $q < r$ whose elements minimize the sum of squared differences with the corresponding elements of $Y$ is obtained as:

$$Y_q = U_q L_q A_q' \tag{5}$$

Here $L_q$ stands for the $q \times q$ diagonal matrix containing the $q$ largest diagonal elements of $L$, while $U_q$ and $A_q$ are the $n \times q$ and $m \times q$ matrices obtained by keeping the first $q$ columns of $U$ and $A$. The $n$ rows of the rank-$r$ matrix $X^{*}$ define a scatter plot of $n$ points in an $r$-dimensional subspace of $\mathbb{R}^m$, whose origin is the center of gravity of the scatter plot. It follows from the equation above that the best approximation of the $n$ points of this scatter plot in a $q$-dimensional subspace is given by the rows of $X_q^{*}$, in the sense that the sum of squared distances between the corresponding points of the two scatter plots is minimal, as in Pearson's original approach [13]. The $q$ axes of this system define the principal subspace. It can be concluded that PCA is a dimensionality reduction method in which the set of $m$ original variables is replaced by a set of $q$ derived variables. When $q = 2$ or $q = 3$, a graphical approximation of the $n$-point scatter plot is possible, and it is very often used to visualize the whole data set. A very important property is that the solutions are nested across dimensions: the first $q$ components do not change when further components are added.
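The SVD route of Eqs. (3)–(5), including the best rank-$q$ approximation, can be sketched as follows (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 200, 6, 2
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, m))  # rank-3 structure
X_star = X - X.mean(axis=0)                            # column-centered

# SVD: X* = U L A' (Eq. (3)); the columns of A are the loadings c_k.
U, L, At = np.linalg.svd(X_star, full_matrices=False)

scores = U * L               # principal components X*A = UL
variances = L**2 / (n - 1)   # Eq. (4): eigenvalues of M

# Best rank-q approximation Y_q = U_q L_q A_q' (Eq. (5)).
X_q = (U[:, :q] * L[:q]) @ At[:q, :]
err = np.linalg.norm(X_star - X_q, "fro") ** 2
print(variances.round(4), err.round(4))
```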

The variability associated with the set of retained principal components can be used to assess the quality of any $q$-dimensional approximation. By standard matrix theory results, the trace of the covariance matrix $M$, i.e. the sum of its diagonal elements, equals the sum of the variances of the $m$ variables, and it is easy to prove that this is also the sum of the variances of all $m$ principal components. Consequently, the proportion of the overall variation accounted for by a given principal component is a standard measure of its quality and is equal to:

$$\pi_j = \frac{\lambda_j}{\sum_{k=1}^{m} \lambda_k} = \frac{\lambda_j}{tr(M)} \tag{6}$$

where $tr(M)$ denotes the trace of $M$. Due to the nested behavior of principal components, we can also speak of the proportion of the total variance explained by a set $\mathcal{M}$ of principal components, which is usually expressed as a percentage of the total variance and accounts for:

$$\sum_{j \in \mathcal{M}} \pi_j \times 100\% \tag{7}$$

A common approach is to use a pre-specified percentage of the total variance to determine how many principal components to keep, although graphical constraints often lead to keeping only the first two or three. The percentage of total variance is also the basic tool for measuring the quality of these low-dimensional graphical representations of the data set.
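Continuing the SVD sketch above (reusing `variances` and `q` from that block), the quantities in Eqs. (6) and (7) are one line each:

```python
# Per-component proportion of total variance (Eq. (6)).
pi = variances / variances.sum()
print((pi * 100).round(2))            # percentages for all m components

# Percentage explained by the set of the first q components (Eq. (7)).
print((pi[:q].sum() * 100).round(2))
```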

The central practical problem is choosing the number of components that captures a sufficient proportion of the variance while still achieving a reduction in dimensionality. There are several ways to determine this number, and one of them is to set a variance threshold.

The next very popular approach is the scree plot [14], where the components are arranged on the $X$-axis from largest to smallest eigenvalue. In this way, a large gap between important and less important components becomes visible. The only drawback of this approach is that the choice of the number of components is subjective.

The most popular method is parallel analysis [15], in which PCA is performed on simulated data sets with the same number of variables and observations as the original data set. The eigenvalues of the original data set are compared against the average eigenvalues of the simulated data sets, and any components whose eigenvalues in the original data are lower than the corresponding simulated averages are discarded.
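A minimal sketch of parallel analysis (the simulation count and the use of mean simulated eigenvalues are conventional choices, not prescribed by [15]):

```python
import numpy as np

def parallel_analysis(X: np.ndarray, n_sims: int = 100, seed: int = 0) -> int:
    """Return the number of components whose eigenvalues exceed the
    average eigenvalues obtained from simulated uncorrelated data."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Xc = X - X.mean(axis=0)
    eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (n - 1)))[::-1]

    sim_eig = np.zeros((n_sims, m))
    for s in range(n_sims):
        # Uncorrelated normal data on the same scale as the original columns.
        R = rng.normal(size=(n, m)) * X.std(axis=0, ddof=1)
        Rc = R - R.mean(axis=0)
        sim_eig[s] = np.sort(np.linalg.eigvalsh(Rc.T @ Rc / (n - 1)))[::-1]

    return int(np.sum(eig > sim_eig.mean(axis=0)))

# Example: data with two genuine components plus noise.
rng = np.random.default_rng(42)
F = rng.normal(size=(300, 2))
X = F @ rng.normal(size=(2, 8)) + 0.3 * rng.normal(size=(300, 8))
print(parallel_analysis(X))  # typically 2
```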

## **1.4 Sparse PCA**

PCA has many advantages. In terms of maximizing variance in $q$ dimensions, PCA provides the best possible representation of an $m$-dimensional data set in $q$ dimensions ($q < m$). However, the new variables it defines are often linear functions of all $m$ original variables, which is a downside: for larger $m$, components typically involve many variables with non-trivial coefficients, making them difficult to interpret. A number of PCA adjustments have been proposed to make the $q$ dimensions easier to interpret while limiting the loss of variance that results from not using the principal components themselves. There is thus a compromise between interpretability and variance. Two types of adjustments are briefly outlined below.

Factor analysis is a method that is often combined with PCA, and it inspires the concept of rotating principal components [16]. Assume that $A_q$ is the $m \times q$ matrix whose columns are the loadings of the first $q$ principal components. Then $XA_q$ is the $n \times q$ matrix whose columns are the scores of the first $q$ principal components for the $n$ observations. Let $T$ be an orthogonal $q \times q$ matrix. Multiplying $A_q$ by $T$ performs an orthogonal rotation of the axes within the space spanned by the first $q$ principal components, resulting in $B_q = A_q T$, an $m \times q$ matrix whose columns are the loadings of the $q$ rotated principal components; $XB_q$ is the $n \times q$ matrix containing the corresponding scores of the rotated components. Any orthogonal matrix $T$ can be used to rotate the components, but it is preferable to make the rotated components easy to understand, so $T$ is chosen to maximize simplicity. A variety of such criteria have been proposed, some of which involve non-orthogonal rotation. Probably the most commonly used criterion chooses the orthogonal matrix $T$ that maximizes

$$Q = \sum_{k=1}^{q}\left[\sum_{j=1}^{m} b_{jk}^{4} - \frac{1}{m}\left(\sum_{j=1}^{m} b_{jk}^{2}\right)^{2}\right]$$

where $b_{jk}$ is the $(j,k)^{th}$ element of $B_q$.

No variance is lost when considering the rotated $q$-dimensional space, since the sum of the variances of the $q$ rotated components is the same as the sum of the variances of the unrotated components. What is lost is the successive maximization property of the unrotated principal components: the rotated components no longer account for successively maximal shares of variance. A disadvantage of rotation is the necessary choice between different rotation criteria, although this choice often makes less difference than the choice of the number of components to rotate. Another drawback is that if $q$ is increased by 1, all the rotated components may look substantially different, whereas this cannot happen with principal components, whose unrotated definitions are nested.
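The criterion $Q$ above is the varimax criterion, and the classical iterative SVD-based algorithm for maximizing it can be sketched as follows (a simplified illustration, not a production implementation):

```python
import numpy as np

def varimax(A_q: np.ndarray, max_iter: int = 100, tol: float = 1e-8):
    """Rotate the loading matrix A_q by an orthogonal T maximizing Q."""
    m, q = A_q.shape
    T = np.eye(q)
    d = 0.0
    for _ in range(max_iter):
        B = A_q @ T
        # Classical varimax update: SVD of the criterion's gradient term.
        U, S, Vt = np.linalg.svd(
            A_q.T @ (B**3 - B @ np.diag((B**2).sum(axis=0)) / m)
        )
        T = U @ Vt
        d_new = S.sum()
        if d_new < d * (1 + tol):  # stop when the criterion stalls
            break
        d = d_new
    return A_q @ T, T

# Usage: B_q, T = varimax(A_q), where A_q holds the first q loading vectors.
```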

Another method of simplifying the principal components is to constrain the loadings of the new variables. There are several variants of this strategy, one of which uses LASSO (least absolute shrinkage and selection operator) linear regression [17]. In this approach, the so-called SCoTLASS components are found by solving the same optimization problem as PCA, but with the additional constraint $\sum_{j=1}^{m} |c_{jk}| \le \tau$, where $\tau$ is a tuning parameter. The constraint has no effect for $\tau \ge \sqrt{m}$, and ordinary principal components are obtained; at lower values, more loadings are driven to zero, which simplifies interpretation. These simplified components necessarily have less variance than the corresponding principal components, and multiple values of $\tau$ are often examined to find a reasonable compromise between added simplicity and loss of variance. One distinction between the rotation and constraint techniques is that the latter sets some loadings exactly to zero, which aids interpretation, whereas this is usually not the case with rotation. Sparse variants of PCA are adjustments in which many coefficients are exactly zero, and numerous studies of such principal components have been conducted in recent years. Hastie et al. [18] provide a good overview of this work.
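For the sparse variants, scikit-learn's `SparsePCA` offers an $\ell_1$-penalized formulation in the spirit of the work surveyed in [18] (it is not SCoTLASS itself); a minimal usage sketch:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, :3] += rng.normal(size=(200, 1)) * 3  # one strong common factor

# alpha controls sparsity: larger values push more loadings to exact zero,
# trading explained variance for interpretability.
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

print(np.round(spca.components_, 2))  # typically many exact zeros
print(np.round(pca.components_, 2))   # generally all non-zero
```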

## **1.5 Robust PCA**

PCA is inherently sensitive to the occurrence of outliers and thus to large errors in data sets [19]. As a result, efforts have been made to define robust variants of PCA, and the terminology RPCA has been used to refer to several approaches to this problem. Huber's early work focused on robust alternatives to covariance or correlation matrices and how they could be used to generate robust principal components [20]. The demand for methods capable of processing very large data sets sparked renewed interest in robust PCA variants, leading to new lines of RPCA research, especially in areas such as machine learning, image processing, and web data analysis.

Wright et al. [21] defined RPCA through the decomposition of an $n \times m$ data matrix $X$ into the sum of two $n \times m$ components: a low-rank component $L$ and a sparse component $S$. Identifying the components of $X = L + S$ that minimize a weighted combination of two different norms was formulated as a convex optimization task:

$$\min_{L,S} \|L\|_{*} + \lambda \|S\|_{1} \tag{8}$$

where $\|L\|_{*} = \sum_r \sigma_r(L)$ is the nuclear norm of $L$, i.e. the sum of its singular values $\sigma_r(L)$, and $\|S\|_{1} = \sum_i \sum_j |s_{ij}|$ is the $\ell_1$ norm of the matrix $S$.
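A minimal sketch of solving Eq. (8) with the widely used augmented Lagrangian (ADMM-style) iteration follows; the parameter heuristics below are common conventions, not taken from [21]:

```python
import numpy as np

def rpca_pcp(X, lam=None, tol=1e-7, max_iter=500):
    """Split X into low-rank L and sparse S by minimizing
    ||L||_* + lam * ||S||_1 subject to L + S = X (Eq. (8))."""
    n, m = X.shape
    lam = lam or 1.0 / np.sqrt(max(n, m))  # common default for lambda
    mu = n * m / (4.0 * np.abs(X).sum())   # common penalty heuristic
    L = np.zeros_like(X); S = np.zeros_like(X); Y = np.zeros_like(X)
    shrink = lambda A, t: np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

    for _ in range(max_iter):
        # Singular value thresholding step for the low-rank part.
        U, sig, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt
        # Soft thresholding step for the sparse part.
        S = shrink(X - L + Y / mu, lam / mu)
        # Dual update on the constraint residual X - L - S.
        resid = X - L - S
        Y += mu * resid
        if np.linalg.norm(resid) <= tol * np.linalg.norm(X):
            break
    return L, S

# Example: a low-rank matrix corrupted by sparse spikes.
rng = np.random.default_rng(0)
L0 = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 50))
S0 = (rng.random((100, 50)) < 0.05) * rng.normal(scale=10, size=(100, 50))
L, S = rpca_pcp(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))  # typically small
```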
