**4. Development**

#### **4.1 Principal component analysis (PCA)**

Principal component analysis (PCA) is a multivariate statistical technique focused on explaining the variance–covariance structure of a data set; its main objectives are to reduce the dimensionality of the problem and to improve the interpretation of the data [26–28]. Still according to [27], PCA usually reveals relations that would not have been identified through an analysis of the original set of data and variables alone, enabling a more comprehensive interpretation of the phenomenon under study.

To examine a patent application, several measured variables of each patent application of the study population must be considered. Principal component analysis applies a transformation to these variables so that the new components obtained enable a better breakdown and analysis of the elements of the population. In [29], it is shown that this new perspective is of great value when creating a typology for the population, classifying its elements according to certain criteria, and so on. According to Vicini [30], "In practice, the algorithm is based on the variance-covariance matrix, or on the correlation matrix, from which the eigenvalues and eigenvectors are extracted" and "finally, writing the linear combinations, which will be the new variables, referred to as principal components".

It is important to note that the PCA is widely employed, evidencing its efficiency and robustness in applications in several fields of knowledge, such as agronomy, zootechnics, medicine, among others [31–34]. Examples of practical applications of the PCA for evaluation of public services in Brazilian states, assessment of the regional development of cities in the Brazilian state of Santa Catarina, as well as analysis of crime statistics in U.S. states can be found, respectively, in the papers [35–37].

The following steps are necessary for determining the principal components: standardize the original variables; compute the variance–covariance matrix of the standardized data; extract the eigenvalues and eigenvectors of that matrix; and write the principal components as linear combinations of the standardized variables. These steps are detailed through Eqs. (1)-(9) below.

Eqs. (1) and (2) below show, respectively, the calculation for standardization of the variables and matrix Z of standardized variables.

$$Z\_{ij} = \frac{X\_{ij} - \mu\_j}{\sigma\_j} \tag{1}$$

$$Z = \begin{bmatrix} Z\_{11} & Z\_{12} & \dots & Z\_{1p} \\ Z\_{21} & Z\_{22} & \dots & Z\_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ Z\_{n1} & Z\_{n2} & \dots & Z\_{np} \end{bmatrix} \tag{2}$$

Where: Xij is the element of the original data matrix corresponding to individual i and variable j; μj is the mean of variable j; and σj is the standard deviation of variable j.
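
For illustration, a minimal NumPy sketch of Eqs. (1) and (2); the matrix X below is a small hypothetical example (n = 4 applications, p = 3 variables), not data from the study:

```python
import numpy as np

# Hypothetical data matrix X: n = 4 patent applications (rows) x p = 3 measured variables (columns).
X = np.array([
    [12.0, 3.0, 45.0],
    [ 7.0, 1.0, 30.0],
    [20.0, 5.0, 60.0],
    [15.0, 2.0, 38.0],
])

# Eq. (1): standardize each variable (column) to zero mean and unit standard deviation.
mu = X.mean(axis=0)      # mean of each variable
sigma = X.std(axis=0)    # population standard deviation (divisor n), matching Eq. (3)
Z = (X - mu) / sigma     # Eq. (2): matrix Z of standardized variables

print(Z.round(3))
```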

Eqs. (3)-(5) below show, respectively, the calculation of the variances, covariances, and matrix S.

$$\text{VAR}\left[Z\_{j}\right] = \frac{1}{n} \sum\_{k=1}^{n} \left(Z\_{jk} - \mu\_{j}\right)^{2} \tag{3}$$

$$\text{COV}\left[Z\_j; Z\_{j'}\right] = \frac{1}{n} \sum\_{k=1}^{n} \left(Z\_{jk} - \mu\_j\right)\left(Z\_{j'k} - \mu\_{j'}\right) \tag{4}$$

$$S = \begin{bmatrix} \text{VAR}[Z\_1] & \text{COV}[Z\_1; Z\_2] & \dots & \text{COV}[Z\_1; Z\_p] \\ \text{COV}[Z\_2; Z\_1] & \text{VAR}[Z\_2] & \dots & \text{COV}[Z\_2; Z\_p] \\ \vdots & \vdots & \ddots & \vdots \\ \text{COV}[Z\_p; Z\_1] & \text{COV}[Z\_p; Z\_2] & \dots & \text{VAR}[Z\_p] \end{bmatrix} = \begin{bmatrix} 1 & \text{COV}[Z\_1; Z\_2] & \dots & \text{COV}[Z\_1; Z\_p] \\ \text{COV}[Z\_2; Z\_1] & 1 & \dots & \text{COV}[Z\_2; Z\_p] \\ \vdots & \vdots & \ddots & \vdots \\ \text{COV}[Z\_p; Z\_1] & \text{COV}[Z\_p; Z\_2] & \dots & 1 \end{bmatrix} \tag{5}$$

Where: VAR[Zj] is the variance of the standardized variable Zj; COV[Zj; Zj'] is the covariance of the standardized variables Zj and Zj'; n is the number of individuals; μj and μj' are, respectively, the means of the standardized variables Zj and Zj'; and Zjk are the elements of the data matrix.
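
A sketch of Eqs. (3)-(5) under the same assumptions, using NumPy's covariance routine with the divisor n adopted in the text (the matrix Z below contains hypothetical, approximately standardized values):

```python
import numpy as np

# Hypothetical matrix of (approximately) standardized variables: n = 4 individuals x p = 3 variables.
Z = np.array([
    [ 0.5,  0.3,  0.6],
    [-1.5, -1.3, -1.2],
    [ 1.1,  1.4,  1.1],
    [-0.1, -0.4, -0.5],
])

# Eqs. (3)-(5): variance-covariance matrix S of the standardized data, with divisor n (bias=True).
S = np.cov(Z, rowvar=False, bias=True)

print(S.round(3))   # the diagonal stays close to 1 because the variables are standardized
```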

Eqs. (6) and (7) below allow the eigenvalues and eigenvectors of matrix S to be determined; matrix V, whose columns are the eigenvectors of S, is then assembled according to Eq. (8).

$$\left| S - \lambda I \right| = 0 \tag{6}$$

$$S \cdot V = \lambda \cdot V \tag{7}$$

$$V = \begin{bmatrix} V\_1 & V\_2 & \dots & V\_p \end{bmatrix} = \begin{bmatrix} v\_{11} & v\_{12} & \dots & v\_{1p} \\ v\_{21} & v\_{22} & \dots & v\_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ v\_{p1} & v\_{p2} & \dots & v\_{pp} \end{bmatrix} \tag{8}$$

Where: S is the variance–covariance matrix, of dimension p×p, of the standardized data; λ is one of the p eigenvalues of matrix S; I is the identity matrix of order p; p is the total number of standardized variables; each Vi is an eigenvector of S, of dimension p×1; and V is the p×p matrix whose columns are these eigenvectors.
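
A sketch of Eqs. (6)-(8): for a symmetric matrix S, the eigenvalues and eigenvectors can be obtained numerically rather than by solving the characteristic polynomial directly; the matrix S below is a hypothetical example:

```python
import numpy as np

# Hypothetical variance-covariance matrix S of standardized data (p = 3).
S = np.array([
    [1.00, 0.92, 0.97],
    [0.92, 1.00, 0.88],
    [0.97, 0.88, 1.00],
])

# Eqs. (6)-(8): eigenvalues and eigenvectors of the symmetric matrix S.
eigenvalues, V = np.linalg.eigh(S)       # eigh is suited to symmetric matrices

# Order by descending eigenvalue, as required for the principal components.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
V = V[:, order]                          # columns of V are the eigenvectors V1, ..., Vp

print(eigenvalues.round(3))
print(V.round(3))
```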

Once the eigenvectors associated with the eigenvalues (ordered in descending order) have been obtained, the principal components (Y1, Y2, …, Yp) for each of the n individuals under analysis are determined through linear combinations of the standardized variables, with coefficients given by the elements of the corresponding eigenvectors. Therefore, the components of individual n can be written in the form of the following equation:

$$Y\_{i(n)} = Z\_{n1} \cdot v\_{1i} + Z\_{n2} \cdot v\_{2i} + \dots + Z\_{np} \cdot v\_{pi} \tag{9}$$

Where: i ranges from 1 to p; Yi(n) is component i of individual n; Znj are the elements of row n of the matrix Z of standardized variables; and vji are the elements of eigenvector Vi.
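
Eq. (9) is simply the matrix product of Z and V. A minimal sketch, using hypothetical (rounded) standardized data and eigenvectors:

```python
import numpy as np

# Hypothetical standardized data (n = 4 individuals x p = 3 variables).
Z = np.array([
    [ 0.5,  0.3,  0.6],
    [-1.5, -1.3, -1.2],
    [ 1.1,  1.4,  1.1],
    [-0.1, -0.4, -0.5],
])

# Hypothetical eigenvectors of S as columns (rounded, approximately orthonormal),
# ordered by descending eigenvalue.
V = np.array([
    [ 0.58, -0.42,  0.70],
    [ 0.57,  0.82,  0.00],
    [ 0.58, -0.40, -0.71],
])

# Eq. (9): Y[n, i] = sum_j Z[n, j] * v[j, i], i.e. component i of individual n.
Y = Z @ V

print(Y.round(3))
```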


It is important to highlight that, in PCA, the contribution of each principal component (Y1, Y2, …, Yp) is measured in terms of its variance. This contribution is calculated as the ratio of the variance of the component under analysis to the sum of the variances of all components, yielding the proportion (or percentage) of total variance explained by each component. Eq. (10) below shows how to calculate each contribution Ci (C1, C2, …, Cp).

$$C\_i = \frac{\text{VAR}(Y\_i)}{\sum\_{j=1}^p \text{VAR}(Y\_j)} \cdot 100 = \frac{\lambda\_i}{\sum\_{j=1}^p \lambda\_j} \cdot 100 \tag{10}$$

Where: Ci is the contribution, or percentage of the total variance explained, of component Yi; VAR(Yi) is the variance of principal component Yi; λi is the eigenvalue associated with Yi; and p is the total number of standardized variables.
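
A sketch of Eq. (10), assuming hypothetical eigenvalues for p = 3 standardized variables (their sum equals p):

```python
import numpy as np

# Hypothetical eigenvalues of S in descending order (p = 3, so they sum to 3).
eigenvalues = np.array([2.86, 0.12, 0.02])

# Eq. (10): contribution (percentage of total explained variance) of each component.
C = 100.0 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(C)

for i, (c, acc) in enumerate(zip(C, cumulative), start=1):
    print(f"Y{i}: {c:5.1f}%  (cumulative {acc:5.1f}%)")
```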

It should be noted that further details about the formal mathematical formulation of PCA and its properties, as well as about all the linear algebra used, may be found in the already mentioned works [26–37].

#### **4.2 Selection criteria for principal components**

Kaiser's criterion [38] is the most widely used to date. According to this criterion, only components associated with eigenvalues greater than one, i.e., λi > 1, are considered principal components. However, practical cases show that this selection criterion alone may not properly represent the majority of the total data variance.

A second option is to perform a graphical analysis and look for the largest differences between consecutive eigenvalues. The Cattell criterion [39], for example, suggests plotting the magnitude of the eigenvalues against their order. The number of components is then selected at the breaking point of the graph, which occurs where there is a sharp drop in the magnitude of the eigenvalues [40].

A third possible criterion, also widely used, is to adopt a reference value for the proportion of variance explained by the principal components. Following this logic, as many principal components are selected as necessary for their cumulative percentage of explained variance to exceed this reference value. It is important to highlight that there is no consensus among researchers about which percentage should be used, and several practical examples exist. A large share of the applications uses a limit of 70%. In [40], the problem was ranked in levels of acceptance, and values between 62% and 80% were considered reasonable or "partially good".

Although each criterion has advantages and disadvantages, in this work a combination of the three criteria described above was adopted. As a reference value for the third criterion, a cumulative explained variance of at least 60% was considered suitable for selecting the most representative principal components.
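
As an illustration of how the first and third criteria could be combined in code (the scree-plot inspection of the second criterion remains a visual step), the sketch below keeps the larger of the two component counts; the function select_components and its 60% default are illustrative choices, not prescribed by the methodology:

```python
import numpy as np

def select_components(eigenvalues, min_cumulative=60.0):
    """Return the number k of components retained and the cumulative explained variance.

    Combines the Kaiser criterion (eigenvalue > 1) with a minimum cumulative
    percentage of explained variance; whichever retains more components wins.
    """
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(100.0 * eigenvalues / eigenvalues.sum())

    k_kaiser = int(np.sum(eigenvalues > 1.0))                          # Kaiser: lambda_i > 1
    k_variance = int(np.searchsorted(cumulative, min_cumulative)) + 1  # first k reaching the threshold
    return max(k_kaiser, k_variance, 1), cumulative

k, cumulative = select_components([2.1, 1.3, 0.9, 0.4, 0.3])
print(k, cumulative.round(1))   # k = 2; cumulative = [42. 68. 86. 94. 100.]
```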

#### **4.3 General complexity ratio (IGC)**

Hongyu et al. [31] also state that "In order to establish a ratio that enables us to order a set of n objects, according to a criterion defined by a set of m suitable variables, it is necessary to choose the weights of the variables so that they translate the information contained in them," provided that, to create a ratio as a linear combination of variables, "it is desirable that this ratio includes the maximum possible information of the set of variables selected for study". According to Sandanielo [41] (*apud* HONGYU, 2015), "a method that creates linear combinations with maximum variance is the principal component analysis".

In this context, this work intends, as a first evaluation, to use a one-dimensional ratio based on the most significant principal components (carefully selected using the criteria addressed in item 4.2), weighted by their corresponding eigenvalues. Hence, a General Complexity Ratio (IGC) is defined for the patent applications to be evaluated, according to the following equation:

$$\text{IGC}\_n = \frac{\sum\_{i=1}^k Y\_i \lambda\_i}{\sum\_{i=1}^k \lambda\_i} \tag{11}$$

Where: Yi are the principal components calculated for application n; λi are the corresponding eigenvalues; k is the number of principal components selected; and n identifies the patent application evaluated.
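
A minimal sketch of Eq. (11), assuming k = 2 selected components and hypothetical values for the components and eigenvalues:

```python
import numpy as np

# Hypothetical principal components: 4 applications (rows) x k = 2 selected components (columns).
Y = np.array([
    [-0.14, -0.22],
    [-2.44,  0.17],
    [ 2.57, -0.23],
    [ 0.01,  0.28],
])
eigenvalues = np.array([2.86, 0.12])   # eigenvalues of the k selected components

# Eq. (11): IGC of each application as the eigenvalue-weighted average of its selected components.
igc = (Y @ eigenvalues) / eigenvalues.sum()

print(igc.round(3))
```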

#### **4.4 Classification of the patent applications into classes**

To group data according to the calculated IGC ratio, the first step was to standardize the IGC values using Eq. (1), so that the standard deviation of the IGC sample is always equal to one. Based on the standardized IGC data for all patent applications under examination, classification ranges were then defined as shown in **Figure 1**.
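
The classification step could be sketched as below; the IGC values and the class boundaries here are purely hypothetical, since the actual ranges are those defined in Figure 1:

```python
import numpy as np

# Hypothetical IGC values for the applications under examination.
igc = np.array([-0.05, -0.83, 0.86, 0.01])

# Standardize the IGC sample with Eq. (1): zero mean, standard deviation equal to one.
igc_std = (igc - igc.mean()) / igc.std()

# Hypothetical class boundaries, in standard-deviation units, for illustration only.
bounds = np.array([-1.0, 0.0, 1.0])
labels = ["Class A", "Class B", "Class C", "Class D"]
classes = [labels[np.searchsorted(bounds, v)] for v in igc_std]

print(list(zip(igc_std.round(2), classes)))
```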

#### **4.5 Mathematical model: diagram and evaluation**

After calculating the IGC ratio and classifying the applications, the complete model for evaluating patent applications is created. A diagram of the model is presented in **Figure 2**.


**Figure 1.**

*Classification of the patent applications into classes.*

**Figure 2.** *Diagram of the model.*


Thus, the next step is to enable its evaluation through a sensitivity analysis and through correlations with the time taken to examine the applications. This evaluation will be carried out by preparing a form about examination time, in order to obtain a new Standard Sample of patent applications with their associated examination times, in addition to several simulations considering different numbers of variables and sample sizes. **Figure 3** shows the template form developed.
