**2. Linear separability vs. linear dependence**

Let us assume that each of *m* objects *O*<sub>j</sub> from a given database is represented by the *n*-dimensional feature vector **x**<sub>j</sub> = [x<sub>j,1</sub>, ..., x<sub>j,n</sub>]<sup>T</sup> belonging to the feature space *F*[*n*] (**x**<sub>j</sub> ∈ *F*[*n*]). The data set *C* consists of *m* such feature vectors **x**<sub>j</sub>:

$$C = \{ \mathbf{x}_j \}, \text{ where } j = 1, \dots, m \tag{1}$$

The components x<sub>j,i</sub> of the feature vector **x**<sub>j</sub> are numerical values (x<sub>j,i</sub> ∈ *R* or x<sub>j,i</sub> ∈ {0, 1}) of the individual features *X*<sub>i</sub> of the *j*-th object *O*<sub>j</sub>. In this context, each feature vector **x**<sub>j</sub> (**x**<sub>j</sub> ∈ *F*[*n*]) represents *n* features *X*<sub>i</sub> belonging to the feature set *F*(*n*) = {*X*<sub>1</sub>, …, *X*<sub>n</sub>}.

The pairs {*G*<sub>k</sub><sup>+</sup>, *G*<sub>k</sub><sup>−</sup>} (*k* = 1, …, *K*) of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (*G*<sub>k</sub><sup>+</sup> ∩ *G*<sub>k</sub><sup>−</sup> = ∅) are formed from some feature vectors **x**<sub>j</sub> selected from the data set *C* (1):

$$G_k^+ = \{ \mathbf{x}_j : j \in J_k^+ \} \text{ and } G_k^- = \{ \mathbf{x}_j : j \in J_k^- \} \tag{2}$$

where *J*<sub>k</sub><sup>+</sup> and *J*<sub>k</sub><sup>−</sup> are non-empty sets of indices *j* of vectors **x**<sub>j</sub> (*J*<sub>k</sub><sup>+</sup> ∩ *J*<sub>k</sub><sup>−</sup> = ∅).

The *positive* learning set *G*<sub>k</sub><sup>+</sup> is composed of *m*<sub>k</sub><sup>+</sup> feature vectors **x**<sub>j</sub> (*j* ∈ *J*<sub>k</sub><sup>+</sup>). Similarly, the *negative* learning set *G*<sub>k</sub><sup>−</sup> is composed of *m*<sub>k</sub><sup>−</sup> feature vectors **x**<sub>j</sub> (*j* ∈ *J*<sub>k</sub><sup>−</sup>), where *m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup> ≤ *m*.

The possibility of separating the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) with a hyperplane *H*(**w**<sub>k</sub>, θ<sub>k</sub>) in the feature space *F*[*n*] is investigated in pattern recognition [1]:

$$H(\mathbf{w}_k, \theta_k) = \{ \mathbf{x} : \mathbf{w}_k^T \mathbf{x} = \theta_k \} \tag{3}$$

where **w**<sub>k</sub> = [w<sub>k,1</sub>, ..., w<sub>k,n</sub>]<sup>T</sup> ∈ *R*<sup>n</sup> is the weight vector, θ<sub>k</sub> ∈ *R*<sup>1</sup> is the threshold, and **w**<sub>k</sub><sup>T</sup>**x** = Σ<sub>i</sub> w<sub>k,i</sub> x<sub>i</sub> is the scalar product.

*Definition* 1: The learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) are *linearly separable* in the feature space *F*[*n*] if and only if there exist a weight vector **w**<sub>k</sub> (**w**<sub>k</sub> ∈ *R*<sup>n</sup>) and a threshold θ<sub>k</sub> (θ<sub>k</sub> ∈ *R*<sup>1</sup>) such that the hyperplane *H*(**w**<sub>k</sub>, θ<sub>k</sub>) (3) separates these sets [7]:

$$(\exists \mathbf{w}_k, \theta_k) \; (\forall \mathbf{x}_j \in G_k^+) \; \mathbf{w}_k^T \mathbf{x}_j \ge \theta_k + 1 \; \text{ and} \tag{4}$$

$$(\forall \mathbf{x}_j \in G_k^-) \; \mathbf{w}_k^T \mathbf{x}_j \le \theta_k - 1$$

According to the above inequalities, all vectors **x**<sub>j</sub> from the learning set *G*<sub>k</sub><sup>+</sup> (2) are located on the positive side of the hyperplane *H*(**w**<sub>k</sub>, θ<sub>k</sub>) (3), and all vectors **x**<sub>j</sub> from the set *G*<sub>k</sub><sup>−</sup> lie on the negative side of this hyperplane.
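The conditions (4) translate directly into a numerical check. The following minimal sketch (Python/NumPy; the function name is illustrative, not from the source) verifies whether a given pair (**w**<sub>k</sub>, θ<sub>k</sub>) separates two learning sets stored as row matrices:

```python
import numpy as np

def is_linearly_separated(G_plus, G_minus, w_k, theta_k):
    """Check the linear separability inequalities (4).

    G_plus and G_minus hold the feature vectors x_j of the learning
    sets G_k^+ and G_k^- as their rows.
    """
    return (np.all(G_plus @ w_k >= theta_k + 1.0)
            and np.all(G_minus @ w_k <= theta_k - 1.0))
```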

The hyperplane *H*(**w**<sub>k</sub>, θ<sub>k</sub>) (3) separates (4) the sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) with the following margin δ<sub>L2</sub>(**w**<sub>k</sub>), based on the Euclidean (*L*<sub>2</sub>) norm, which is used in the Support Vector Machines (SVM) method [12]:

$$\delta_{L2}(\mathbf{w}_k) = 2 / \|\mathbf{w}_k\|_{L2} = 2 / (\mathbf{w}_k^T \mathbf{w}_k)^{1/2} \tag{5}$$

where ||**w**<sub>k</sub>||<sub>L2</sub> = (**w**<sub>k</sub><sup>T</sup>**w**<sub>k</sub>)<sup>1/2</sup> is the Euclidean length of the weight vector **w**<sub>k</sub>.

The margin δ<sub>L1</sub>(**w**<sub>k</sub>) with the *L*<sub>1</sub> norm, related to the hyperplane *H*(**w**<sub>k</sub>, θ<sub>k</sub>) (3) that separates (4) the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2), is determined by analogy with (5) as [11]:

$$\delta_{L1}(\mathbf{w}_k) = 2 / \|\mathbf{w}_k\|_{L1} = 2 / (|w_{k,1}| + \dots + |w_{k,n}|) \tag{6}$$

where ||**w**<sub>k</sub>||<sub>L1</sub> = |w<sub>k,1</sub>| + ... + |w<sub>k,n</sub>| is the *L*<sub>1</sub> length of the weight vector **w**<sub>k</sub>.

The margins δ<sub>L2</sub>(**w**<sub>k</sub>) (5) or δ<sub>L1</sub>(**w**<sub>k</sub>) (6) are maximized to improve the generalization properties of linear classifiers designed from the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) [7].
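Both margins are computed directly from a weight vector. A short sketch of Eqs. (5) and (6), with illustrative function names:

```python
import numpy as np

def margin_L2(w_k):
    """Euclidean margin delta_L2(w_k) of Eq. (5)."""
    return 2.0 / np.sqrt(w_k @ w_k)

def margin_L1(w_k):
    """L1-norm margin delta_L1(w_k) of Eq. (6)."""
    return 2.0 / np.sum(np.abs(w_k))
```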

The following set of *m*<sub>k</sub><sup>0</sup> = *m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup> linear equations can be formulated on the basis of the linear separability inequalities (4):

$$(\forall j \in J_k^+) \; \mathbf{x}_j^T \mathbf{w}_k = \theta_k + 1 \; \text{ and} \tag{7}$$

$$(\forall j \in J_k^-) \; \mathbf{x}_j^T \mathbf{w}_k = \theta_k - 1$$

If we assume that the threshold θ<sub>k</sub> can be determined later, then we have *n* unknown weights w<sub>k,i</sub> (**w**<sub>k</sub> = [w<sub>k,1</sub>, ..., w<sub>k,n</sub>]<sup>T</sup>) in an underdetermined system of *m*<sub>k</sub><sup>0</sup> = *m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup> (*m*<sub>k</sub><sup>0</sup> ≤ *m* < *n*) linear Eqs. (7). In order to obtain a system of *n* linear equations with *n* unknown weights w<sub>k,i</sub>, additional linear equations based on *n* − *m*<sub>k</sub><sup>0</sup> selected unit vectors **e**<sub>i</sub> (*i* ∈ *I*<sub>k</sub>) are taken into account [6]:

$$(\forall i \in I_k) \; \mathbf{e}_i^T \mathbf{w}_k = 0 \tag{8}$$

The parameter vertex **w**<sub>k</sub> = [w<sub>k,1</sub>, ..., w<sub>k,n</sub>]<sup>T</sup> can be determined from the linear Eqs. (7) and (8) if the feature vectors **x**<sub>j</sub> forming the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) are linearly independent [7].

The feature vector **x**<sub>j′</sub> (**x**<sub>j′</sub> ∈ *G*<sub>k</sub><sup>+</sup> ∪ *G*<sub>k</sub><sup>−</sup> (2)) is a linear combination of some other vectors **x**<sub>j(i)</sub> (*j*(*i*) ≠ *j*′) from the learning sets (2) if there exist parameters α<sub>j′,i</sub> (α<sub>j′,i</sub> ≠ 0) such that the following relation holds:

$$\mathbf{x}_{j'} = \alpha_{j',1} \mathbf{x}_{j(1)} + \dots + \alpha_{j',l} \mathbf{x}_{j(l)} \tag{9}$$

*Definition* 2: Feature vectors **x**<sub>j</sub> making up the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) are linearly independent if none of these vectors **x**<sub>j′</sub> (**x**<sub>j′</sub> ∈ *G*<sub>k</sub><sup>+</sup> ∪ *G*<sub>k</sub><sup>−</sup>) can be expressed as a linear combination (9) of *l* (*l* ∈ {1, …, *m* − 1}) other vectors **x**<sub>j(l)</sub> from the learning sets.

If the number *m*<sub>k</sub><sup>0</sup> = *m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup> of elements **x**<sub>j</sub> of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) does not exceed the dimension *n* of the feature space *F*[*n*] (*m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup> ≤ *n*), then the parameter vertex **w**<sub>k</sub>(θ<sub>k</sub>) can be defined by the linear equations in the following matrix form [13]:

$$B_k \, \mathbf{w}_k(\theta_k) = \mathbf{1}_k(\theta_k) \tag{10}$$

where

$$\mathbf{1}_k(\theta_k) = [\theta_k + 1, \dots, \theta_k + 1, \theta_k - 1, \dots, \theta_k - 1, 0, \dots, 0]^T \tag{11}$$

and

$$B_k = [\mathbf{x}_1, \dots, \mathbf{x}_{m_k^0}, \mathbf{e}_{i(m_k^0 + 1)}, \dots, \mathbf{e}_{i(n)}]^T \tag{12}$$

The first *m*<sub>k</sub><sup>+</sup> components of the vector **1**<sub>k</sub>(θ<sub>k</sub>) (11) are equal to θ<sub>k</sub> + 1, the next *m*<sub>k</sub><sup>−</sup> components are equal to θ<sub>k</sub> − 1, and the last *n* − *m*<sub>k</sub><sup>+</sup> − *m*<sub>k</sub><sup>−</sup> components are equal to 0. The first *m*<sub>k</sub><sup>+</sup> rows of the square matrix *B*<sub>k</sub> (12) are formed by the feature vectors **x**<sub>j</sub> (*j* ∈ *J*<sub>k</sub><sup>+</sup>) from the set *G*<sub>k</sub><sup>+</sup> (2), the next *m*<sub>k</sub><sup>−</sup> rows are formed by the vectors **x**<sub>j</sub> (*j* ∈ *J*<sub>k</sub><sup>−</sup>) from the set *G*<sub>k</sub><sup>−</sup> (2), and the last *n* − *m*<sub>k</sub><sup>+</sup> − *m*<sub>k</sub><sup>−</sup> rows are made up of the unit vectors **e**<sub>i</sub> (*i* ∈ *I*<sub>k</sub>).

If the matrix *B*<sub>k</sub> (12) is non-singular, then there exists the inverse matrix *B*<sub>k</sub><sup>−1</sup>:

$$B_k^{-1} = [\mathbf{r}_1, \dots, \mathbf{r}_{m_k^0}, \mathbf{r}_{i(m_k^0 + 1)}, \dots, \mathbf{r}_{i(n)}] \tag{13}$$

In this case, the parameter vertex **w**<sub>k</sub>(θ<sub>k</sub>) (10) can be defined by the following equation:

$$\mathbf{w}_k(\theta_k) = B_k^{-1} \mathbf{1}_k(\theta_k) = (\theta_k + 1)\, \mathbf{r}_k^+ + (\theta_k - 1)\, \mathbf{r}_k^- = \tag{14}$$

$$= \theta_k \, (\mathbf{r}_k^+ + \mathbf{r}_k^-) + (\mathbf{r}_k^+ - \mathbf{r}_k^-)$$

where the vector **r**<sub>k</sub><sup>+</sup> is the sum of the first *m*<sub>k</sub><sup>+</sup> columns **r**<sub>i</sub> of the inverse matrix *B*<sub>k</sub><sup>−1</sup> (13), and the vector **r**<sub>k</sub><sup>−</sup> is the sum of the next *m*<sub>k</sub><sup>−</sup> columns **r**<sub>i</sub> of this matrix.
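The construction (10)-(14) can be sketched in a few lines. The snippet below (illustrative names; it assumes *m*<sub>k</sub><sup>0</sup> ≤ *n*, a non-singular *B*<sub>k</sub>, and one simple choice of the index set *I*<sub>k</sub>) builds *B*<sub>k</sub>, inverts it, and forms the vertex **w**<sub>k</sub>(θ<sub>k</sub>) from **r**<sub>k</sub><sup>+</sup> and **r**<sub>k</sub><sup>−</sup>:

```python
import numpy as np

def vertex_w_k(G_plus, G_minus, theta_k):
    """Vertex w_k(theta_k) of Eq. (14) for row matrices G_plus, G_minus."""
    m_plus, n = G_plus.shape
    m0 = m_plus + G_minus.shape[0]          # m_k^0 = m_k^+ + m_k^-, assumed <= n
    # Rows of B_k (12): the feature vectors, then n - m0 unit vectors e_i.
    # Taking the last n - m0 rows of the identity is one simple choice of I_k.
    B_k = np.vstack([G_plus, G_minus, np.eye(n)[m0:]])
    B_inv = np.linalg.inv(B_k)              # exists when B_k is non-singular (13)
    # r_k^+ / r_k^-: sums of the first m_k^+ / next m_k^- columns of B_k^{-1}.
    r_plus = B_inv[:, :m_plus].sum(axis=1)
    r_minus = B_inv[:, m_plus:m0].sum(axis=1)
    # Eq. (14): w_k(theta_k) = theta_k (r^+ + r^-) + (r^+ - r^-)
    return theta_k * (r_plus + r_minus) + (r_plus - r_minus)
```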

The last *n* − (*m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup>) components w<sub>k,i</sub>(θ<sub>k</sub>) of the vector **w**<sub>k</sub>(θ<sub>k</sub>) = [w<sub>k,1</sub>(θ<sub>k</sub>), …, w<sub>k,n</sub>(θ<sub>k</sub>)]<sup>T</sup> (14), which are linked to the zero components of the vector **1**<sub>k</sub>(θ<sub>k</sub>) (11), are equal to zero:

$$(\forall i \in \{m_k^+ + m_k^- + 1, \dots, n\}) \; w_{k,i}(\theta_k) = 0 \tag{15}$$

The conditions w<sub>k,i</sub>(θ<sub>k</sub>) = 0 (15) result from the equations **e**<sub>i</sub><sup>T</sup>**w**<sub>k</sub>(θ<sub>k</sub>) = 0 (8) at the vertex **w**<sub>k</sub>(θ<sub>k</sub>) (14).

The length ||**w**<sub>k</sub>(θ<sub>k</sub>)||<sub>L1</sub> of the weight vector **w**<sub>k</sub>(θ<sub>k</sub>) (14) in the *L*<sub>1</sub> norm is the sum of *m*<sub>k</sub><sup>0</sup> = *m*<sub>k</sub><sup>+</sup> + *m*<sub>k</sub><sup>−</sup> components |w<sub>k,i</sub>(θ<sub>k</sub>)|:

$$\|\mathbf{w}_k(\theta_k)\|_{L1} = |w_{k,1}(\theta_k)| + \dots + |w_{k,m_k^0}(\theta_k)| \tag{16}$$

In accordance with Eq. (14), the components |w<sub>k,i</sub>(θ<sub>k</sub>)| can be determined as follows:

$$(\forall i \in \{1, \dots, m_k^+ + m_k^-\}) \; |w_{k,i}(\theta_k)| = |\theta_k \, (r_{k,i}^+ + r_{k,i}^-) + (r_{k,i}^+ - r_{k,i}^-)| \tag{17}$$

The length ||**w**<sub>k</sub>(θ<sub>k</sub>)||<sub>L1</sub> (16) of the vector **w**<sub>k</sub>(θ<sub>k</sub>) (14) with the *L*<sub>1</sub> norm is minimized in order to increase the margin δ<sub>L1</sub>(**w**<sub>k</sub>(θ<sub>k</sub>)) (6). The length ||**w**<sub>k</sub>(θ<sub>k</sub>)||<sub>L1</sub> (16) can be minimized by selecting the optimal threshold value θ<sub>k</sub>* on the basis of Eq. (14):

$$(\forall \theta_k) \; \delta_{L1}(\mathbf{w}_k(\theta_k^*)) \ge \delta_{L1}(\mathbf{w}_k(\theta_k)) \tag{18}$$

where the optimal vertex **w**<sub>k</sub>(θ<sub>k</sub>*) is defined by Eq. (14).
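Since each component (17) is piecewise linear in θ<sub>k</sub>, the length ||**w**<sub>k</sub>(θ<sub>k</sub>)||<sub>L1</sub> is a piecewise-linear convex function of θ<sub>k</sub> whose minimum is attained at one of its breakpoints. The following sketch (illustrative names; it assumes the vectors **r**<sub>k</sub><sup>+</sup> and **r**<sub>k</sub><sup>−</sup> of Eq. (14) are given) finds θ<sub>k</sub>* by checking all breakpoints:

```python
import numpy as np

def optimal_theta(r_plus, r_minus):
    """Threshold theta_k* (18) minimizing ||w_k(theta_k)||_L1 via Eq. (17)."""
    a = r_plus + r_minus                    # terms multiplied by theta_k in (14)
    b = r_plus - r_minus                    # constant offsets in (14)
    # Breakpoints of sum_i |theta * a_i + b_i|: theta = -b_i / a_i for a_i != 0.
    breakpoints = [-bi / ai for ai, bi in zip(a, b) if ai != 0.0]
    if not breakpoints:                     # L1 length is constant in theta_k
        return 0.0
    l1_length = lambda theta: np.sum(np.abs(theta * a + b))
    return min(breakpoints, key=l1_length)
```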

*Theorem* 1: The learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) formed by *m* (*m* ≤ *n*) linearly independent (9) feature vectors **x**<sub>j</sub> are linearly separable (4) in the feature space *F*[*n*] (**x**<sub>j</sub> ∈ *F*[*n*]).

*Proof*: If the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) are formed by *m* linearly independent feature vectors **x**<sub>j</sub>, then the non-singular matrix *B*<sub>k</sub> = [**x**<sub>1</sub>, …, **x**<sub>m</sub>, **e**<sub>i(m+1)</sub>, …, **e**<sub>i(n)</sub>]<sup>T</sup> (12) containing these *m* vectors **x**<sub>j</sub> and *n* − *m* unit vectors **e**<sub>i</sub> (*i* ∈ *I*<sub>k</sub>) can be defined [10]. In this case, the inverse matrix *B*<sub>k</sub><sup>−1</sup> (13) exists and determines the vertex **w**<sub>k</sub>(θ<sub>k</sub>) (14). The vertex equation *B*<sub>k</sub> **w**<sub>k</sub>(θ<sub>k</sub>) = **1**<sub>k</sub>(θ<sub>k</sub>) (10) can be reformulated for the feature vectors **x**<sub>j</sub> (2) as follows:

$$(\forall \mathbf{x}_j \in G_k^+) \; \mathbf{w}_k(\theta_k)^T \mathbf{x}_j = \theta_k + 1 \; \text{ and } \; (\forall \mathbf{x}_j \in G_k^-) \; \mathbf{w}_k(\theta_k)^T \mathbf{x}_j = \theta_k - 1 \tag{19}$$

The solution of Eqs. (19) satisfies the linear separability inequalities (4).
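Theorem 1 can be illustrated numerically: random Gaussian vectors with *m* < *n* are almost surely linearly independent, so the vertex built from them should satisfy (4). A hypothetical check reusing the sketches above (the tolerance accounts for floating-point rounding):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_plus, m_minus = 10, 3, 4               # m = m_plus + m_minus < n
G_plus = rng.normal(size=(m_plus, n))       # random rows are almost surely
G_minus = rng.normal(size=(m_minus, n))     # linearly independent

theta_k = 0.0
w_k = vertex_w_k(G_plus, G_minus, theta_k)  # sketch of Eqs. (10)-(14) above
# Inequalities (4) hold with equality by (19), up to numerical rounding.
assert np.all(G_plus @ w_k >= theta_k + 1.0 - 1e-9)
assert np.all(G_minus @ w_k <= theta_k - 1.0 + 1e-9)
```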

It is possible to enlarge the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) in a way that maintains their linear separability (4).

*Lemma* 1: Enlarging the positive learning set *G*<sub>k</sub><sup>+</sup> (2) by a new vector **x**<sub>j′</sub> (**x**<sub>j′</sub> ∉ *G*<sub>k</sub><sup>+</sup>) that is a linear combination (9) with parameters α<sub>j′,i</sub> of some feature vectors **x**<sub>j(l)</sub> (2) from this set (**x**<sub>j(l)</sub> ∈ *G*<sub>k</sub><sup>+</sup>) preserves the linear separability (4) of the learning sets if the parameters α<sub>j′,i</sub> fulfill the following condition:

$$\alpha_{j',1} + \dots + \alpha_{j',l} \ge 1 \tag{20}$$

If the assumptions of the lemma are met, then

$$\mathbf{w}_k^T \mathbf{x}_{j'} = \mathbf{w}_k^T (\alpha_{j',1} \mathbf{x}_{j(1)} + \dots + \alpha_{j',l} \mathbf{x}_{j(l)}) = \tag{21}$$

$$= \alpha_{j',1} \mathbf{w}_k^T \mathbf{x}_{j(1)} + \dots + \alpha_{j',l} \mathbf{w}_k^T \mathbf{x}_{j(l)} = \alpha_{j',1} (\theta_k + 1) + \dots + \alpha_{j',l} (\theta_k + 1) \ge \theta_k + 1$$

The above inequality means that the linear separability conditions (4) still hold after enlarging the learning set *G*<sub>k</sub><sup>+</sup> (2).

*Lemma* 2: Enlarging the negative learning set *G*<sub>k</sub><sup>−</sup> (2) by a new vector **x**<sub>j′</sub> (**x**<sub>j′</sub> ∉ *G*<sub>k</sub><sup>−</sup>) that is a linear combination (9) with parameters α<sub>j′,i</sub> of some feature vectors **x**<sub>j(l)</sub> (2) from this set (**x**<sub>j(l)</sub> ∈ *G*<sub>k</sub><sup>−</sup>) preserves the linear separability (4) of the learning sets if the parameters α<sub>j′,i</sub> fulfill the following condition:

$$\alpha_{j',1} + \dots + \alpha_{j',l} \le -1 \tag{22}$$

The justification of Lemma 2 can be based on an inequality similar to (21).
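Continuing the numerical example above, Lemma 1 can be checked directly: a new vector combined from *G*<sub>k</sub><sup>+</sup> with coefficients satisfying (20) lands on the positive side of the hyperplane. Note that the argument (21) implicitly needs θ<sub>k</sub> + 1 ≥ 0, which holds here since θ<sub>k</sub> = 0:

```python
import numpy as np

# Coefficients fulfilling condition (20): 0.7 + 0.6 = 1.3 >= 1.
alphas = np.array([0.7, 0.6])
x_new = alphas[0] * G_plus[0] + alphas[1] * G_plus[1]
# By (21), w_k^T x_new = 1.3 * (theta_k + 1) >= theta_k + 1.
assert w_k @ x_new >= theta_k + 1.0 - 1e-9
```

An analogous check with coefficients summing to at most −1 (22) over vectors from *G*<sub>k</sub><sup>−</sup> illustrates Lemma 2.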
