**3. Perceptron criterion function**

Minimization of the perceptron criterion function allows one to assess the degree of linear separability (4) of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) in different feature subspaces *F*[*n*′] (*F*[*n*′] ⊂ *F*[*n* + 1]) [6]. When defining the perceptron criterion function, it is convenient to use the following augmented feature vectors **y**<sub>j</sub> (**y**<sub>j</sub> ∈ *F*[*n* + 1]) and augmented weight vectors **v**<sub>k</sub> (**v**<sub>k</sub> ∈ *R*<sup>n+1</sup>) [1]:

$$\left(\forall j \in J_k^{+}\,(2)\right)\ \mathbf{y}_j = \left[\mathbf{x}_j^{\mathrm{T}},\, 1\right]^{\mathrm{T}}, \qquad \left(\forall j \in J_k^{-}\,(2)\right)\ \mathbf{y}_j = -\left[\mathbf{x}_j^{\mathrm{T}},\, 1\right]^{\mathrm{T}} \tag{23}$$

and

$$\mathbf{v}_k = \left[\mathbf{w}_k^{\mathrm{T}},\, -\theta_k\right]^{\mathrm{T}} = \left[w_{k,1}, \ldots, w_{k,n},\, -\theta_k\right]^{\mathrm{T}} \tag{24}$$

The augmented vectors **y**<sub>j</sub> are constructed (23) on the basis of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2). These learning sets are extracted from the data set *C* (1) according to some additional knowledge. The linear separability (4) of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) can be reformulated as the following set of *m* inequalities with the augmented vectors **y**<sub>j</sub> (23) [7]:

$$\left(\exists \mathbf{v}_k\right)\left(\forall j \in J_k^{+} \cup J_k^{-}\,(2)\right)\ \mathbf{v}_k^{\mathrm{T}}\mathbf{y}_j \geq 1 \tag{25}$$
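The construction (23) and the test (25) can be sketched in a few lines of code. The sketch below is illustrative only; the function names (`augment`, `satisfies_25`) and the numpy-based layout are assumptions, not part of the source:

```python
import numpy as np

def augment(X_plus, X_minus):
    """Augmented vectors y_j (23): [x_j, 1] for j in J_k^+, -[x_j, 1] for j in J_k^-."""
    Y_plus = np.hstack([X_plus, np.ones((len(X_plus), 1))])
    Y_minus = -np.hstack([X_minus, np.ones((len(X_minus), 1))])
    return np.vstack([Y_plus, Y_minus])

def satisfies_25(Y, v):
    """True if v fulfils all m inequalities v^T y_j >= 1 (25),
    i.e. if v linearly separates the two learning sets with unit margin."""
    return bool(np.all(Y @ v >= 1.0))
```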

The dual hyperplanes *h*<sub>j</sub><sup>p</sup> in the parameter space *R*<sup>n+1</sup> (**v** ∈ *R*<sup>n+1</sup>) are defined on the basis of the augmented vectors **y**<sub>j</sub> (23) [6]:

$$\left(\forall j \in J_k^{+} \cup J_k^{-}\,(2)\right)\ h_j^{\mathrm{p}} = \left\{\mathbf{v} : \mathbf{y}_j^{\mathrm{T}}\mathbf{v} = 1\right\} \tag{26}$$

The dual hyperplanes *h*<sub>j</sub><sup>p</sup> (26) divide the parameter space *R*<sup>n+1</sup> (**v** ∈ *R*<sup>n+1</sup>) into a finite number *L* of disjoint regions (*convex polyhedra*) *D*<sub>l</sub><sup>p</sup> (*l* = 1, …, *L*) [7]:

$$D_l^{\mathrm{p}} = \left\{\mathbf{v} : \left(\forall j \in J_l^{+}\right)\ \mathbf{y}_j^{\mathrm{T}}\mathbf{v} \geq 1\ \ and\ \ \left(\forall j \in J_l^{-}\right)\ \mathbf{y}_j^{\mathrm{T}}\mathbf{v} < 1\right\} \tag{27}$$

where *J*<sub>l</sub><sup>+</sup> and *J*<sub>l</sub><sup>−</sup> are disjoint subsets (*J*<sub>l</sub><sup>+</sup> ∩ *J*<sub>l</sub><sup>−</sup> = ∅) of the indices *j* of the feature vectors **x**<sub>j</sub> making up the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2).
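Given a parameter vector **v**, the polyhedron *D*<sub>l</sub><sup>p</sup> (27) containing it is identified by splitting the indices *j* according to the side of each dual hyperplane on which **v** lies. A minimal sketch (the helper name `region_signature` is an assumption):

```python
import numpy as np

def region_signature(Y, v):
    """Index sets J_l^+ and J_l^- (27) of the polyhedron D_l^p containing v:
    index j goes to J_l^+ if y_j^T v >= 1, and to J_l^- otherwise."""
    on_positive_side = Y @ v >= 1.0
    return np.flatnonzero(on_positive_side), np.flatnonzero(~on_positive_side)
```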

The perceptron penalty functions φ<sub>j</sub><sup>p</sup>(**v**) are defined as follows for each of the augmented feature vectors **y**<sub>j</sub> (23) [6]:

$$\left(\forall j \in J_k^{+} \cup J_k^{-}\right)\quad \varphi_j^{\mathrm{p}}(\mathbf{v}) = \begin{cases} 1 - \mathbf{y}_j^{\mathrm{T}}\mathbf{v} & if\ \ \mathbf{y}_j^{\mathrm{T}}\mathbf{v} < 1 \\ 0 & if\ \ \mathbf{y}_j^{\mathrm{T}}\mathbf{v} \geq 1 \end{cases} \tag{28}$$

The *j*-th penalty function φ<sub>j</sub><sup>p</sup>(**v**) (28) is greater than zero if and only if the weight vector **v** is located on the wrong side (**y**<sub>j</sub><sup>T</sup>**v** < 1) of the *j*-th dual hyperplane *h*<sub>j</sub><sup>p</sup> (26). The function φ<sub>j</sub><sup>p</sup>(**v**) (28) is linear and greater than zero as long as the parameter vector **v** = [*v*<sub>1</sub>, …, *v*<sub>n+1</sub>]<sup>T</sup> remains on the wrong side of the hyperplane *h*<sub>j</sub><sup>p</sup> (26). Convex and piecewise-linear (*CPL*) penalty functions φ<sub>j</sub><sup>p</sup>(**v**) (28) are used to enforce the linear separation (8) of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2).
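Definition (28) can be rewritten equivalently as φ<sub>j</sub><sup>p</sup>(**v**) = max(0, 1 − **y**<sub>j</sub><sup>T</sup>**v**), which allows all penalties to be evaluated at once in vectorized form. A minimal sketch (the name `phi_p` is an assumption):

```python
import numpy as np

def phi_p(Y, v):
    """All perceptron penalties phi_j^p(v) (28) at once: each entry equals
    1 - y_j^T v when v is on the wrong side of h_j^p, and 0 otherwise."""
    return np.maximum(0.0, 1.0 - Y @ v)
```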

The perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) is defined as the weighted sum of the penalty functions φ<sub>j</sub><sup>p</sup>(**v**) (28) [6]:

$$\Phi_k^{\mathrm{p}}(\mathbf{v}) = \sum_j \alpha_j\, \varphi_j^{\mathrm{p}}(\mathbf{v}) \tag{29}$$

The positive parameters *α*<sub>j</sub> (*α*<sub>j</sub> > 0) can be treated as prices of the individual feature vectors **x**<sub>j</sub>:

$$\left(\forall j \in J_k^{+}\,(2)\right)\ \alpha_j = 1 / \left(2\, m_k^{+}\right) \quad and \quad \left(\forall j \in J_k^{-}\,(2)\right)\ \alpha_j = 1 / \left(2\, m_k^{-}\right) \tag{30}$$

where *m*<sub>k</sub><sup>+</sup> (*m*<sub>k</sub><sup>−</sup>) is the number of elements **x**<sub>j</sub> in the learning set *G*<sub>k</sub><sup>+</sup> (*G*<sub>k</sub><sup>−</sup>) (2).
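Combining (28), (29), and (30) gives a direct evaluation of the criterion. A minimal sketch, assuming the first *m*<sub>k</sub><sup>+</sup> rows of `Y` come from *G*<sub>k</sub><sup>+</sup> (the function name and argument layout are assumptions):

```python
import numpy as np

def perceptron_criterion(Y, m_plus, v):
    """Phi_k^p(v) (29) with the standard prices alpha_j (30); the first
    m_plus rows of Y stem from G_k^+ and the remaining rows from G_k^-."""
    m_minus = len(Y) - m_plus
    alpha = np.concatenate([np.full(m_plus, 1.0 / (2 * m_plus)),
                            np.full(m_minus, 1.0 / (2 * m_minus))])
    return float(alpha @ np.maximum(0.0, 1.0 - Y @ v))
```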

The perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) was built on the basis of the *error correction* algorithm, the basic algorithm in the *Perceptron* model of learning processes in neural networks [14].

The criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) is convex and piecewise-linear (*CPL*) [6]. This means, among other things, that the function Φ<sub>k</sub><sup>p</sup>(**v**) (29) remains linear within each region *D*<sub>l</sub><sup>p</sup> (27):

$$\left(\forall l \in \{1, \ldots, L\}\right)\left(\forall \mathbf{v} \in D_l^{\mathrm{p}}\right)\quad \Phi_k^{\mathrm{p}}(\mathbf{v}) = \sum_j \alpha_j - \left(\sum_j \alpha_j\, \mathbf{y}_j\right)^{\mathrm{T}}\mathbf{v} \tag{31}$$

where the summation is performed over all vectors **y**<sub>j</sub> (23) fulfilling the condition **y**<sub>j</sub><sup>T</sup>**v** < 1.
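The identity (31) can be checked numerically: at any point **v**, summing the active penalties per (28) and evaluating the reconstructed linear form give the same value. A small self-contained check (the random data and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 3))        # eight augmented vectors y_j in R^3
alpha = np.full(8, 1.0 / 8.0)      # equal prices alpha_j
v = rng.normal(size=3)             # an arbitrary parameter vector

active = Y @ v < 1.0               # indices j with y_j^T v < 1 (wrong side)
linear_form = alpha[active].sum() - (alpha[active] @ Y[active]) @ v  # (31)
direct_value = alpha @ np.maximum(0.0, 1.0 - Y @ v)                  # (29)
assert np.isclose(linear_form, direct_value)
```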

The optimal vector **v**<sub>k</sub>\* determines the minimum value Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) of the criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29):

$$\left(\exists \mathbf{v}_k^{*}\right)\left(\forall \mathbf{v} \in R^{n+1}\right)\ \Phi_k^{\mathrm{p}}(\mathbf{v}) \geq \Phi_k^{\mathrm{p}}\left(\mathbf{v}_k^{*}\right) \geq 0 \tag{32}$$

Since the criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) is linear in each convex polyhedron *D*<sub>l</sub><sup>p</sup> (27), the optimal point **v**<sub>k</sub>\* representing the minimum Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) (32) can be located in a vertex of some polyhedron *D*<sub>l′</sub><sup>p</sup> (27). This property of the optimal vector **v**<sub>k</sub>\* (32) follows from the *fundamental theorem of linear programming* [5].
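The link to linear programming also suggests one practical way to compute **v**<sub>k</sub>\*: with slack variables ξ<sub>j</sub> ≥ max(0, 1 − **y**<sub>j</sub><sup>T</sup>**v**), minimizing Σ<sub>j</sub> α<sub>j</sub> ξ<sub>j</sub> is an ordinary LP. The sketch below uses this standard slack-variable encoding with scipy; it is an illustrative alternative, not the basis-exchange algorithms actually used for CPL functions in [6]:

```python
import numpy as np
from scipy.optimize import linprog

def minimize_perceptron_criterion(Y, alpha):
    """Minimize Phi_k^p(v) (29) as an LP over z = [v, xi]:
    minimize alpha^T xi subject to xi_j >= 1 - y_j^T v and xi_j >= 0."""
    m, d = Y.shape                       # m augmented vectors in R^{n+1}
    c = np.concatenate([np.zeros(d), alpha])
    A_ub = np.hstack([-Y, -np.eye(m)])   # encodes 1 - Y v <= xi
    b_ub = -np.ones(m)
    bounds = [(None, None)] * d + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.fun            # vertex v_k* and Phi_k^p(v_k*)
```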

It has been shown that the minimum value Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) (32) of the perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) with the parameters α<sub>j</sub> (30) is normalized as follows [6]:

$$0 \leq \Phi_k^{\mathrm{p}}\left(\mathbf{v}_k^{*}\right) \leq 1 \tag{33}$$

The following theorem has been proved [6]:

*Theorem* 2: The minimum value Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) (32) of the perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) is equal to zero (Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) = 0) if and only if the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) are linearly separable (4).

The minimum value Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) (32) is close to one (Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) ≈ 1) if the sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2) overlap almost completely. It can also be proved that the minimum value Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) (32) of the perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) does not depend on invertible linear transformations of the feature vectors **y**<sub>j</sub> (23) [6].
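Theorem 2 together with the normalization (33) makes Φ<sub>k</sub><sup>p</sup>(**v**<sub>k</sub>\*) usable as a separability score. A usage example reusing `augment` and `minimize_perceptron_criterion` from the sketches above (the synthetic clouds are an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
X_plus = rng.normal(loc=+2.0, size=(20, 2))   # G_k^+ : cloud around (+2, +2)
X_minus = rng.normal(loc=-2.0, size=(20, 2))  # G_k^- : cloud around (-2, -2)
Y = augment(X_plus, X_minus)                  # augmented vectors (23)
alpha = np.full(40, 1.0 / 40.0)               # prices (30) with m+ = m- = 20
v_star, phi_star = minimize_perceptron_criterion(Y, alpha)
# Well-separated clouds: phi_star is (near) zero, as Theorem 2 predicts;
# heavily overlapping clouds would push phi_star towards one, per (33).
print(phi_star)
```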

The regularized criterion function Ψ<sub>k</sub><sup>p</sup>(**v**) is defined as the sum of the perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29) and some additional penalty functions [13]. These additional *CPL* functions are equal to the costs γ<sub>i</sub> (γ<sub>i</sub> > 0) of the individual features *X*<sub>i</sub> multiplied by the absolute values |*w*<sub>i</sub>| of the weights *w*<sub>i</sub>, where **v** = [**w**<sup>T</sup>, −θ]<sup>T</sup> = [*w*<sub>1</sub>, …, *w*<sub>n</sub>, −θ]<sup>T</sup> ∈ *R*<sup>n+1</sup> (24):

$$\Psi_k^{\mathrm{p}}(\mathbf{v}) = \Phi_k^{\mathrm{p}}(\mathbf{v}) + \lambda \sum_i \gamma_i\, |w_i| \tag{34}$$

where λ (λ ≥ 0) is the *cost level*. The standard values of the cost parameters γ<sub>i</sub> are equal to one ((∀*i* ∈ {1, …, *n*}) γ<sub>i</sub> = 1).
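The function Ψ<sub>k</sub><sup>p</sup>(**v**) (34) is still CPL, so its minimization also reduces to a linear program once each weight is split as *w*<sub>i</sub> = *w*<sub>i</sub><sup>+</sup> − *w*<sub>i</sub><sup>−</sup> with *w*<sub>i</sub><sup>+</sup>, *w*<sub>i</sub><sup>−</sup> ≥ 0, so that |*w*<sub>i</sub>| = *w*<sub>i</sub><sup>+</sup> + *w*<sub>i</sub><sup>−</sup> at the optimum. A hedged sketch of this standard encoding (again an LP illustration, not the algorithm of [13]):

```python
import numpy as np
from scipy.optimize import linprog

def minimize_regularized_criterion(Y, alpha, gamma, lam):
    """Minimize Psi_k^p(v) (34) as an LP over z = [w+, w-, t, xi],
    where v = [w+ - w-, t] and t stands for -theta."""
    m, d = Y.shape
    n = d - 1                            # last column of Y multiplies -theta
    Yx, Yt = Y[:, :n], Y[:, n:]
    g = lam * np.asarray(gamma, dtype=float)
    c = np.concatenate([g, g, [0.0], alpha])
    A_ub = np.hstack([-Yx, Yx, -Yt, -np.eye(m)])   # 1 - y_j^T v <= xi_j
    b_ub = -np.ones(m)
    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:n] - res.x[n:2 * n]
    return w, -res.x[2 * n], res.fun     # weights w, threshold theta, Psi*
```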

The optimal vector **v**<sub>k,λ</sub>\* constitutes the minimum value Ψ<sub>k</sub><sup>p</sup>(**v**<sub>k,λ</sub>\*) of the *CPL* criterion function Ψ<sub>k</sub><sup>p</sup>(**v**) (34), which is defined on the elements **x**<sub>j</sub> of the learning sets *G*<sub>k</sub><sup>+</sup> and *G*<sub>k</sub><sup>−</sup> (2):

$$\left(\exists \mathbf{v}_{k,\lambda}^{*}\right)\left(\forall \mathbf{v} \in R^{n+1}\right)\ \Psi_k^{\mathrm{p}}(\mathbf{v}) \geq \Psi_k^{\mathrm{p}}\left(\mathbf{v}_{k,\lambda}^{*}\right) > 0 \tag{35}$$

As in the case of the perceptron criterion function Φ<sub>k</sub><sup>p</sup>(**v**) (29), the optimal vector **v**<sub>k,λ</sub>\* (35) can be located in a vertex of some polyhedron *D*<sub>l′</sub><sup>p</sup> (27). The minimum value Ψ<sub>k</sub><sup>p</sup>(**v**<sub>k,λ</sub>\*) (35) of the criterion function Ψ<sub>k</sub><sup>p</sup>(**v**) (34) is used, among others, in the *relaxed linear separability* (*RLS*) method of gene subset selection [15].
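The role of the cost level λ in such feature selection can be illustrated with a simplified sweep (a rough sketch of the RLS idea only, not the actual procedure of [15]): increasing λ drives successive weights *w*<sub>i</sub> to zero, and the corresponding features *X*<sub>i</sub> can then be dropped.

```python
import numpy as np

# Reuses Y and alpha from the usage example above, together with the LP
# sketch minimize_regularized_criterion; gamma_i = 1 is the standard cost (34).
for lam in (0.0, 0.01, 0.1, 1.0):
    w, theta, psi_star = minimize_regularized_criterion(
        Y, alpha, np.ones(Y.shape[1] - 1), lam)
    kept = np.flatnonzero(np.abs(w) > 1e-9)   # features X_i still in use
    print(f"lambda={lam}: kept features {kept}, Psi* = {psi_star:.4f}")
```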
