**4.1 The hidden layers**

An MLP network usually has many layers of nodes connected by adjustable weights. This removes the restriction that the network is only able to

**Figure 3.** *An MLP network consisting of three layers of nodes.*

classify linearly separable patterns, which is all the ordinary perceptron network can do. By introducing hidden layers, more complex patterns can be classified. Each new hidden layer lets the network learn more complex patterns, but at the same time more hidden layers may decrease the performance of the network, since many degrees of freedom may be introduced that are not really needed. This depends, however, on the actual algorithm being used.

#### **4.2 Training of MLP**

A supervised learning algorithm is used to train the MLP network. The network is presented with a set of training examples, where a target vector is given for each training example. The target vector is then compared to the output vector, and the weights of the network are adjusted to make the network perform better on the training examples and targets defined. The training algorithm used in this chapter is the backpropagation algorithm, a well-known learning algorithm used for classification [13–15].

Each node in the network is activated in accordance with the input of the node and the node activation function. The difference between the calculated output and the target output is computed, and all the weights between the output layer, hidden layers, and the input layer are then adjusted using this error value. The sigmoidal function $f(x) = 1/(1 + e^{-x})$ is used to compute the output of a node, but other functions may also be used and may even give better performance. The weighted sum $S_j = \sum_i w_{ji} a_i$ is inserted into the sigmoidal function, and the output value from a unit j is given by:

$$f\left(S_j\right) = 1/\left(1 + e^{-S_j}\right) \tag{1}$$

The error value of an output unit j is computed by the formula:

$$\delta_j = (t_j - a_j)\, f'(S_j) \tag{2}$$

$t_j$ and $a_j$ are the target and output value for unit j, and $f'$ is the derivative of the function f. The error value calculated for a hidden node is given by:

$$\delta_j = \sum_k \delta_k\, w_{kj}\, f'(S_j) \tag{3}$$

From the formula, we see that the error of a processing unit in the hidden layer is computed from the error values of the layer above. Finally, the weights can be adjusted by:

$$\Delta w_{ji} = \alpha\, \delta_j\, a_i \tag{4}$$

Here, α is the learning rate parameter.

Very often another parameter is also used in the MLP network, called the momentum (β). This additional parameter can be very helpful in speeding up the convergence of the algorithm and avoiding local minima [7]. By including momentum in the weight update of Eq. (4), the next iteration step can be written as:

$$w_{ji}(t+1) = w_{ji}(t) + \alpha\, \delta_j\, a_i + \beta\, \Delta w_{ji}(t) \tag{5}$$

Here again α is the learning rate, β is the momentum, and Δwji is the weight change from the previous processing step.
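The update rules of Eqs. (1)–(5) can be sketched as a minimal backpropagation loop. The example below is an illustrative pure-Python sketch (not the javANN code used later in the experiments), training a tiny 2-2-1 network on logical OR with the same learning rate (0.1) and momentum (0.9) used in the experiments of Section 6:

```python
import math, random

def f(s):                       # Eq. (1): sigmoidal activation
    return 1.0 / (1.0 + math.exp(-s))

def f_prime(s):                 # derivative f'(s) = f(s)(1 - f(s))
    fs = f(s)
    return fs * (1.0 - fs)

random.seed(1)
n_in, n_hid = 2, 2
alpha, beta = 0.1, 0.9          # learning rate and momentum

# weights[j][i]; index n_in is a bias input fixed at 1.0
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
dw_hid = [[0.0] * (n_in + 1) for _ in range(n_hid)]
dw_out = [0.0] * (n_hid + 1)

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR

for epoch in range(10000):
    for x, t in data:
        a_in = x + [1.0]
        S_hid = [sum(wji * ai for wji, ai in zip(w, a_in)) for w in w_hid]
        a_hid = [f(s) for s in S_hid] + [1.0]
        S_out = sum(wj * aj for wj, aj in zip(w_out, a_hid))
        a_out = f(S_out)

        d_out = (t - a_out) * f_prime(S_out)           # Eq. (2)
        d_hid = [d_out * w_out[j] * f_prime(S_hid[j])  # Eq. (3)
                 for j in range(n_hid)]

        for j in range(n_hid + 1):                     # Eqs. (4)-(5) with momentum
            dw_out[j] = alpha * d_out * a_hid[j] + beta * dw_out[j]
            w_out[j] += dw_out[j]
        for j in range(n_hid):
            for i in range(n_in + 1):
                dw_hid[j][i] = alpha * d_hid[j] * a_in[i] + beta * dw_hid[j][i]
                w_hid[j][i] += dw_hid[j][i]

def predict(x):
    a_in = x + [1.0]
    a_hid = [f(sum(wji * ai for wji, ai in zip(w, a_in))) for w in w_hid] + [1.0]
    return f(sum(wj * aj for wj, aj in zip(w_out, a_hid)))
```

The momentum term reuses the previous weight change, so consecutive updates in the same direction accelerate while oscillating updates damp each other out.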

## **5. SVM theory**

SVM (support vector machine) is a computationally efficient learning algorithm that is now widely used in pattern recognition and classification problems [8]. The algorithm has been derived from the ideas of statistical learning theory to control the generalization abilities of a learning machine [16, 17]. An optimal hyperplane is learnt that classifies the given patterns that the machine is learning. By use of what are called kernel functions, the input feature space can be transformed into a higher-dimensional space where the optimal hyperplane can be learnt. Such an approach gives great flexibility, since one of many learning models can be chosen simply by changing the kernel function. Consider a nonlinear mapping Φ : R^D → F, where F represents the feature space and k(x, x′) is a Mercer kernel [18]. The inner product k(x, x′) of Φ is defined by:

$$
\Phi: \mathbb{R}^{\mathcal{D}} \to \mathcal{F} \tag{6}
$$

$$\mathbf{k}(\mathbf{x}, \mathbf{x}') = \boldsymbol{\Phi}^T(\mathbf{x}') \, \boldsymbol{\Phi}(\mathbf{x}) \tag{7}$$

where the dimension D of the input space is much less than the dimension of the feature space F. Mercer kernels are known mathematical functions (polynomial, sigmoid, etc.), and therefore we can calculate the inner product of Φ without actually knowing Φ itself. The learning algorithm selects support vectors to build the decision surface in the feature space. Support vectors are the patterns (vectors) that are most difficult to categorize and lie on the margins of the SVM classifier. This mapping is achieved by first solving a *convex* optimization problem and then applying a linear mapping from the feature space to the output space. The advantage of having a convex optimization problem is that the solution is unique. This is in contrast to ANN, where the error function may have many local minima or maxima. **Figure 4** illustrates the concept of SVM visually.

**Figure 4.** *The mappings of SVM.*

Calculating Φ may be a time-consuming process and is often not feasible. However, Mercer's theorem allows us to avoid this computation, so there is no need to explicitly describe the nonlinear mapping Φ nor the image points in the feature space F. This technique is known as the kernel trick [19].
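As a concrete illustration of the kernel trick (an assumed toy example, not from the chapter's own code), the degree-2 polynomial kernel k(x, x′) = (xᵀx′ + 1)² on R² equals the ordinary inner product under an explicit 6-dimensional feature map Φ; the sketch below verifies this numerically:

```python
import math

def kernel(x, xp):
    # Polynomial kernel of degree 2: k(x, x') = (x.x' + 1)^2
    dot = x[0] * xp[0] + x[1] * xp[1]
    return (dot + 1.0) ** 2

def phi(x):
    # Explicit feature map into the 6-dimensional space F for the same kernel
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

x, xp = [1.0, 2.0], [3.0, -1.0]
lhs = kernel(x, xp)                                  # evaluated without Phi
rhs = sum(a * b for a, b in zip(phi(x), phi(xp)))    # evaluated explicitly in F
```

Both evaluations give the same number, but the kernel form never constructs the six feature-space coordinates; for higher degrees and dimensions this saving becomes essential.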

#### **5.1 The SVM classifier**

The concept of the SVM classifier is illustrated in **Figure 5**, which shows the simplest case, where the data vectors (marked by 'X's and 'O's) can be separated by a hyperplane.

There may exist many separating hyperplanes. The SVM classifier seeks the separating hyperplane that produces the largest margin of separation. In a more general case, where the data points are not linearly separable in the input space, a nonlinear transformation is used to map the data vectors into a high-dimensional space (the feature space) prior to applying the linear maximum margin classifier.

SVM was initially designed to classify only binary data, as in **Figure 5**. Multiclass SVMs have, however, been designed, allowing classification into a finite number of classes [18]. This kind of learning assumes a priori knowledge of the data. In the case of autism, we may then have different kinds of autism (which is also the reality), and the SVM algorithm may then be able to learn to categorize between them.

#### **5.2 The kernel**

SVM uses a kernel function where the nonlinear mapping is implicitly embedded. The discriminant function of the SVM classifier can be defined as:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i\, y_i\, K(\mathbf{x}, \mathbf{x}_i) + b \tag{8}$$

**Figure 5.** *A SVM classification maximizes the margins between the different classes.*

Here, K(·,·) is the kernel function, **x**_i are the support vectors determined from the training data, y_i is the class indicator (e.g., +1 and −1 for a two-class problem) associated with each **x**_i, N is the number of support vectors determined during training, α_i is the Lagrange multiplier for each point in the training set, and b (bias) is a scalar representing the perpendicular distance of the hyperplane from the origin.
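The discriminant function described above can be sketched directly. The example below uses the Gaussian RBF kernel of Eq. (10) with hypothetical support vectors, multipliers, and bias chosen purely for illustration (real values would come from the trained LIBSVM model):

```python
import math

def rbf(x, xi, sigma=1.0):
    # Gaussian RBF kernel (cf. Eq. (10))
    d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def decision(x, support_vectors, y, alphas, b, sigma=1.0):
    # f(x) = sum_i alpha_i * y_i * K(x, x_i) + b
    return sum(a * yi * rbf(x, xi, sigma)
               for a, yi, xi in zip(alphas, y, support_vectors)) + b

# Hypothetical support vectors, multipliers, and bias (illustration only)
sv = [[0.0, 0.0], [2.0, 2.0]]
y = [-1, +1]
alphas = [1.0, 1.0]
b = 0.0

# The sign of f(x) gives the predicted class
label = +1 if decision([1.9, 1.9], sv, y, alphas, b) >= 0 else -1
```

A test point close to the positive support vector gets a positive score and is assigned class +1; a point close to the negative support vector would get class −1.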

A problem with using SVM soon appears: as the dimension of the data increases, the complexity of the problem also increases. This is called the *curse of dimensionality* [20]. However, this may be overcome using the kernel trick from Mercer's theorem.

The most commonly used kernel functions are the polynomial kernel given by:

$$\mathbf{K(x\_i, x\_j)} = \left(\mathbf{x\_i^T x\_j} + \mathbf{1}\right)^p, \text{where } \mathbf{p} > 0 \text{ is a constant} \tag{9}$$

And the Gaussian radial basis function (RBF) kernel is given by:

$$\mathbf{K}(\mathbf{x\_i}, \mathbf{x\_j}) = \exp \left( -\left\| \mathbf{x\_i} - \mathbf{x\_j} \right\|^2 / 2\sigma^2 \right) \tag{10}$$

Here, σ > 0 is a constant that defines the kernel width.

The mapping to the output space is based on Cover's theorem [21], illustrated in **Figure 6**. By transforming to a higher-dimensional space, data that are not linearly separable in the input space can become linearly separable in the feature space. In this way, we are able to categorize nonlinear data by use of the SVM algorithm.
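A minimal numerical illustration of Cover's idea (an assumed toy example, not from the chapter): points on a line with the positive class at the extremes cannot be separated by any single threshold on x, but after the nonlinear map x ↦ (x, x²) they are separated by a straight line in the plane:

```python
# Points on a line: class +1 at the extremes, class -1 in the middle.
# No single threshold on x separates them.
data = [(-2.0, +1), (-0.5, -1), (0.5, -1), (2.0, +1)]

def lift(x):
    # Nonlinear map R -> R^2: x |-> (x, x^2), cf. Cover's theorem
    return (x, x * x)

def classify(x):
    # In the lifted space the classes are separated by the line x2 = 1
    return +1 if lift(x)[1] > 1.0 else -1

ok = all(classify(x) == label for x, label in data)
```

The classifier is linear in the lifted space (a threshold on the second coordinate), yet its decision boundary in the original input space is nonlinear: exactly the situation exploited by the SVM kernels.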

## **6. Experiments and results**

A Java program was made to generate the patterns from the original HPLC data. The generation of an optimal selection of patterns depended very much on computing the right number of peaks from the HPLC spectra.

The training and testing data that we used were written to file, to later be read into a Java program. The Java program developed for training uses the javANN (Java Artificial Neural Network) package, developed by the company Pattern Solutions Ltd. [22] in Norway.

A Java program was then developed to change the format of the data, so it could be used in the LIBSVM toolbox [23] to classify between a normal child and an autistic child. Before the training started, a regularization parameter C (cost) had to be determined. The value of C was determined experimentally, which is not an optimal way to do it. The performance of the SVM classifier was optimal for C values from 100 up to around 200. The SVM algorithm was tested on the same samples of (unseen) data as the MLP algorithm, with a constant C equal to 100 or 200. The SVM algorithm also uses another constant γ (gamma) that has to be defined, or the algorithm may determine it by default.

#### **6.1 MLP and SVM experiments**

#### *6.1.1 Small-scale experiments*

In the first experiment, 18 samples were used for training and 12 samples for testing; these were samples that the algorithm had never seen before. In the MLP experiments, the number of hidden nodes was set to 100, the learning rate to 0.1, and the momentum to 0.9. The number of iterations was set to 10,000.

The MLP algorithm was tested on the 12 unseen samples. The test data were unknown to the system, but we knew which category they belonged to, so the performance rate was easy to calculate. The MLP network was able to correctly classify 11 of the 12 samples as either an autistic or a normal child. The best performance was thus estimated at 11/12 = 91.7%.

The best performance of the SVM algorithm was estimated at 83.4%, where 10 of 12 samples were classified correctly. On average, one false-positive sample occurred, corresponding to a normal child being classified as autistic. This is a far more serious mistake than a false-negative classification error, where an autistic child is classified as normal.

#### *6.1.2 Large-scale experiments*

The second delivery of data consisted of 62 samples of autistic children and 52 samples of normal children, 114 samples in total. In the second analysis, we wanted to see if the *proof of principle* experiment on the first delivery of data could be extended to a *proof of concept* experiment. The training set now consisted of 71 samples, and the test set of 43 samples.

The best performance of the SVM algorithm was estimated at 88.4% with a penalty constant C = 100. This implies that 38 of 43 samples were correctly classified. The average performance of the SVM was estimated at about 85%.

The MLP network gave a best performance estimated at 81.4%, with an average value of 78.3%. This implies that 35 of 43 samples were classified correctly. In both experiments, the average number of false-positive cases was equal to 2 for both algorithms.

## **7. Conclusion**

Pattern diagnostics represents a new way to detect diseases early. The method may also be used to classify, for instance, between different DNA sequences [24–26]. In this chapter, we have used it to diagnose early autism. Such an analysis requires only a small amount of urine to create the HPLC spectra. Mass spectrometry (MS) is, we believe, another method that could be used. The most important aspect of such an analysis is a very high throughput, since both HPLC and MS spectra can be determined in a short time. Another important aspect of this analysis is that the patterns themselves serve as the discriminator, independent of the identity of the proteins. The classification can then be done before the identity of the proteins is determined.

Both a proof of principle and a proof of concept experiment have been carried out, and two quite independent algorithms have been used to analyze the data. Both algorithms have shown consistent results with respect to early identification of autism from the HPLC data.

## **8. Future work**

The values of the parameters used in the algorithms are not optimal. The selection of different parameter values has been carried out experimentally. A lot of tuning of the parameter values is needed to adapt the algorithms to the given data. One method for optimization is particle swarm optimization (PSO) [27]. This method can be used to determine optimal values of the different parameters, for both the neural network and the SVM. PSO is now widely used in many different types of applications.

Another aspect of the MLP neural network used is that we have used a sigmoidal activation function. However, there are other types of neurons which can be used to introduce nonlinearities in the computation. *Tanh* neurons use a similar kind of nonlinearity as the sigmoidal function, but the output of tanh ranges from −1 to 1, compared to the sigmoidal function, where the output ranges from 0 to 1. Tanh may in many cases give better performance of the neural network.

A different kind of nonlinearity is used by the *rectified linear unit* (ReLU) neuron. It uses the function f(x) = max(0, x) and may in many cases give the best performance of the ANN.
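The three activation functions discussed here can be written side by side as a short sketch for comparison:

```python
import math

def sigmoid(x):
    # Output ranges from 0 to 1
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Similar S-shape, but output ranges from -1 to 1
    return math.tanh(x)

def relu(x):
    # Rectified linear unit: f(x) = max(0, x)
    return max(0.0, x)
```

Sigmoid and tanh saturate for large |x|, which can slow gradient-based learning; ReLU does not saturate for positive inputs, which is one reason it often trains faster in deep networks.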

#### **8.1 Particle swarm intelligence**

Computational swarm intelligence (CSI) may be defined as algorithmic models whose design came from the study of bird flocks and ant swarms, simulating their behaviors in computer models. These simulations show a great ability to explore a multidimensional space and have quickly turned into quite a new domain of algorithmic theory. A swarm can be defined as a group of agents cooperating to achieve some goal.

PSO is based on an intrinsic property of swarms to execute complex tasks through a self-organization process [28]. This is a new way to explore a high-dimensional search space to find optimal solutions. The particles resemble birds that fly through a hyperdimensional space. The social tendency of individuals is used to update the velocity of the particles: each particle is influenced by the experience of its neighbors and its own knowledge. The norm is that the agents should behave with no centralized structure; the local interactions between the agents often lead to the emergence of global behavior [29–32]. The particles in the multidimensional search space represent all feasible solutions of a given problem. By using a *fitness* function, their positions are updated in the process of finding an optimal or near-optimal solution.

*May Big Data Analysis Be Used to Diagnose Early Autism? DOI: http://dx.doi.org/10.5772/intechopen.109537*
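A minimal global-best PSO sketch, assuming standard inertia and acceleration coefficients and a toy quadratic fitness function (in a real application the fitness would instead be, e.g., the validation error of the classifier as a function of its hyperparameters):

```python
import random

random.seed(7)

def fitness(p):
    # Toy fitness with minimum at (3, -1); stands in for a real objective
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

n_particles, n_dims, iters = 20, 2, 200
w, c1, c2 = 0.7, 1.5, 1.5        # inertia, cognitive, and social coefficients

pos = [[random.uniform(-10, 10) for _ in range(n_dims)] for _ in range(n_particles)]
vel = [[0.0] * n_dims for _ in range(n_particles)]
pbest = [p[:] for p in pos]              # each particle's best position so far
gbest = min(pbest, key=fitness)          # best position found by the swarm

for _ in range(iters):
    for k in range(n_particles):
        for d in range(n_dims):
            r1, r2 = random.random(), random.random()
            # Velocity update: own memory pulls toward pbest, the swarm toward gbest
            vel[k][d] = (w * vel[k][d]
                         + c1 * r1 * (pbest[k][d] - pos[k][d])
                         + c2 * r2 * (gbest[d] - pos[k][d]))
            pos[k][d] += vel[k][d]
        if fitness(pos[k]) < fitness(pbest[k]):
            pbest[k] = pos[k][:]
    gbest = min(pbest, key=fitness)
```

After the loop, `gbest` holds a near-optimal solution; no gradient of the fitness function is ever needed, which is what makes PSO attractive for hyperparameter tuning.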

How may this be used to improve the results of recognition of early autism? For a neural network, we could use PSO to find a correct configuration of the network: to determine the correct number of neurons in the hidden layer, and to find optimal values of the learning rate and momentum of the network.

For the SVM, we may use the PSO algorithm to find optimal values for the gamma and cost parameters. We could, for instance, also use a global PSO algorithm [31] to find these values. So far we have not done this, but it belongs to our future work.

#### **8.2 Deep learning: backpropagation with TensorFlow**

Deep learning is another approach that may be used to achieve better performance in the recognition of early autism. A deep neural network is, in its simplest form, a neural network with many hidden layers [33]; see **Figure 7**. The more hidden layers, the deeper the network. As the neural network gets deeper, the processing power needed to train the network increases substantially. We may also increase the number of nodes in each hidden layer, often called the width of the network. Multiple neural network frameworks exist today. Maybe the most used is TensorFlow [34], an open-source machine learning library developed by Google.

In TensorFlow, numerical computations are processed using data flow graphs (illustrated in **Figure 8**). The data is represented as *Tensors.* A Tensor in TensorFlow may be described as a typed multidimensional array. Nodes in the data flow graph are called ops (short for operations). Each op takes zero or more Tensors as input and performs some computation and outputs zero or more Tensors.

The edges in the graph represent Tensors communicated between the nodes of the graph. **Figure 8** illustrates a TensorFlow data graph.
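The dataflow idea can be mimicked with a toy graph in plain Python (this is an illustration of the concept only, not the actual TensorFlow API): nodes are ops, edges carry values between them, and evaluation happens only when the graph is run:

```python
class Op:
    # A node ("op") in a toy dataflow graph: it takes input tensors
    # (here plain numbers or other ops) and produces an output when run
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def run(self):
        # Recursively evaluate input ops; plain values pass through unchanged
        return self.fn(*(i.run() if isinstance(i, Op) else i
                         for i in self.inputs))

# Build a small graph computing (a + b) * c; nothing is evaluated yet
a, b, c = 2.0, 3.0, 4.0
add = Op(lambda x, y: x + y, a, b)
mul = Op(lambda x, y: x * y, add, c)

result = mul.run()   # evaluation happens here: (2 + 3) * 4 = 20.0
```

Separating graph construction from graph execution is what lets a framework like TensorFlow optimize, parallelize, or differentiate the computation before running it.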

We could also apply a TensorFlow backpropagation algorithm as in [34]. To create a TensorFlow network, we need to break the network into Tensors. We also have to

**Figure 7.** *A simple and deep learning network.*

**Figure 8.** *A TensorFlow Data graph.*

define the input data. Then, we need to represent our input as Tensors. We then also need placeholder operations to hold the data.

In the backpropagation algorithm, the cost function is minimized. If we want to see visually what is happening during the learning process, we may introduce a graph and create a TensorFlow session.

To optimize the neural network, different hyperparameters need to be tuned, such as the learning rate, the momentum, the batch size, and the number of hidden layers and of nodes in each layer. This may be done by using, for instance, the PSO method discussed before.



When using a standard gradient descent algorithm to find the minimum cost function or global error, the weights are updated after each iteration. However, by using a *stochastic* gradient descent algorithm, we may use small batches from the dataset in each iteration.

While standard gradient descent performs a parameter update after each run through the whole training set, a stochastic gradient descent performs a parameter update after each batch. According to LeCun et al. [35], one should use stochastic gradient descent if the training set is large (more than a few hundred samples) and redundant, and the task is classification.
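The difference between the two update schemes can be sketched on a toy one-parameter least-squares problem (illustrative code with assumed data and step size):

```python
import random

random.seed(0)
# Toy data generated from y = 2x, so the optimal weight is w = 2
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

def grad(w, batch):
    # Gradient of the mean squared error of y_hat = w*x over the given batch
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# Standard gradient descent: one update per pass over the whole training set
w_gd = 0.0
for _ in range(100):
    w_gd -= 0.05 * grad(w_gd, data)

# Stochastic (mini-batch) gradient descent: one update per small batch
w_sgd = 0.0
for _ in range(100):
    random.shuffle(data)
    for i in range(0, len(data), 2):        # batches of size 2
        w_sgd -= 0.05 * grad(w_sgd, data[i:i + 2])
```

Both converge to w ≈ 2 here, but the stochastic variant performs three updates per pass instead of one; on large, redundant training sets this is what makes it the recommended choice.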

The deeper and wider the network is, the more computationally expensive it is to train. When using gradient descent, one could use the method "*GradientDescentOptimizer*" in TensorFlow; the algorithm then converges within reasonable time. Introducing a great number of hidden layers and/or a large number of nodes in each hidden layer will not necessarily increase the accuracy of the network, although it may give better performance.

**Figure 9.** *Gradient descent in a deep learning network.*

**Figure 9** shows how the cost decreases when using gradient descent on small batches from the dataset in each iteration. An example of a learning curve using a deep learning network with stochastic gradient descent may look like the one given in **Figure 9**. For the configuration of the network, we may, for instance, use up to 10 hidden layers with 10–20 nodes in each layer.

We have not yet used a deep learning network to classify between an autistic and a normal child, but this belongs to future work. We may also use such a deep learning network to distinguish different kinds of autism within the autism spectrum and be able to classify between them, based on their HPLC spectrum data.
