### **3. Methods**

#### **2.3. Data Modeling**

346 Recent Advances in Autism Spectrum Disorders - Volume I

As illustrated in Figure 3, aCGH technology is an experimental approach for genome-wide scanning of differences in DCN samples. It provides a high-resolution method to map and measure relative changes in DCN simultaneously at thousands of genomic loci. In a biological experiment, unknown (test) and reference (normal) DNA samples are labeled with the fluorescent dyes Cy3 and Cy5, respectively. They are then combined and competitively co-hybridized to an array containing genomic DNA targets that have been spotted on a glass slide. The resulting ratio of the fluorescence intensities is proportional to the ratio of the copy numbers of DNA sequences in the test and reference genomes, measured on a logarithmic scale, at each genomic location. These intensity ratios are informative about DNA copy number changes: we expect duplication (gain) for a positive ratio, deletion (loss) for a negative ratio, and a normal state for a neutral ratio. Due to the logarithmic scale and probe performance, the data can be approximated as a piecewise function of short and long intervals with different intensity levels that are not equally spaced along the genome. Moreover, microarray experiments suffer from many sources of error due to human factors, array printer performance, labeling, and hybridization efficiency.

According to the data description and the properties of the data generated by microarray technology, the DCN cell line can be approximated as a one-dimensional piecewise constant (PWC) discrete-time signal contaminated with some error. A good model of the genetic data generated by the aCGH technology is

*y*[*n*] = *x*[*n*] + *e*[*n*],  *n* = 1, 2, ..., *N*, (1)

where *x*[*n*] is the unknown PWC signal, measured in log2(ratio) and bounded by the breakpoints *n*<sub>*i*-1</sub> and *n*<sub>*i*</sub> with intensity levels *A*<sub>*i*</sub>, respectively, and *e*[*n*] is the measurement error.

**Figure 4.** Graphical representation of the generated data using aCGH technology. The red stars represent the raw data as described in (1). The grey solid line represents the true value of 4 variant segments that need to be estimated.
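As a quick illustration of model (1), the sketch below generates a PWC signal *x*[*n*] from a list of breakpoints and intensity levels and adds Gaussian noise to produce *y*[*n*]. The function name, segment boundaries, intensity values, and noise level are illustrative assumptions, not values taken from the chapter.

```python
import random

def simulate_acgh(breakpoints, levels, sigma=0.2, seed=0):
    """Simulate model (1): y[n] = x[n] + e[n], where x[n] is piecewise
    constant between consecutive breakpoints and e[n] is Gaussian noise.
    `breakpoints` are segment end indices n_i; `levels` are the A_i values.
    (Illustrative sketch; parameters are assumptions, not chapter settings.)"""
    rng = random.Random(seed)
    x, start = [], 0
    for end, level in zip(breakpoints, levels):
        x.extend([level] * (end - start))  # constant level A_i on [n_{i-1}, n_i)
        start = end
    y = [v + rng.gauss(0.0, sigma) for v in x]
    return x, y

# Example: 4 variant segments, loosely mimicking Figure 4 (illustrative values).
x, y = simulate_acgh(breakpoints=[25, 50, 75, 100],
                     levels=[0.0, 0.58, -1.0, 0.0])
```

Estimating the underlying segments from `y` alone is exactly the denoising/segmentation problem addressed in the next section.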

#### **3.1. Data Filtering**

Although recent advancements in microarray and sequencing technologies now make it easy to measure genetic variations at high resolution by scanning a large number of samples, small changes, particularly at the low-copy repeat (LCR) regions, remain difficult to detect due to different noise conditions. Thus, the challenging problem is to differentiate between the true biological signal and the noise in the measurements.

Various methods have been proposed as preprocessing techniques to tackle this problem. These methods have been motivated by either well-known signal processing techniques or statistics-based models.


**Table 2.** Comparison based on the computational complexity of the proposed denoising techniques.

In Table 2, we present a comparison based on the computational cost of the most recent and successful approaches. As can be noticed, the smoothing techniques are better suited than the statistical models to processing very large amounts of data, such as genetic signals. However, these techniques tend to include important features, such as the variant region boundaries, in the smoothing process, and may therefore blur them.

Here we present our previously proposed method (Alqallaf et al., 2007), the Sigma filter (SF). It is a nonlinear method used for feature extraction: it detects the edges of variant segments and smooths the rest of the genetic data. The filter is a conceptually simple but effective noise smoothing algorithm. Given the assumptions of the aCGH data model, the SF algorithm is well suited to denoising the tested samples before further analysis. The SF algorithm is motivated by the sigma probability of the Gaussian distribution: it smooths the noise by averaging only those neighboring points whose intensities lie within a fixed sigma range of the center data point. Consequently, variant segment edges are preserved, and subtle details are retained.
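A minimal sketch of the sigma-filter idea described above, assuming a symmetric window and a fixed two-sigma intensity range; the window size and threshold are illustrative parameters, not the chapter's settings.

```python
def sigma_filter(y, half_window=3, two_sigma=0.4):
    """Smooth a 1-D signal: replace each point by the average of those
    neighbors (within +/- half_window) whose intensity lies within
    two_sigma of the center value.  Points on the far side of a segment
    edge differ by more than two_sigma, so they are excluded and the
    edge is preserved.  (Illustrative sketch, not the authors' code.)"""
    n = len(y)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        close = [y[j] for j in range(lo, hi) if abs(y[j] - y[i]) <= two_sigma]
        out.append(sum(close) / len(close))  # center point is always included
    return out
```

In practice `two_sigma` would be tied to an estimate of the measurement noise (e.g. twice the estimated standard deviation), so that within-segment fluctuations are averaged out while between-segment jumps are not.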


Discovering the Genetics of Autism http://dx.doi.org/10.5772/ 53797 349


#### **3.2. Statistical significance**

Few studies in the literature have addressed the power of class discovery of recurrent copy number variations (CNVs) across multiple samples of genetic data [52, 53]. However, they did not consider denoising the data prior to applying the statistical analysis.

To reduce the dimensionality of the detected variant regions, we apply a simple statistics-based approach to measure the significance of the candidate genomic regions. The approach is based on the frequency difference between the case and control samples at each genomic location. It is used as a feature selection algorithm to select a small subset of variant segments as features for classification. Figure 5 illustrates three RCNVs of different sizes in filtered DCN data for multiple samples of normal-control (*C*<sub>*i*</sub>) and autistic (*A*<sub>*i*</sub>) individuals, respectively. After selecting the informative segments of the genome, we then applied comparative classification algorithms on the reduced data.
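The frequency-difference score described above can be sketched as follows, assuming each sample is represented as a 0/1 vector of aberration calls per segment; this encoding, the function names, and the toy matrices are illustrative assumptions, not the chapter's data format.

```python
def frequency_difference(case_calls, control_calls):
    """Score each variant segment by the absolute difference between the
    fraction of case samples and the fraction of control samples in which
    the segment is called aberrant (non-zero).  A large difference marks
    a segment as informative for case/control separation."""
    n_seg = len(case_calls[0])
    scores = []
    for j in range(n_seg):
        f_case = sum(1 for s in case_calls if s[j] != 0) / len(case_calls)
        f_ctrl = sum(1 for s in control_calls if s[j] != 0) / len(control_calls)
        scores.append(abs(f_case - f_ctrl))
    return scores

def top_k_segments(scores, k):
    """Indices of the k highest-scoring segments (the reduced feature set)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

# Toy example: 2 case and 2 control samples, 3 candidate segments.
cases = [[1, 0, 1], [1, 0, 0]]
controls = [[0, 0, 1], [0, 0, 1]]
selected = top_k_segments(frequency_difference(cases, controls), k=2)
```

Only the `selected` segments would then be passed on to the classifiers of the next section.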

**Figure 5.** Schematic representation of 3 recurrent copy number variant segments (RCNVs) with different lengths. The *x*-axis represents the genomic position and the *y*-axis represents the indices of the samples; *C*<sub>*i*</sub> is for normal-control samples and *A*<sub>*i*</sub> is for autistic samples, respectively. The vertical dashed lines represent the RCNV boundaries. The dark red and dark blue bars represent duplication and deletion for the corresponding chromosomal regions.

#### **3.3. Data classification**

Based on the collected and processed genetic data, we apply a system of classifiers to identify autistic individuals from their genetic information. This system will help improve the detection, identification, and diagnosis of autism, which will benefit both patients and society in general and will lead to earlier diagnosis and treatment.

Generally, classifiers are used by researchers faced with a classification task on given data. Classifiers are mathematical models able to perform classification or decision making based on previously provided data. Their ability to spot trends and relationships in large data sets makes them well suited for many applications. In the field of medicine, classifiers can be used to accurately classify diseases, genes, tumors, and other medical phenomena [54-60], and some attempts have been made to use classifiers in genetics [61]. Here, we use three comparative classifiers, namely the *k*-Nearest Neighbor, Neural Network, and Support Vector Machine, to help in diagnosing patients with ASD.

Leave-one-out cross-validation (LOOCV) is applied to evaluate the proposed classifiers by measuring how accurately they identify the association between the tested samples and the targeted disorder, ASD. LOOCV uses a single variant segment from the original sample as the validation data and the remaining segments as the training data. This is repeated so that each variant segment in the sample is used once as the validation data.

#### *3.3.1. k-Nearest Neighbor Classifier*

The *k*-Nearest Neighbor (*k*-NN) classifier [64] is a well known nonparametric classifier. To classify a new input *x*, the *k* nearest neighbors are retrieved from the training data. The in‐ put *x* is then labeled with the majority class label corresponding to the *k* nearest neighbors. For the *k*-NN classifier, we used the Euclidean distance as the distance metric, and the best *k* between 1 and 10 was found by performing LOOCV on the training data.
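A compact sketch of the *k*-NN classifier with LOOCV-based selection of *k*, as described above, using the Euclidean distance. The function names and the toy data are illustrative; this is not the authors' implementation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Label x with the majority class among its k nearest training
    points under the Euclidean distance."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(X, y, k):
    """Leave-one-out: each sample is classified from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

def best_k(X, y, k_range=range(1, 11)):
    """Pick k between 1 and 10 by LOOCV accuracy, as in the text."""
    return max(k_range, key=lambda k: loocv_accuracy(X, y, k))
```

The same `loocv_accuracy` loop serves as the evaluation protocol for the other classifiers as well.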

#### *3.3.2. Neural Network*


Neural networks are another type of classifier: mathematical models used for classification, regression, or decision making. Their structure is inspired by the human neural system and brain. A network consists of many neurons, interconnected at different stages, with information usually flowing from the input stage to the output stage. Each neuron has an input and an output, where an activation function converts the neuron's input to its output. The output of each neuron is connected to the next stage through a weighted connection, and a learning function determines the values of the weights of all the connections; the weights are updated based on a mathematical function that ties the network together. A neural network is therefore an adaptive network that changes its structure during the learning (training) phase, based on mathematical functions that relate the input data to the corresponding class labels. The neurons at the different layers and the weighted interconnections make up a complex network that is commonly referred to as a black box.

Before it is used to classify a test sample, the neural network is trained on a given data set with known classes or labels. During the training phase, the weights are updated to minimize the output error. The selected minimum acceptable error determines when training stops; for difficult data, where it is impossible to reach the set minimum error, the maximum number of epochs is used as the criterion for stopping the training process.
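A minimal one-hidden-layer network illustrating the mechanics described above: weighted connections, a sigmoid activation, gradient-descent weight updates, and the two stopping rules (minimum error or maximum epochs). The architecture, learning rate, and toy data are illustrative assumptions, not the chapter's configuration.

```python
import math
import random

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def _forward(W1, W2, x):
    """One pass: hidden activations, then the single sigmoid output."""
    xb = list(x) + [1.0]                       # input plus bias term
    h = [_sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    hb = h + [1.0]
    o = _sigmoid(sum(w * v for w, v in zip(W2, hb)))
    return xb, h, hb, o

def train_mlp(X, y, hidden=4, lr=0.5, min_error=0.01, max_epochs=5000, seed=1):
    """Train a one-hidden-layer sigmoid network with plain gradient descent.
    Training stops when the mean squared error falls below min_error, or
    after max_epochs, whichever comes first (the stopping rules above)."""
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(d + 1)] for _ in range(hidden)]
    W2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden + 1)]
    for _ in range(max_epochs):
        sq_err = 0.0
        for x, t in zip(X, y):
            xb, h, hb, o = _forward(W1, W2, x)
            sq_err += (t - o) ** 2
            do = (o - t) * o * (1.0 - o)       # output-layer delta
            dh = [do * W2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden + 1):        # update output weights
                W2[j] -= lr * do * hb[j]
            for j in range(hidden):            # update hidden weights
                for i in range(d + 1):
                    W1[j][i] -= lr * dh[j] * xb[i]
        if sq_err / len(X) < min_error:        # minimum-error stopping rule
            break
    return W1, W2

def mlp_predict(W1, W2, x):
    return 1 if _forward(W1, W2, x)[3] > 0.5 else 0
```

On a toy one-dimensional problem this converges in well under the epoch limit; on real segment features the hidden size and learning rate would need tuning.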

#### *3.3.3. Support Vector Machine*

The Support Vector Machine (SVM) belongs to a new generation of learning systems based on recent advances in statistical learning theory [65]. A linear SVM, which is used in our system, aims to find the separating hyperplane with the largest margin, defined as the sum of the distances from the hyperplane (implied by a linear classifier) to the closest positive and negative exemplars. The expectation is that the larger the margin, the better the generalization of the classifier. In the non-separable case, a linear SVM seeks a trade-off between maximizing the margin and minimizing the number of errors.
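The margin/error trade-off can be sketched with stochastic subgradient descent on the soft-margin hinge loss (a Pegasos-style update). This is a stand-in for a standard SVM solver, not the chapter's implementation; the regularization constant, epoch count, and toy data are illustrative, and the bias term is omitted for simplicity.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient descent on the soft-margin
    objective  lam/2 * ||w||^2 + mean(max(0, 1 - y_i * <w, x_i>)).
    Labels must be +1/-1.  The bias term is omitted for simplicity, so
    the data are assumed to be roughly centered."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)                       # decreasing step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1.0 - eta * lam) for wj in w]    # regularization: favor a wider margin
            if margin < 1.0:                            # hinge subgradient on margin violations
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def svm_predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1
```

The `lam` parameter controls the trade-off named in the text: larger values enforce a wider margin at the cost of more training errors.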

**Figure 6.** Comparison study of the performance of the three tested classifiers. The *x*-axis represents the number of segments and the *y*-axis represents the percentage average LOOCV accuracy.

Figure 6 illustrates the LOOCV classification accuracies of the tested classifiers, *k*-NN, NN, and SVM. The *x*-axis is associated with the number of selected top-ranked variant segments, and the *y*-axis shows the average LOOCV accuracy.

### **4. Validation of the predicted variant segments**

To evaluate the predictive power of our method in detecting and identifying patients with ASD, we use a molecular test, quantitative polymerase chain reaction (qPCR). It is a very sensitive and precise tool for the quantification of nucleic acids, able to detect and quantify very small amounts of a specific nucleic acid sequence. It is based on the PCR method, developed by Kary Mullis in the 1980s, which allows the amplification of a specific nucleic acid (DNA) sequence more than a billion-fold. Using qPCR, scientists can quantify the starting amount of a specific DNA sequence in the sample before amplification by PCR [62].

Quantitative PCR is an indispensable tool for researchers in various fields, including fundamental biology, molecular diagnostics, biotechnology, and forensic science. Critical points and limitations of qPCR-based assays must be considered to increase the reliability of the obtained data. Four detection technologies are commonly used for qPCR, all based on the measurement of fluorescence during the PCR. One principle relies on the intercalation of double-stranded DNA-binding dyes (the simplest and cheapest). The other three rely on the introduction of an additional fluorescence-labeled oligonucleotide (probe): detectable fluorescence is released only after cleavage of the probe (hydrolysis probes) or during hybridization of one (molecular beacon) or two (hybridization probes) oligonucleotides to the amplicon. The introduction of an additional probe increases the specificity of the quantified PCR product and allows the development of multiplex reactions. Other technologies have been described for qPCR detection [63].

The qPCR method quickly became the first choice for quantitative analysis of nucleic acids for several reasons. It is highly sensitive, allowing the detection of fewer than five copies (in some cases a single copy) of a target sequence. It has good reproducibility and a broad dynamic quantification range of at least 5 log units. It is also easy to use and offers reasonably good value for money (low consumable and instrumentation costs).

For the purpose of this chapter, we focus on one of the many applications of qPCR, indispensable for research and diagnostics: the detection of genetic variations.

| Array | CBS | SF |
|-------|-----|-----|
| 1 | 14 | 20 |
| 2 | 10 | 20 |
| 3 | 20 | 21 |
| 4 | 13 | 21 |

**Table 3.** Number of events (CNVs) detected by the circular binary segmentation (CBS) and sigma filtering (SF) methods, respectively, for 22 qPCR-confirmed CNVs.

Table 3 shows that the number of qPCR-confirmed CNVs detected by the sigma filtering (SF) method is considerably higher than the number detected by circular binary segmentation (CBS), ranging from 4.5% to 36% more across 4 different array experiments. The results show that applying an averaging window of 2 kb makes the algorithms well suited for detecting variations in high-density microarray data, especially at the LCR-rich regions.

### **5. Conclusion**

The etiology of autism spectrum disorders involves genetic and environmental risk factors. In this chapter, we have discussed the genetic basis of this complex disorder, autism. With the recent advances in the new screening technologies to investigate the entire genome such
