**3. Proposed method**

#### **3.1 Overview of the proposed framework**

In the proposed framework, we consider two different layers of networks in the GRN. One is the molecular interaction network at the factor-gene binding level. The other is the functional network that incorporates the consequences of these physical interactions, such as the activation or repression of transcription. We used three types of data to reconstruct the GRN, namely protein-protein interactions derived from a collection of public databases, protein-DNA interactions from the TRANSFAC database (Matys et al. 2006), and time course gene expression profiles. The first two data sources provided direct network information to constrain the GRN model. The gene expression profiles provided an unambiguous measurement of the causal effects of the GRN model. Gene ontology annotation describes the similarities between genes within one network, which facilitates further characterization of the relationships between genes. The goal is to discern dependencies between the gene expression patterns and the physical inter-molecular interactions revealed by complementary data sources.

The framework model for GRN inference is illustrated in Figure 1. Besides data preprocessing, three successive steps are involved in this framework as outlined in the following:

*Gene module selection*: genes with similar expression profiles are represented by a gene module to address the scalability problem in GRN inference (Ressom et al. 2003). The assumption is that a subset of genes that are related in terms of expression (co-expressed) can be grouped together by virtue of a unifying cis-regulatory element(s) associated with a common transcription factor regulating each and every member of the cluster (co-regulated) (Yeung et al. 2004). Gene ontology information is used to define the optimal number of clusters with respect to certain broad functional categories. Since each gene module identified from clustering analysis mainly represents one broad biological process or category as evaluated by FuncAssociate (Berriz et al. 2003), the regulatory network implies that a given transcription factor is likely to be involved in the control of a group of functionally related genes (De Hoon et al. 2002). This step is implemented by the method proposed in our previous work (Zhang et al. 2009).

*Network motif discovery*: to reduce the complexity of the inference problem, network motifs (NMs) are inferred individually rather than inferring a global GRN at once. The significant NMs in the combined molecular interaction network are first identified, and each is assigned to at least one transcription factor. These associations are then used to reconstruct the regulatory modules. This step is implemented using the FANMOD software (Wernicke and Rasche 2006); we describe it briefly in the following section.

*Gene regulatory module inference*: for each transcription factor assigned to a NM, a neural network (NN) is trained to model a GRN that mimics the associated NM. A genetic algorithm (GA) generates the candidate gene modules, and particle swarm optimization (PSO) is used to configure the parameters of the NN. Parameters are selected to minimize the root mean square error (RMSE) between the output of the NN and the target gene module's expression pattern. The RMSE is returned to the GA to produce the next generation of candidate gene modules. Optimization continues until either a pre-specified maximum number of iterations is completed or a pre-specified minimum RMSE is reached. The procedure is repeated for all transcription factors. Biological knowledge from public databases is used to evaluate the predicted results. This step is the main focus of this chapter.
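The outer loop of this third step can be sketched schematically. Everything below is a hypothetical skeleton, not the authors' implementation: `train_nn_with_pso` and `ga_step` stand in for the PSO-based NN training and GA reproduction described later, and are passed in as callables.

```python
def infer_regulatory_module(tf, modules, train_nn_with_pso, ga_step,
                            max_iter=100, rmse_target=0.01):
    """Outer GA-PSO loop: the GA proposes candidate downstream gene modules,
    PSO fits the NN for each candidate, and the RMSE is fed back to the GA."""
    population = ga_step(None, None, modules)          # initial candidate modules
    best, best_rmse = None, float("inf")
    for _ in range(max_iter):
        # RMSE of the PSO-trained NN for each candidate module combination
        scores = [train_nn_with_pso(tf, cand) for cand in population]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] < best_rmse:
            best, best_rmse = population[i], scores[i]
        if best_rmse <= rmse_target:                   # stopping criterion
            break
        population = ga_step(population, scores, modules)  # next generation
    return best, best_rmse
```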

220 Reverse Engineering – Recent Advances and Applications

A variety of computational models has been applied to GRN inference, including biochemically inspired models, linear models (Chen et al. 1999; D'Haeseleer et al. 1999), Boolean networks (Shmulevich et al. 2002), fuzzy logic (Woolf and Wang 2000; Ressom et al. 2003), Bayesian networks (Friedman et al. 2000), and recurrent neural networks (RNNs) (Zhang et al. 2008). Biochemically inspired models are developed on the basis of the reaction kinetics between the different components of a network. However, most of the biochemically relevant reactions involving proteins do not follow linear reaction kinetics, and the full network of regulatory reactions is very complex and hard to unravel in a single step. Linear models attempt to solve for a weight matrix that represents a series of linear combinations of the expression level of each gene as a function of the other genes; this problem is often underdetermined, since gene expression data usually have far fewer dimensions than the number of genes. In a Boolean network, the interactions between genes are modeled as Boolean functions. Boolean networks assume that genes are either "on" or "off" and attempt to solve the state transitions of the system. The validity of the assumption that genes occupy only one of these two states has been questioned by a number of researchers, particularly in the biological community. Woolf and Wang (2000) proposed an approach based on fuzzy rules of a known activator/repressor model of gene interaction. This algorithm transforms expression values into qualitative descriptors that are evaluated by a set of heuristic rules, and searches for regulatory triplets consisting of an activator, a repressor, and a target gene. This approach, though logical, is a brute-force technique for finding gene relationships; it involves significant computational time, which restricts its practical usefulness. In (Ressom et al. 2003), we proposed the use of clustering as an interface to a fuzzy logic–based method to improve the computational efficiency.

In a Bayesian network model, each gene is considered a random variable, and the edges between pairs of genes represent the conditional dependencies entailed by the network structure. Bayesian statistics are applied to find the network structure and corresponding model parameters that maximize the posterior probability of the structure given the data. Unfortunately, this learning task is NP-hard, and it also suffers from the underdetermination problem. The RNN model has received considerable attention because it can capture the nonlinear and dynamic aspects of gene regulatory interactions. Several algorithms have been applied for RNN training in network inference tasks, such as fuzzy logic (Maraziotis et al. 2005) and genetic algorithms (Chiang and Chao 2007).

#### **3.2 Network motif discovery**

The NM analysis is based on network representation of protein-protein interactions and protein-DNA interactions. A node represents both the gene and its protein product. A protein-protein interaction is represented by a bi-directed edge connecting the interacting proteins. A protein-DNA interaction is an interaction between a transcription factor and its target gene and is represented by a directed edge pointing from the transcription factor to its target gene.

All connected subnetworks containing three nodes in the interaction network are collated into isomorphic patterns, and the number of times each pattern occurs is counted. If the number of occurrences is at least five and significantly higher than in randomized networks, the pattern is considered as a NM. The statistical significance test is performed by

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 223


generating 1000 randomized networks and computing the fraction of randomized networks in which the pattern appears at least as often as in the interaction network, as described in (Yeger-Lotem et al. 2004). A pattern with p ≤ 0.05 is considered statistically significant. This NM discovery procedure was performed using the FANMOD software. For different organisms, different NMs may be identified. As shown in Figure 2 and Figure 3, different sets of NMs were detected in yeast and human, and the two sets shared some similar NM structures. For example, Figure 2(B) and Figure 3(C) are both feed-forward loops. This NM has been identified and studied in many organisms, including E. coli, yeast, and human. Knowledge of the NMs to which a given transcription factor belongs facilitates the identification of downstream target gene modules.
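The significance test described above (count motif occurrences, compare against degree-preserving randomized networks, and take the fraction as an empirical p-value) can be sketched for one motif type, the feed-forward loop. The helper names and the simple double-edge-swap randomization are our assumptions; FANMOD enumerates all three-node patterns rather than a single one.

```python
import random
from itertools import permutations

def count_ffl(edges):
    """Count feed-forward loops (a->b, b->c, a->c) in a directed graph."""
    eset = set(edges)
    nodes = {n for e in edges for n in e}
    return sum(1 for a, b, c in permutations(nodes, 3)
               if (a, b) in eset and (b, c) in eset and (a, c) in eset)

def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize a directed graph by repeated double-edge swaps,
    preserving every node's in- and out-degree."""
    edges = list(edges)
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(edges, 2)
        # swap targets unless it would create a self-loop or a duplicate edge
        if a != d and c != b and (a, d) not in edges and (c, b) not in edges:
            edges.remove((a, b)); edges.remove((c, d))
            edges += [(a, d), (c, b)]
    return edges

def motif_p_value(edges, n_random=1000, rng=None):
    """Empirical p-value: fraction of randomized networks in which the
    motif occurs at least as often as in the real network."""
    rng = rng or random.Random(0)
    observed = count_ffl(edges)
    hits = sum(count_ffl(degree_preserving_shuffle(edges, 4 * len(edges), rng)) >= observed
               for _ in range(n_random))
    return observed, hits / n_random
```

A pattern would then be reported as a NM when it occurs at least five times and its empirical p-value is at most 0.05, as in the text.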

Fig. 1. Schematic overview of the computational framework used for the gene regulatory network inference.

#### **3.3 Reverse engineering transcriptional regulatory modules**

In building NNs for inferring GRNs, identifying the correct downstream gene modules and determining the free parameters (weights and biases) that mimic the real data is a challenging task, given the limited quantity of available data. For example, in inferring a GRN from microarray data, the number of time points is considerably lower than the number of genes involved. Considering the complexity of the biological system, it is difficult to adequately describe pathways involving a large number of genes with few time points. We addressed this challenge by inferring GRNs at the NM module level instead of the gene level. Neural network models were built for all NMs detected in the molecular interaction data. A hybrid search algorithm, called GA-PSO, was proposed to select the candidate gene modules (output nodes) of a NN and to update its free parameters simultaneously. We describe the models and training algorithm in detail in the following sections.

Fig. 2. Four NMs discovered in yeast: (A) auto-regulatory motif; (B) feed-forward loop; (C) single input module; and (D) multi-input module.

## **3.3.1 Neural network model**


The neural network model is based on the assumption that the regulatory effect on the expression of a particular gene can be expressed as a neural network (Figure 4(A)), where each node represents a particular gene and the wirings between nodes define regulatory interactions. Each layer of the network represents the expression levels of the genes at time *t*. The output of a node at time *t* + 1 is derived from the expression levels at time *t* and the connection weights of all genes connected to the given gene. As shown in the figure, the output of each node is fed back to its input after a unit delay and is connected to the other nodes. The network is used as a model of a GRN: every gene in the network is considered a neuron, and the NN model captures not only the interactions between genes but also gene self-regulation.

Figure 4(B) illustrates the details of the *i*th self-feedback neuron (e.g., the *i*th gene in the GRN), where *v<sub>i</sub>*, known as the induced local field (activation level), is the sum of the weighted inputs (the regulation of other genes) to the neuron (the *i*th gene), and *φ*(·) represents an activation function (the integrated regulation of the whole NN on the *i*th gene), which transforms the activation level of a neuron into an output signal (the regulation result). The induced local field and the output of the neuron, respectively, are given by:

$$v_i(T_j) = \sum_{k=1}^{N} w_{ik}\, x_k(T_{j-1}) + b_i(T_j) \tag{1}$$

$$x_i(T_j) = \phi(v_i(T_j)) \tag{2}$$

where the synaptic weights *w*<sub>*i*1</sub>, *w*<sub>*i*2</sub>, …, *w*<sub>*iN*</sub> define the strength of the connections between the *i*th neuron (e.g., the *i*th gene) and its inputs (e.g., the expression levels of genes). Such synaptic weights exist between all pairs of neurons in the network. *b<sub>i</sub>*(*T<sub>j</sub>*) denotes the bias for the *i*th neuron at time *T<sub>j</sub>*. We denote by *w* a weight vector that consists of all the synaptic weights and biases in the network; *w* is adapted during the learning process to yield the desired network outputs. The activation function *φ*(·) introduces nonlinearity into the model. When information about the complexity of the underlying system is available, a suitable activation function can be chosen (e.g., linear, logistic sigmoid, threshold, hyperbolic tangent sigmoid, or Gaussian). If no prior information is available, our algorithm uses the hyperbolic tangent sigmoid function.
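Equations (1) and (2) amount to a single weighted-sum step passed through the activation function. A minimal sketch follows; the function name `rnn_step` and the plain-list representation are our own, while the hyperbolic tangent default follows the text.

```python
import math

def rnn_step(x_prev, W, b, phi=math.tanh):
    """One time step of the recurrent NN model (Eqs. 1-2):
    v_i(T_j) = sum_k w_ik * x_k(T_{j-1}) + b_i(T_j), then x_i(T_j) = phi(v_i)."""
    n = len(x_prev)
    return [phi(sum(W[i][k] * x_prev[k] for k in range(n)) + b[i])
            for i in range(n)]
```

Iterating `rnn_step` from an initial expression vector produces the model's predicted trajectory, which is what the cost function below compares against the measured profiles.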

Fig. 3. Four NMs discovered in human: (A) multi-input module; (B) single input module; (C) feed-forward loop - 1; and (D) feed-forward loop - 2.

As a cost function, we use the RMSE between the expected output and the network output across time and neurons in the network. The cost function is written as:


$$E(\vec{w}) = \sqrt{\frac{1}{Nm} \sum_{t=T_1}^{T_m} \sum_{i=1}^{N} \left[ x_i(t) - \hat{x}_i(t) \right]^2} \tag{3}$$

where *x<sub>i</sub>*(*t*) and *x̂<sub>i</sub>*(*t*) are the true and predicted values (expression levels) for the *i*th neuron (gene) at time *t*. The goal is to determine the weight vector *w* that minimizes this cost function. This is a challenging task if the size of the network is large and only a few samples (time points) are available.
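Eq. (3) can be computed directly; `rmse_cost` and the list-of-time-points layout are illustrative assumptions.

```python
import math

def rmse_cost(x_true, x_pred):
    """Eq. (3): RMSE across N genes and m time points.
    x_true, x_pred: lists of time points, each a list of N expression values."""
    m, n = len(x_true), len(x_true[0])
    sq = sum((xt - xp) ** 2
             for row_t, row_p in zip(x_true, x_pred)
             for xt, xp in zip(row_t, row_p))
    return math.sqrt(sq / (n * m))
```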

For each NM shown in Figures 2 and 3, a corresponding NN is built. Figure 5 presents the detailed NN model for each NM in Figure 2. For the auto-regulatory motif, the NN model is the same as for the single input module, except that its downstream gene module contains the transcription factor itself. Since the expression level of a gene module is the mean expression level of its member genes, the input node and output node have different expression profiles; this avoids an open-loop problem that could compromise the stability of the model. For single input and multi-input modules, instead of selecting multiple downstream gene modules simultaneously, we build NN models to find candidate gene modules one at a time. The selected gene modules are then merged to build the final NM gene regulatory modules.
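As a small illustration of the module-expression convention above (a module's profile is the per-time-point mean over its member genes), with hypothetical names:

```python
def module_profile(expr, members):
    """Expression profile of a gene module: per-time-point mean over member genes.
    expr: dict mapping gene name -> list of expression values over time."""
    rows = [expr[g] for g in members]
    return [sum(col) / len(rows) for col in zip(*rows)]
```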

Fig. 4. Architecture of a fully connected NN (A) and details of a single recurrent neuron (B).


Based on the NMs to which a given transcription factor belongs, the next step is to identify the target gene modules and the relationships between them. To solve this problem, we propose a hybrid GA-PSO training algorithm for the NN model.

#### **3.3.2 Genetic algorithm**

Genetic algorithms are stochastic optimization approaches that mimic representation and variation mechanisms borrowed from biological evolution, such as selection, crossover, and mutation (Mitchell 1998). In a GA, a candidate solution is represented as a linear string analogous to a biological chromosome. The general scheme of a GA starts from a population of randomly generated candidate solutions (chromosomes). Each chromosome is evaluated and given a value corresponding to its fitness in the objective function space. In each generation, chromosomes are chosen based on their fitness to reproduce offspring; chromosomes with high fitness are more likely to be retained, while those with low fitness tend to be discarded. This process is called selection. After selection, offspring chromosomes are constructed from parent chromosomes using operators that resemble the crossover and mutation mechanisms of evolutionary biology. The crossover operator, sometimes called recombination, produces new offspring chromosomes that inherit information from both parents by combining partial sets of elements from them. The mutation operator randomly changes elements of a chromosome with a low probability. Over multiple generations, chromosomes with higher fitness values survive, following the principle of survival of the fittest. A detailed description of the GA is shown in Figure 6. In this chapter, we propose to apply GA to select the most suitable downstream gene module(s) for each transcription factor and the NN model(s) that mimic its NM(s). The Genetic Algorithm and Direct Search Toolbox (MathWorks, Natick, MA) is used for the implementation of the GA.

Fig. 5. NN models mimicking the topologies of the four NMs shown in Figure 2. Z<sup>-1</sup> denotes a unit delay and *φ*(·) is a logistic sigmoid activation function.
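The GA scheme just described (selection of the fitter half, one-point crossover, low-probability bit-flip mutation) can be sketched as follows. The function `evolve`, the binary encoding of which gene modules are selected, and all rates are illustrative assumptions, not the toolbox implementation used by the authors.

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=50, p_mut=0.05, rng=None):
    """Minimal GA sketch: truncation selection, one-point crossover, and
    bit-flip mutation over binary chromosomes (1 = module selected)."""
    rng = rng or random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]                      # selection
        children = []
        while len(children) < pop_size - len(elite):
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_genes)                  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g ^ (rng.random() < p_mut) for g in child]  # mutation
            children.append(child)
        pop = elite + children                               # next generation
    return max(pop, key=fitness)
```

In the chapter's setting, the fitness of a chromosome would be the (negated) RMSE of the PSO-trained NN for the corresponding module combination.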

#### **3.3.3 Particle swarm optimization**


After the candidate downstream gene modules are selected by the GA, PSO is used to determine the parameters of the NN model. Particle swarm optimization is motivated by the flocking behavior of birds and the schooling behavior of fish, and was originally intended to explore optimal or near-optimal solutions in sophisticated continuous spaces (Kennedy and Eberhart 1995). Its main difference from other evolutionary algorithms (e.g., GA) is that PSO relies on cooperation rather than competition: good solutions are shared with less-fit members so that the entire population improves.

Fig. 6. A detailed description of GA.

Particle swarm optimization consists of a swarm of particles, each of which represents a candidate solution. Each particle is represented as a D-dimensional position vector *w<sub>i</sub>*, with a corresponding D-dimensional instantaneous trajectory (velocity) vector Δ*w<sub>i</sub>*(*t*) describing its direction of motion in the search space at iteration *t*. The index *i* refers to the *i*th particle. The core of the PSO algorithm is the position update rule (Eq. (4)), which governs the movement of each of the *n* particles through the search space.

$$
\vec{w}\_i(t+1) = \vec{w}\_i(t) + \Delta \vec{w}\_i(t+1) \tag{4}
$$

$$
\Delta \vec{w}\_i(t+1) = \chi \{ \Delta \vec{w}\_i(t) + \Phi\_1(\vec{w}\_{i,best}(t) - \vec{w}\_i(t)) + \Phi\_2(\vec{w}\_{G,best}(t) - \vec{w}\_i(t)) \} \tag{5}
$$

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 229

new individual best positions, then replace previous individual best positions, *wi best* ,

5. Determine if any ( 1) *w t <sup>i</sup>* are outside of the specified bounds; hold positions of

If termination criterion is met (for example completed maximum number of iterations), then

Selection of appropriate values for the free parameters of PSO plays an important role in the algorithm's performance. In our study, the parameters setting is presented in Table 1, and the constriction factor *χ* is determined by Eq.(6). The PSOt Toolbox (Birge 2003) was used for

A hybrid of GA and PSO methods (GA-PSO) is applied to determine the gene modules that may be regulated by each transcription factor. Genetic algorithm generates candidate gene modules, while the PSO algorithm determines the parameters of a given NN represented by a weight vector . The RMSE between the NN output and the measured expression profile is returned to GA as a fitness function and to guide the selection of target genes through reproduction, cross-over, and mutation over hundreds of generations. The stopping criteria are pre-specified minimum RMSE or maximum number of generations. The GA-PSO algorithm is run for each transcription factor to train a NN that has the architecture mimicking the identified known NM(s) for the transcription factor. Thus, for a given transcription factor (input), the following steps are carried out to identify its likely

2. Use the following GA-PSO algorithm to build a NN model that mimics the NM to

a. Generate combinations of M gene modules to represent the target genes that may be regulated by the transcription factor. Each combination is a vector/chromosome. The initial set of combinations is composed of the initial population of chromosomes. b. Use the PSO algorithm to train a NN model for each chromosome, where the input is the transcription factor and the outputs are gene modules. The goal is to

.

, *i*=1,2,…,n, distributed randomly (uniform)

; if any particles have located

,

1. Generate initial population of particles, *wi*

2. Evaluate each particle with the objective function, ( )*<sup>i</sup> f w*

and keep track of the swarm global best position, *wG best* ,

3. Determine new trajectories, ( 1) *w t <sup>i</sup>* , according to Eq. (5).

4. Update each particle position according to Eq. (4).

is the best solution found; otherwise, go to step (2).

Parameter Value Maximum velocity, Vmax 2 Maximum search space range, |Wmax| [-5,5] Acceleration constants, c1 & c2 2.05, 2.05 Size of Swarm 20

downstream gene modules (output) based on their NM(s): 1. Assign the NM to the transcription factor it belongs to.

identify the downstream gene modules.

particles within the specified bounds.

*wG best* ,

implementation of PSO.

Table 1. PSO parameter setting

**3.3.4 GA-PSO training algorithm** 

within the specified bounds.

$$\begin{array}{ccccc} \text{where} & \Phi\_1 = c\_1 \begin{bmatrix} r\_{1,1} \\ & r\_{1,2} \\ & & \ddots \\ & & & r\_{1,D} \end{bmatrix} \end{array} \quad \text{and} \quad \Phi\_2 = c\_2 \begin{bmatrix} r\_{2,1} \\ & r\_{2,2} \\ & & \ddots \\ & & & r\_{2,D} \end{bmatrix}$$

At any instant, each particle is aware of its individual best position, , ( ) *w t i best* , as well as the best position of the entire swarm, , ( ) *w t G best* . The parameters *c1* and *c2* are constants that weight particle movement in the direction of the individual best positions and global best positions, respectively; and *r*1,*<sup>j</sup>* and *r*2,*<sup>j</sup>*, *j D* 1,2, are random scalars distributed uniformly between 0 and 1, providing the main stochastic component of the PSO algorithm. Figure 7 shows a vector diagram of the contributing terms of the PSO trajectory update. The new change in position, ( 1) *w t <sup>i</sup>* , is the resultant of three contributing vectors: (i) the inertial component, ( ) *w t <sup>i</sup>* , (ii) the movement in the direction of individual best, , ( ) *w t i best* , and (iii) the movement in the direction of the global (or neighborhood) best, , ( ) *w t G best* .

The constriction factor, $\chi$, may also help to ensure convergence of the PSO algorithm, and is set according to the weights $c_1$ and $c_2$ as in Eq. (6).

$$\chi = \frac{2}{\left| 2 - \varphi - \sqrt{\varphi^2 - 4\varphi} \right|}, \quad \varphi = c_1 + c_2, \quad \varphi > 4 \tag{6}$$
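Eq. (6) is easy to check numerically. The helper below is a minimal sketch (the function name is chosen for illustration); with the Table 1 settings $c_1 = c_2 = 2.05$, so $\varphi = 4.1$, it yields $\chi \approx 0.73$:

```python
from math import sqrt

def constriction_factor(c1: float, c2: float) -> float:
    """Constriction factor chi of Eq. (6); defined only for phi = c1 + c2 > 4."""
    phi = c1 + c2
    if phi <= 4:
        raise ValueError("Eq. (6) requires phi = c1 + c2 > 4")
    return 2.0 / abs(2.0 - phi - sqrt(phi ** 2 - 4.0 * phi))

# With c1 = c2 = 2.05, chi is approximately 0.7298, which damps the
# trajectory update of Eq. (5) and helps the swarm converge.
```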

The key strength of the PSO algorithm is the interaction among particles. The second term in Eq. (5), $\Phi_2 (w_{G,best}(t) - w_i(t))$, is considered to be a "social influence" term. While this term tends to pull the particle towards the globally best solution, the first term, $\Phi_1 (w_{i,best}(t) - w_i(t))$, allows each particle to think for itself. The net combination is an algorithm with an excellent trade-off between total swarm convergence and each particle's capability for global exploration. Moreover, the relative contribution of the two terms is weighted stochastically.

The algorithm consists of repeated application of the velocity and position update rules presented above. Termination occurs by specification of a minimum error criterion, maximum number of iterations, or alternately when the position change of each particle is sufficiently small as to assume that each particle has converged.

Fig. 7. Vector diagram of particle trajectory update.

A pseudo-code description of the PSO algorithm is provided below:

1. Initialize a population (swarm) of particles with random positions within the specified bounds.
2. Evaluate each particle with the objective function, $f(w_i)$, and keep track of the swarm global best position, $w_{G,best}$.
3. Determine the new trajectories, $\Delta w_i(t+1)$, according to Eq. (5).
4. Update each particle position according to Eq. (4).
5. If the termination criterion is met (for example, the maximum number of iterations is completed), then $w_{G,best}$ is the best solution found; otherwise, go to step (2).
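The five pseudo-code steps translate into a compact loop. The sketch below is a minimal illustration, not the PSOt Toolbox implementation: the objective `f` is arbitrary, the bounds, swarm size, and $V_{max}$ follow Table 1, and the velocity-clamping details are assumptions.

```python
import numpy as np

def pso_minimize(f, D, n_particles=20, bounds=(-5.0, 5.0), v_max=2.0,
                 c1=2.05, c2=2.05, chi=0.7298, max_iter=200, tol=1e-8):
    """Minimize f over a D-dimensional weight vector with constriction PSO."""
    rng = np.random.default_rng(42)
    lo, hi = bounds
    # Step 1: random positions (and velocities) within the specified bounds.
    w = rng.uniform(lo, hi, (n_particles, D))
    dw = rng.uniform(-v_max, v_max, (n_particles, D))
    w_best = w.copy()
    f_best = np.full(n_particles, np.inf)
    for _ in range(max_iter):
        # Step 2: evaluate particles; track individual and global bests.
        fw = np.array([f(p) for p in w])
        improved = fw < f_best
        w_best[improved], f_best[improved] = w[improved], fw[improved]
        g_best = w_best[np.argmin(f_best)]          # w_{G,best}
        # Step 5: terminate on a minimum-error criterion.
        if f_best.min() < tol:
            break
        # Step 3: new trajectories per Eq. (5), clamped to +/- V_max.
        phi1 = c1 * rng.uniform(size=(n_particles, D))
        phi2 = c2 * rng.uniform(size=(n_particles, D))
        dw = np.clip(chi * (dw + phi1 * (w_best - w) + phi2 * (g_best - w)),
                     -v_max, v_max)
        # Step 4: update positions, kept inside the search range (Eq. (4)).
        w = np.clip(w + dw, lo, hi)
    return g_best, float(f_best.min())
```

On a convex test function such as the 2-D sphere, `pso_minimize(lambda x: float(np.sum(x * x)), D=2)` drives the best objective value close to zero well within the iteration budget.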

Selection of appropriate values for the free parameters of PSO plays an important role in the algorithm's performance. In our study, the parameter settings are presented in Table 1, and the constriction factor $\chi$ is determined by Eq. (6). The PSOt Toolbox (Birge 2003) was used for the implementation of PSO.


| Parameter | Value |
|---|---|
| Maximum velocity, $V_{max}$ | 2 |
| Maximum search space range, $\lvert W_{max} \rvert$ | [-5, 5] |
| Acceleration constants, $c_1$ and $c_2$ | 2.05, 2.05 |
| Size of swarm | 20 |

Table 1. PSO parameter setting

228 Reverse Engineering – Recent Advances and Applications


#### **3.3.4 GA-PSO training algorithm**

A hybrid of the GA and PSO methods (GA-PSO) is applied to determine the gene modules that may be regulated by each transcription factor. The genetic algorithm generates candidate gene modules, while the PSO algorithm determines the parameters of a given NN represented by a weight vector *w*. The RMSE between the NN output and the measured expression profile is returned to the GA as a fitness function to guide the selection of target genes through reproduction, cross-over, and mutation over hundreds of generations. The stopping criteria are a pre-specified minimum RMSE or a maximum number of generations. The GA-PSO algorithm is run for each transcription factor to train a NN whose architecture mimics the identified known NM(s) for that transcription factor. Thus, for a given transcription factor (input), the following steps are carried out to identify its likely downstream gene modules (output) based on their NM(s):

	- a. Assign the NM to the transcription factor it belongs to.
	- b. Generate combinations of M gene modules to represent the target genes that may be regulated by the transcription factor. Each combination is a vector/chromosome, and the initial set of combinations composes the initial population of chromosomes.
	- c. Use the PSO algorithm to train a NN model for each chromosome, where the input is the transcription factor and the outputs are gene modules. The goal is to determine the optimized parameters of the NN that maps the measured expression profiles of the transcription factor to the gene modules.

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 231

#### **4.1.3 Network motif discovery**

Among the 800 cell cycle related genes, 85 have been identified as DNA-binding transcription factors (Zhang et al. 2008). Four NMs were considered significant: the autoregulatory motif, feed-forward loop, single input module, and multi-input module (Figure 2). These NMs were used to build NN models for the corresponding transcription factors.

#### **4.1.4 Reverse engineering gene regulatory networks**

Neural network models that mimic the topology of the NMs were constructed to identify the relationships between transcription factors and putative gene modules.
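To make the motif-to-architecture mapping concrete, the sketch below writes the forward pass of a NN whose topology mimics a single input module (one transcription factor driving several gene modules). The single tanh hidden unit per output and the weight packing are illustrative assumptions, not the authors' exact architecture; the RMSE function is the kind of objective that would be handed to PSO over the packed weight vector.

```python
import numpy as np

def sim_nn_forward(tf_profile, w):
    """Forward pass of a NN mimicking a single input module (SIM) motif:
    one TF input feeds M gene-module outputs through one hidden unit each.
    w packs, per output module, [input weight, hidden bias, output weight].
    """
    w = np.asarray(w).reshape(-1, 3)                      # (M, 3)
    # Hidden activation per module: tanh(w_in * tf(t) + bias), shape (M, T).
    h = np.tanh(np.outer(w[:, 0], tf_profile) + w[:, 1:2])
    return w[:, 2:3] * h                                  # (M, T) outputs

def rmse(w, tf_profile, targets):
    """Objective for PSO: RMSE between NN output and measured profiles."""
    return float(np.sqrt(np.mean((sim_nn_forward(tf_profile, w) - targets) ** 2)))
```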


Fig. 8. Predicted gene regulatory modules from eight known cell cycle dependent transcription factors in the yeast cell cycle dataset. The left panel presents the four gene regulatory modules, and the right panel depicts inferred gene regulatory modules for eight known cell cycle dependent transcription factors.





When the process is completed, NM regulatory modules are constructed between transcription factors and their regulated gene modules.
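As a concrete end-to-end illustration of the GA-PSO procedure of Section 3.3.4, the sketch below evolves chromosomes (combinations of M gene modules) with a simple GA while an inner constriction-PSO fits the NN weights and returns the RMSE as fitness. Everything here is a minimal stand-in, not the authors' implementation: the expression profiles are synthetic, the "NN" is one weight per output module, and the selection, crossover, and mutation operators are simplified, so it illustrates the data flow rather than the exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for measured expression profiles: one transcription
# factor (input) and 10 candidate gene modules over 12 time points.
T, N_MODULES, M = 12, 10, 2
tf = rng.normal(size=T)
modules = rng.normal(size=(N_MODULES, T))
modules[2] = 1.5 * tf + 0.05 * rng.normal(size=T)     # truly regulated
modules[7] = -0.8 * tf + 0.05 * rng.normal(size=T)    # truly regulated

def rmse_fitness(combo):
    """Fitness of one chromosome: train the NN for this module combination
    by PSO and return the RMSE between NN output and measured profiles."""
    y = modules[list(combo)]                            # (M, T) targets
    def f(w):
        return float(np.sqrt(np.mean((np.outer(w, tf) - y) ** 2)))
    # Inner PSO over the weight vector (constriction form, Table 1 values).
    P, D, chi = 20, len(combo), 0.7298
    w = rng.uniform(-5, 5, (P, D))
    dw = np.zeros((P, D))
    best_w, best_f = w.copy(), np.array([f(p) for p in w])
    for _ in range(60):
        g = best_w[np.argmin(best_f)]                   # swarm global best
        phi1 = 2.05 * rng.uniform(size=(P, D))
        phi2 = 2.05 * rng.uniform(size=(P, D))
        dw = np.clip(chi * (dw + phi1 * (best_w - w) + phi2 * (g - w)), -2, 2)
        w = np.clip(w + dw, -5, 5)
        fw = np.array([f(p) for p in w])
        upd = fw < best_f
        best_w[upd], best_f[upd] = w[upd], fw[upd]
    return float(best_f.min())

# Outer GA over combinations of M modules (chromosome = module indices).
pop = [tuple(rng.choice(N_MODULES, M, replace=False)) for _ in range(12)]
for _ in range(10):
    parents = sorted(pop, key=rmse_fitness)[:6]         # selection by RMSE
    children = []
    for _ in range(6):                                  # crossover + mutation
        a, b = rng.choice(6, 2, replace=False)
        genes = list(set(parents[a]) | set(parents[b]))
        child = list(rng.choice(genes, M, replace=False))
        if rng.random() < 0.3:                          # point mutation
            child[rng.integers(M)] = int(rng.integers(N_MODULES))
        children.append(tuple(child))
    pop = parents + children
best_combo = min(pop, key=rmse_fitness)
```

On this synthetic data the truly regulated modules (indices 2 and 7) are fit with much lower RMSE than unrelated ones, which is exactly what drives the GA's selection of target gene modules.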
