**Multi-Scale Information, Network, Causality, and Dynamics: Mathematical Computation and Bayesian Inference to Cognitive Neuroscience and Aging**

Michelle Yongmei Wang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/55262

### **1. Introduction**

The human brain is estimated to contain 100 billion or so neurons and 10 thousand times as many connections. Neurons never function in isolation: each of them is connected to about 10,000 others, and they interact extensively every millisecond. Brain cells are organized into neural circuits, often in a dynamic way, processing specific types of information and providing the foundation of perception, cognition, and behavior. Brain anatomy and activity can be described at various levels of resolution and are organized on a hierarchy of scales, ranging from molecules to organisms and spanning 10 and 15 orders of magnitude in space and time, respectively. Different dynamic processes on local and global scales generate multiple levels of segregation and integration, and lead to spatially distinct patterns of coherence. At each scale, neural dynamics is determined by processes at the same scale, as well as at smaller and larger scales, with no scale being privileged over others. These scales interact with each other and are mutually dependent; their coordinated action yields the overall functional properties of cells and organisms.

An ultimate goal of neuroscience is to understand the brain's driving forces and organizational principles, and how nervous systems function together to generate behavior. This raises a challenging issue for researchers in the neuroscience community: integrating the diverse knowledge derived from multiple levels of analysis into a coherent understanding of brain structure and function. The accelerating availability of neuroscience data is placing heavy demands on mining and modeling methods. These data are generated at different resolutions of description, for example, from neuron spike trains to electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI). A key theme in modern neuroscience is to move from localization of function to characterization of brain networks; mathematical approaches aiming at extracting directed causal connectivity from neural or neuroimaging signals are increasingly in demand. Despite differences in the spatiotemporal scales of the brain signals, the data analysis and modeling share some fundamental computational strategies.

© 2013 Wang; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Among the diverse computational methods, probabilistic modeling and Bayesian inference play a significant role, and can contribute to neuroscience from different perspectives. Bayesian approaches can be used to analyze or decode brain signals such as spike trains and structural and functional neuroimaging data. Normative predictions can be made regarding how an ideal perceptual system integrates prior knowledge with sensory observations, and thus enable principled interpretations of data from behavioral and psychological experiments. Moreover, algorithms for Bayesian estimation could provide mechanistic interpretations of neural circuits and cognition in the brain. In addition, a better understanding of the brain's computational mechanisms would have a synergistic impact on developing novel algorithms in Bayesian computation, resulting in new technologies and applications.

This chapter reviews and categorizes a variety of mathematical and statistical approaches for measuring and estimating information, networks, causality, and dynamics in the multi-scale brain. Specifically, in Section 3, we introduce the fundamentals of information theory and the extended concepts and metrics for describing information processing in the brain, with validity and applications demonstrated on neural signals from multiple scales and in aging research. Bayesian inference for neuroimaging data analysis, and cognition modeling of observations from psychological and behavioral experiments, as well as the corresponding neural/neuronal underpinnings, are provided in Section 4. Graphical models, Bayesian and dynamic Bayesian networks, and some new developments, together with their applications in detecting causal connectivity and longitudinal morphological changes, are presented in Section 5. We illustrate attractor dynamics and the associated interpretations for the aging brain in Section 6. Conclusions and future directions are given in Section 7.

### **2. Neuroscience data/signals and brain connectivity**

#### **2.1. Recording and imaging techniques at multiple scales**

An important breakthrough in the study of neuronal activity and neurotransmission was the electrophysiological recording of single neurons in the intact brain of an awake or anesthetized animal, or in an explanted piece of tissue [1]. Such recordings have extremely high spatial (micrometer) and temporal (millisecond) resolution and allow direct observation of the electrical currents and potentials generated by single nerve cells. This comes at considerable cost, however, since all cellular recording techniques are highly invasive, requiring surgical intervention and placement of recording electrodes within brain tissue. Neurons communicate via action potentials, or spikes; neural recordings are usually transformed into series of discrete spiking events that can be characterized in terms of rate and timing. Less direct observations of electrical brain activity come from the electromagnetic potentials generated by the combined electrical currents of large neuronal populations, i.e., electroencephalography (EEG) and magnetoencephalography (MEG). These techniques are non-invasive, as recordings are made through sensors placed on, or near, the surface of the head. EEG and MEG directly record signals of neuronal activity and thus have high temporal resolution. Their spatial resolution is relatively poor, however, because neither technique allows an unambiguous reconstruction of the electrical sources responsible for the recorded signal. EEG and MEG signals are therefore often processed in sensor space, as sources are difficult to localize in anatomical space.


182 Functional Brain Mapping and the Endeavor to Understand the Working Brain


With the development of magnetic resonance imaging (MRI) in the 1980s [2], brain imaging took a huge step forward. The strong magnetic field and radiofrequency pulses used in MRI scanning are harmless, making this technique completely noninvasive. MRI is also extremely versatile: by changing the scanning parameters, we can acquire images based on a wide variety of contrast mechanisms. For example, diffusion MRI is an MRI method that maps the diffusion of molecules, mainly water, in biological tissues, in vivo and non-invasively. Water molecule diffusion patterns can consequently reveal microscopic details about tissue architecture in the brain. Functional magnetic resonance imaging (fMRI) measures hemodynamic signals, which are only indirectly related to neural activity. These techniques allow the reconstruction of spatially localized signals at millimeter-scale resolution across the imaged brain volume. In fMRI, the primary measure of activity is the contrast between the magnetic susceptibility of oxygenated and deoxygenated hemoglobin within each voxel; it is therefore called the *blood oxygen level-dependent* (BOLD) signal. The BOLD signal can only be viewed as an indirect measure of neural activity. In addition, the slow time constants of the BOLD response result in poor temporal resolution, on the order of seconds. A critical objective of neuroimaging data analysis is the inference of the neural processes responsible for the observed data, that is, the estimation of the hemodynamic response functions.

Neural signals recorded via the above techniques differ significantly in both spatial and temporal resolution and in the directness with which neuronal activity is detected. Simultaneously using two or more recording methods within the same experiment can reveal how different neural or metabolic signals are interrelated [3]. Each technique measures a different aspect of neural dynamics and organization, and the interpretation of neural data sets must take these differences into account. All methods for observing brain structure and function have advantages but also disadvantages: some methods provide great structural detail but are invasive or cover only a small part of the brain, while others may be noninvasive but have poor spatial or temporal resolution. Nervous systems are organized at multiple scales, from synaptic connections between single cells, to the organization of cell populations within individual anatomical regions, and finally to the large-scale architecture of brain regions and their interconnections or network connectivity. Different techniques are sensitive to different levels of organization. The multi-scale aspect of the nervous system is an essential feature of its organization and network architecture [4].

#### **2.2. Categorization of brain network connectivity**

Given the diverse techniques for observing the brain, there are many different ways to describe and measure brain connectivity [5, 6]. Brain connectivity can be derived from histological sections revealing anatomical connections, from electrical recordings of single nerve cells, or from functional imaging of the entire brain. Even with a single recording technique, different ways of processing and analyzing neural data may result in different descriptions of the underlying network. Structural connectivity is a wiring diagram of physical links, while functional connectivity describes dynamic interactions. A third class of brain networks is effective connectivity, which encompasses the network of directed interactions between neural elements. Effective connectivity goes beyond structural and functional connectivity by detecting patterns of causal influence among neural elements. These three main types of brain connectivity are defined more precisely below.


*Structural connectivity* refers to a set of physical or structural (anatomical) connections that link neural elements. These anatomical connections range in scale from those of local circuits of single cells to large-scale networks of interregional pathways. Their physical pattern can be treated as relatively static at shorter time scales (seconds to minutes) but may be dynamic at longer time scales (hours to days).

*Functional connectivity* describes patterns of deviations from statistical independence between distributed and often spatially remote neuronal units. The basis of functional connectivity is time series data from neural recordings such as cellular recording, EEG, MEG, and fMRI. Deviations from statistical independence typically indicate dynamic coupling and can be measured by estimating the correlation or covariance, spectral coherence, or other metrics. Functional connectivity is highly time dependent and can be statistically nonstationary. It is also modulated by external task demands and sensory stimulation, as well as by the internal state of the organism. However, functional connectivity does not make any explicit reference to causal effects among neural elements.

*Effective connectivity* captures the network of causal effects between neural elements, and can be inferred through time series analysis, statistical modeling, or experimental perturbation. Like functional connectivity, effective connectivity is time dependent and can be rapidly modulated by external stimuli or tasks, and by internal state. Some methods for inferring effective connectivity are model-free and do not assume anatomical pathways, while others require the specification of an explicit causal model including structural parameters. In general, the estimation of effective connectivity needs complex data processing and modeling techniques. Thus, in this chapter, regarding the networks, I mainly review strategies for the estimation of effective connectivity, or causal inference.
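As a rough illustration of the correlation-based measures of functional connectivity mentioned above, the following sketch builds a pairwise correlation matrix from simulated time series. The three "regions" and their coupling are hypothetical toy data, and Pearson correlation is only one of the possible metrics; this is not the chapter's analysis pipeline.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def connectivity_matrix(signals):
    """Symmetric matrix of pairwise correlations between recorded signals."""
    k = len(signals)
    return [[pearson(signals[i], signals[j]) for j in range(k)] for i in range(k)]


# Hypothetical toy data: region B follows region A with added noise,
# while region C is generated independently of both.
rng = random.Random(0)
region_a = [rng.gauss(0, 1) for _ in range(500)]
region_b = [x + 0.5 * rng.gauss(0, 1) for x in region_a]
region_c = [rng.gauss(0, 1) for _ in range(500)]

fc = connectivity_matrix([region_a, region_b, region_c])
for row in fc:
    print(" ".join(f"{value:6.2f}" for value in row))
```

The matrix is symmetric by construction, which reflects the point made above: a correlation-based functional connectivity estimate carries no information about the direction of influence between A and B.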

### **3. Information theory and processing**

#### **3.1. Fundamentals and definitions: Entropy, Kullback-Leibler divergence, and mutual information**

A major objective of neuroscience is to understand how the brain processes information. Here we provide the probabilistic notation and information-theoretic definitions that will be used in this section (definitions are denoted with ≜). We define *x*<sup>*n*</sup> ≜ *x*<sub>1</sub><sup>*n*</sup> = (*x*<sub>1</sub>, …, *x*<sub>*n*</sub>). More generally, for integers *i* ≤ *j*, *x*<sub>*i*</sub><sup>*j*</sup> ≜ (*x*<sub>*i*</sub>, …, *x*<sub>*j*</sub>). For a random variable *X*, 𝒳 denotes the measurable space in which *X* takes values, and *x* ∈ 𝒳 are specific realizations. The probability mass function (PMF) of a discrete random variable *X* is defined as P<sub>*X*</sub>(*x*) ≜ *P*(*X* = *x*), and the probability density function (PDF) of a continuous random variable is denoted as *p*<sub>*X*</sub>(*x*).

The *information* or *surprise* [7] of an outcome *x* of a discrete random variable *X* is defined as:

$$\log \frac{1}{\mathrm{P}_X(x)} = -\log P(X = x).$$


The choice of logarithmic base determines the unit. The most common unit of information is the *bit*, based on the binary logarithm. The information is zero for a fully predicted outcome *x* with *P*(*X* = *x*) = 1, and it increases as *P*(*X* = *x*) decreases.

The *entropy* of a discrete random variable *X* is defined to be the average information from observing this variable:

$$\mathrm{H}(X) = -\sum_{x \in \mathcal{X}} \mathrm{P}_X(x) \log \mathrm{P}_X(x).$$

Entropy is a measure of the randomness or uncertainty of a distribution: the more random the distribution, the more information is gathered by observing its value. Specifically, entropy is zero for a deterministic variable and is maximized for a uniform distribution. The conditional entropy is given by:

$$\mathrm{H}(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \mathrm{P}_{X,Y}(x, y) \log \mathrm{P}_{Y \mid X}(y \mid x).$$

The chain rule for entropy is

$$\mathrm{H}(X^n) = \sum_{i=1}^n \mathrm{H}(X_i \mid X^{i-1}).$$
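The definitions of entropy and conditional entropy, and the chain rule for the two-variable case, H(*X*, *Y*) = H(*X*) + H(*Y* | *X*), can be checked numerically. The sketch below is a minimal illustration in bits; the joint PMF is a hypothetical example, not data from the chapter.

```python
import math


def entropy(pmf):
    """Shannon entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal_x(joint):
    """Marginal PMF of X from a joint PMF keyed by (x, y)."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px


def conditional_entropy(joint):
    """H(Y | X) = -sum_{x,y} P(x, y) log2 P(y | x), with P(y | x) = P(x, y) / P(x)."""
    px = marginal_x(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)


# Hypothetical joint PMF of two binary variables (illustration only).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

h_x = entropy(marginal_x(joint))
h_y_given_x = conditional_entropy(joint)
h_xy = entropy(joint)

print(f"H(X)     = {h_x:.4f} bits")
print(f"H(Y | X) = {h_y_given_x:.4f} bits")
print(f"H(X, Y)  = {h_xy:.4f} bits")
```

The joint entropy equals the sum of the first two quantities up to floating-point error, which is the chain rule with *n* = 2.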

The *Kullback-Leibler* (KL) *divergence* (also called the relative entropy) between two probability distributions P and Q on 𝒳 is defined as the expected logarithmic difference between them, with the expectation taken under P:

$$\mathrm{D}(\mathrm{P} \parallel \mathrm{Q}) \triangleq \mathrm{E}_{\mathrm{P}}\left[\log \frac{\mathrm{P}(X)}{\mathrm{Q}(X)}\right] = \sum_{x \in \mathcal{X}} \mathrm{P}(x) \log \frac{\mathrm{P}(x)}{\mathrm{Q}(x)} \ge 0.$$

It measures the difference between two distributions but is not symmetric: in general, D(P∥Q) ≠ D(Q∥P). It is therefore not a true distance.
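The asymmetry of the KL divergence is easy to see on a small example. The sketch below computes D(P∥Q) and D(Q∥P) in bits for two hypothetical distributions over three outcomes; the numbers are chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in bits for PMFs given as {outcome: probability}.

    Returns infinity if Q assigns zero probability to an outcome that
    P considers possible.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # terms with P(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total


# Hypothetical distributions over three outcomes (illustration only).
p = {"a": 0.5, "b": 0.4, "c": 0.1}
q = {"a": 0.2, "b": 0.3, "c": 0.5}

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(f"D(P || Q) = {d_pq:.4f} bits")
print(f"D(Q || P) = {d_qp:.4f} bits")
```

Both values are non-negative, they differ from each other, and the divergence of a distribution from itself is zero.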

The *mutual information* of two discrete random variables *X* and *Y* is defined as:

$$\begin{aligned} \mathrm{I}(X;Y) &\triangleq \mathrm{D}\big(\mathrm{P}_{XY}(\cdot,\cdot) \parallel \mathrm{P}_X(\cdot)\mathrm{P}_Y(\cdot)\big) = \mathrm{E}_{\mathrm{P}_{XY}}\left[\log \frac{\mathrm{P}_{Y \mid X}(Y \mid X)}{\mathrm{P}_Y(Y)}\right] \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \mathrm{P}_{X,Y}(x, y) \log \frac{\mathrm{P}_{Y \mid X}(y \mid x)}{\mathrm{P}_Y(y)} = \mathrm{H}(Y) - \mathrm{H}(Y \mid X). \end{aligned}$$

Intuitively, mutual information measures the information that *X* and *Y* share: it measures how much knowing one of the variables reduces uncertainty about the other. The mutual information is symmetric: I(*X*; *Y*) = I(*Y*; *X*). The chain rule for mutual information is

$$\mathrm{I}(X^n; Y^n) = \sum_{i=1}^n \mathrm{I}(Y_i; X^n \mid Y^{i-1}),$$

with the conditional mutual information given by:

$$\mathrm{I}(X; Y \mid Z) = \mathrm{E}_{\mathrm{P}_{XYZ}}\left[\log \frac{\mathrm{P}_{Y \mid X,Z}(Y \mid X, Z)}{\mathrm{P}_{Y \mid Z}(Y \mid Z)}\right].$$
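The mutual information identities stated above, namely symmetry and I(*X*; *Y*) = H(*Y*) - H(*Y* | *X*), can be verified on a small discrete example. The sketch below works in bits and uses a hypothetical joint PMF; the helper names are introduced here for illustration only.

```python
import math


def marginals(joint):
    """Marginal PMFs of X and Y from a joint PMF keyed by (x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def entropy(pmf):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def conditional_entropy(joint):
    """H(Y | X) in bits."""
    px, _ = marginals(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)


def mutual_information(joint):
    """I(X; Y) in bits, computed as D(P_XY || P_X P_Y)."""
    px, py = marginals(joint)
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def swap(joint):
    """Joint PMF of (Y, X) obtained by exchanging the roles of X and Y."""
    return {(y, x): p for (x, y), p in joint.items()}


# Hypothetical joint PMF of two binary variables (illustration only).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
# A product distribution, for which X and Y are independent.
independent = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.125, (1, 1): 0.375}

_, py = marginals(joint)
mi_xy = mutual_information(joint)
mi_yx = mutual_information(swap(joint))
reduction = entropy(py) - conditional_entropy(joint)

print(f"I(X; Y)         = {mi_xy:.4f} bits")
print(f"I(Y; X)         = {mi_yx:.4f} bits")
print(f"H(Y) - H(Y | X) = {reduction:.4f} bits")
print(f"I for independent variables = {mutual_information(independent):.4f} bits")
```

Because the result is identical when the roles of *X* and *Y* are exchanged, mutual information by itself cannot indicate the direction of an interaction.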

#### **3.2. Causal inference: Granger causality, transfer entropy, and directed information**

$$\mathrm{P}_{Y_{n+1} \mid Y^n, X^n}(y_{n+1} \mid y^n, x^n) = \mathrm{P}_{Y_{n+1} \mid Y_{n-J+1}^n, X_{n-K+1}^n}(y_{n+1} \mid y_{n-J+1}^n, x_{n-K+1}^n),$$

where *J* and *K* are the orders (memory) of the Markov processes for Y and X, respectively. The transfer entropy is defined as conditional mutual information [15]:

$$\mathrm{T}_{X \to Y}(i) = \mathrm{I}(Y_{i+1}; X_{i-K+1}^i \mid Y_{i-J+1}^i). \tag{2}$$

Transfer entropy is asymmetric and based on transition probabilities; it thus provides directional and dynamic information. The key feature of this information-theoretic functional for identifying causality is that, theoretically, it does not assume any particular model for the interaction between the two time series. Transfer entropy is therefore sensitive to correlations of all orders, which makes it better suited to exploratory analyses than Granger causality or other model-based approaches. This is especially advantageous if unknown non-linear interactions are embedded in the systems to be discovered. It is shown in [17] that for Gaussian variables, Granger causality and transfer entropy are equivalent, which bridges autoregressive and information-theoretic methods in causal inference. A limitation of transfer entropy is that its performance depends on the estimation of transition probabilities; this requires order selection for both the driven and the driving systems.

*Directed information*, proposed by Marko [18] and re-formalized by others [19, 20], is more general for quantifying directional dependencies, and has recently attracted attention [10, 21]. It is modified from the mutual information to capture causal influences, and is denoted as I(X→Y) for two stochastic processes X and Y. For vectors *X*<sup>*n*</sup> and *Y*<sup>*n*</sup>, the mutual information can be shown to be:

$$\mathrm{I}(X^n; Y^n) = \sum_{i=1}^n \mathrm{I}(X^n; Y_i \mid Y^{i-1}) = \mathrm{E}\left[\sum_{i=1}^n \log \frac{\mathrm{P}_{Y_i \mid Y^{i-1}, X^n}(Y_i \mid Y^{i-1}, X^n)}{\mathrm{P}_{Y_i \mid Y^{i-1}}(Y_i \mid Y^{i-1})}\right] = \sum_{i=1}^n \mathrm{D}\big(\mathrm{P}_{Y_i \mid Y^{i-1}, X^n} \parallel \mathrm{P}_{Y_i \mid Y^{i-1}}\big). \tag{3}$$

The mutual information is symmetric and only measures the correlation or statistical dependence between random processes; it cannot identify causal directionality. The directed information is defined as:

$$\mathrm{I}(X^n \to Y^n) \triangleq \sum_{i=1}^n \mathrm{I}(X^i; Y_i \mid Y^{i-1}). \tag{4}$$

*Granger Causality:* A widely established technique for extracting causal relations or effective connectivity from data is *Granger causality* [8-11]. Its principle is based on the concept of cross prediction: if incorporating the past values of time series X improves the prediction of future values of time series Y, then X is said to have a causal influence on Y [8]. Exploring Granger causality is closely related to the analysis of vector autoregressive (VAR) models, since it compares the variances of the error terms of autoregressive models. Using terminology introduced in [10], let X=(*Xi* :*i* ≥1) and Y=(*Yi* :*i* ≥1) be the two time series for determining whether X causally influences Y. Y is first modeled as a univariate autoregressive series with error term *Vi* , and then modeled again using the X series as causal information. That is:

$$\begin{aligned} Y\_i &= \sum\_{j=1}^p a\_j Y\_{i-j} + V\_i , \\ Y\_i &= \sum\_{j=1}^p \left( b\_j Y\_{i-j} + c\_j X\_{i-j} \right) + W\_i , \end{aligned} \tag{1}$$

where *Wi* in Eq. (1) is the new error term. The number of time lags, or model order *p*, can be fixed a priori or chosen by minimizing a criterion (for example, the Akaike information criterion [12] or the Bayesian information criterion [13]) that balances the variance accounted for by the model against the number of coefficients to be estimated. Granger causality is then defined as the log ratio of the variances of the two error terms:

$$G\_{X \to Y} \triangleq \log \frac{\text{var}(V)}{\text{var}(W)} .$$

If including X in the model decreases the variance of the error term, then *G*X→Y >0. Typically, *G*X→Y and *G*Y→X are compared, and the direction with the larger value is taken as the causal direction. The directed transfer function transforms the autoregressive model into the spectral domain [14], and uses multivariate models, rather than univariate and bivariate models for each time series, so that the full covariance matrix is taken into account for improved modeling. Granger causality, the directed transfer function, and their derivative methods are usually fast to calculate and easy to interpret. Despite these advantages, because they rely on sample-variance computation, they may not be statistically suitable for inference questions involving neural spike train data, which are often modeled as point processes.
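The two regressions in Eq. (1) and the resulting variance ratio can be sketched numerically. The following is a minimal illustration rather than the procedure of any cited study. It assumes ordinary least-squares fits with NumPy, a fixed model order *p*, zero-mean data with no intercept term, and a simulated pair of series in which X drives Y. The function names `lagged` and `granger_causality` are hypothetical.

```python
import numpy as np

def lagged(series, p):
    """Matrix whose row for time i holds series[i-1], ..., series[i-p] (i = p..n-1)."""
    n = len(series)
    return np.column_stack([series[p - j:n - j] for j in range(1, p + 1)])

def granger_causality(x, y, p):
    """Estimate G_{X->Y} = log(var(V) / var(W)) from the two fits in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[p:]
    restricted = lagged(y, p)                      # past of Y only
    full = np.hstack([restricted, lagged(x, p)])   # past of Y and past of X
    # Residuals V (restricted model) and W (full model) from least squares.
    v = target - restricted @ np.linalg.lstsq(restricted, target, rcond=None)[0]
    w = target - full @ np.linalg.lstsq(full, target, rcond=None)[0]
    return float(np.log(np.var(v) / np.var(w)))

# Simulated example (an assumption for illustration): X is white noise and
# Y depends on its own past and on the past of X.
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.5 * y[i - 1] + 0.8 * x[i - 1] + 0.1 * rng.standard_normal()

g_xy = granger_causality(x, y, p=2)   # expected to be clearly positive
g_yx = granger_causality(y, x, p=2)   # expected to be close to zero
```

In this simulation the estimate in the X→Y direction should dominate the reverse one, which is the comparison described above. In practice the order *p* would be chosen by a criterion such as AIC or BIC rather than fixed by hand.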

*Transfer entropy* is a measure of effective connectivity based on information theory [15, 16]. It does not require a model of interaction, is inherently non-linear, and thus provides a reasonable basis to precisely formulate causal hypotheses. Assume that the two time series X=(*Xi* :*i* ≥1) and Y=(*Yi* :*i* ≥1) can be approximated by Markov processes:

$$P\_{Y\_{n+1} \mid Y^n, X^n}(y\_{n+1} \mid y^n, x^n) = P\_{Y\_{n+1} \mid Y\_{n-J+1}^n, X\_{n-K+1}^n}(y\_{n+1} \mid y\_{n-J+1}^n, x\_{n-K+1}^n) ,$$

where *J* and *K* are the orders (memory) of the Markov processes for Y and X, respectively. The transfer entropy is defined as a conditional mutual information [15]:

$$T\_{X \to Y}(i) = I(Y\_{i+1}; X\_{i-K+1}^i \mid Y\_{i-J+1}^i) , \tag{2}$$

with the conditional mutual information given as follows:

$$I(X; Y \mid Z) = E\_{P\_{XYZ}}\left[\log \frac{P\_{Y \mid X,Z}(Y \mid X,Z)}{P\_{Y \mid Z}(Y \mid Z)}\right] .$$

Transfer entropy is asymmetric and based on transition probabilities; it thus provides directional and dynamic information. The key feature of this information-theoretic functional for identifying causality is that, theoretically, it does not assume any particular model for the interaction between the two time series. Transfer entropy is therefore sensitive to correlations of all orders, which makes it better suited to exploratory analyses than Granger causality or other model-based approaches. This is especially advantageous if unknown non-linear interactions are embedded in the systems under study. It is shown in [17] that for Gaussian variables, Granger causality and transfer entropy are equivalent, which bridges autoregressive and information-theoretic methods in causal inference. One issue with transfer entropy is that its performance depends on the estimation of the transition probabilities, which requires order selection for both the driven and driving systems.
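To make the dependence on estimated transition probabilities concrete, the sketch below computes a plug-in (empirical frequency) estimate of Eq. (2) for discrete-valued sequences, averaged over time under a stationarity assumption. It is only an illustration: the estimator choice, the function name `transfer_entropy`, and the toy data (Y copies X with a one-step delay) are assumptions, and continuous signals such as MEG or fMRI would need binning or a different estimator.

```python
from collections import Counter
from math import log2
import random

def transfer_entropy(x, y, K=1, J=1):
    """Plug-in estimate (in bits) of T_{X->Y} with memory K for X and J for Y."""
    m = max(K, J)
    triples = Counter()  # counts of (y_next, y_past, x_past)
    for i in range(m - 1, len(y) - 1):
        y_past = tuple(y[i - J + 1:i + 1])
        x_past = tuple(x[i - K + 1:i + 1])
        triples[(y[i + 1], y_past, x_past)] += 1
    total = sum(triples.values())

    # Marginal counts needed for the two conditional probabilities.
    c_past = Counter()      # (y_past, x_past)
    c_next_y = Counter()    # (y_next, y_past)
    c_y = Counter()         # y_past
    for (y_next, y_past, x_past), c in triples.items():
        c_past[(y_past, x_past)] += c
        c_next_y[(y_next, y_past)] += c
        c_y[y_past] += c

    te = 0.0
    for (y_next, y_past, x_past), c in triples.items():
        p_full = c / c_past[(y_past, x_past)]           # P(y_next | y_past, x_past)
        p_self = c_next_y[(y_next, y_past)] / c_y[y_past]  # P(y_next | y_past)
        te += (c / total) * log2(p_full / p_self)
    return te

# Toy data (an assumption for illustration): X is a fair coin sequence and
# Y reproduces X one step later, so information flows from X to Y only.
random.seed(0)
x = [random.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]

te_xy = transfer_entropy(x, y)  # should be near 1 bit
te_yx = transfer_entropy(y, x)  # should be near 0 bits
```

The asymmetry between the two estimates reflects the directional character of the measure. With larger *J* and *K* the number of histogram cells grows quickly, which is why order selection and the amount of available data matter for this kind of estimate.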

*Directed information*, proposed by Marko [18] and re-formalized by others [19, 20], is more general for quantifying directional dependencies, and has recently attracted attention [10, 21]. It is modified from the mutual information to capture causal influences, denoted as I(X→Y) for two stochastic processes X and Y. For vectors *X <sup>n</sup>* and *Y <sup>n</sup>*, the mutual information can be shown to be:

$$\begin{split} I\left(X^n; Y^n\right) &= \sum\_{i=1}^n I\left(X^n; Y\_i \mid Y^{i-1}\right) \\ &= E\left[\sum\_{i=1}^n \log \frac{P\_{Y\_i \mid Y^{i-1}, X^n}\left(Y\_i \mid Y^{i-1}, X^n\right)}{P\_{Y\_i \mid Y^{i-1}}\left(Y\_i \mid Y^{i-1}\right)}\right] \\ &= \sum\_{i=1}^n D(P\_{Y\_i \mid Y^{i-1}, X^n} \parallel P\_{Y\_i \mid Y^{i-1}}) \ . \end{split} \tag{3}$$

The mutual information is symmetric and only measures the correlation or statistical dependence between random processes, but cannot identify causal directionality. The directed information is defined as:

$$I\left(X^n \to Y^n\right) \triangleq \sum\_{i=1}^n I\left(X^i; Y\_i \mid Y^{i-1}\right) \tag{4}$$

$$= E\left[\sum\_{i=1}^n \log \frac{P\_{Y\_i \mid Y^{i-1}, X^i}\left(Y\_i \mid Y^{i-1}, X^i\right)}{P\_{Y\_i \mid Y^{i-1}}\left(Y\_i \mid Y^{i-1}\right)}\right] \tag{5}$$

$$= \sum\_{i=1}^n D(P\_{Y\_i \mid Y^{i-1}, X^i} \parallel P\_{Y\_i \mid Y^{i-1}}) \ . \tag{6}$$

It can also be written as follows, using the chain rule for entropy:

$$I\left(X^n \to Y^n\right) = H\left(Y^n\right) - H\left(Y^n \parallel X^n\right) ,$$

where H(*Y <sup>n</sup>* ∥ *X <sup>n</sup>*) is the causally conditioned entropy given by [22]:

$$H\left(Y^n \parallel X^n\right) \triangleq \sum\_{i=1}^n H\left(Y\_i \mid Y^{i-1}, X^i\right) .$$

The difference between the mutual information in Eq. (3) and the directed information in Eq. (5) is that *X <sup>n</sup>* is replaced by *X <sup>i</sup>*, so the causal influence of X on the current *Yi* at each time *i* is captured by the directed information. In contrast to Granger causality, directed information is a sum of divergences (Eq. (6)) and is well defined for any joint probability distribution, including those of point processes. In addition, directed information is not tied to any particular statistical model; it operates on log-likelihood ratios, and is thus more flexible and can be applied directly to a variety of modalities such as neural spike trains. Calculating the mutual information in bits determines a degree of correlation (or statistical interdependence); similarly, calculating the directed information quantifies a degree of causation in bits. Amblard et al. [20] demonstrated that, for linear Gaussian processes, directed information and Granger causality are equivalent. Note that the transfer entropy defined in Eq. (2) is part of the sum terms in Eq. (4) for directed information. Amblard et al. also proved that, for a stationary process, the directed information rate can be decomposed into two parts: one is equivalent to a particular instance of the transfer entropy, and the other to the instantaneous information exchange rate. In fact, it has recently been shown in [23] that transfer entropy is equal to the upper bound of the directed information rate.
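As a small worked illustration (added here, and relying only on Eqs. (3) and (4) together with the chain rule for mutual information), the two quantities can be expanded for *n* = 2:

$$\begin{aligned} I\left(X^2; Y^2\right) &= I\left(X^2; Y\_1\right) + I\left(X^2; Y\_2 \mid Y\_1\right) , \\ I\left(X^2 \to Y^2\right) &= I\left(X\_1; Y\_1\right) + I\left(X^2; Y\_2 \mid Y\_1\right) , \\ I\left(X^2; Y^2\right) - I\left(X^2 \to Y^2\right) &= I\left(X^2; Y\_1\right) - I\left(X\_1; Y\_1\right) = I\left(X\_2; Y\_1 \mid X\_1\right) \geq 0 . \end{aligned}$$

The two expansions differ only in the first term. The remainder I(*X*2; *Y*1 | *X*1) measures how much the later input *X*2 depends on the earlier output *Y*1, that is, a dependence running from Y back to X, which the directed information from X to Y leaves out.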

#### **3.3. Applications and validity in neuroscience and aging research**

*Granger Causality:* Li et al. [24] performed a longitudinal MRI study to examine the gray matter changes due to Alzheimer's disease (AD) progression. A standard voxel-based morphometry method was used to localize the abnormal brain regions, and the absolute atrophy rate in these regions was calculated with a robust regression method. The hippocampus and middle temporal gyrus (MTG) were identified as the primary foci of atrophy. A model-based Granger causality approach was developed to examine the cause–effect relationship over time between these regions based on gray matter concentration. It is shown that the primary pathological foci are in the hippocampus and entorhinal cortex in the earlier stages of AD, and appear to subsume the MTG subsequently. The causality results indicate that there are larger differences between AD patients and age-matched healthy controls in the MTG but little difference in the hippocampus, which implies that local pathology in the MTG is the predominant progressive abnormality during intermediate stages of AD development. In [25], the authors addressed ongoing issues regarding how the default-mode network (DMN) hubs, including posterior cingulate cortex (PCC), medial prefrontal cortex (MPFC) and inferior parietal cortex (IPC), interact with each other, and the altered pattern of hubs in AD. Causal influences were examined between any pair of nodes within the DMN using Granger causality analysis and graph-theoretic methods on resting-state fMRI of 12 young subjects, 16 old normal controls and 15 AD patients. The results support the hub configuration of the DMN from the perspective of causal relationships, and reveal an abnormal pattern of the DMN hubs in AD. Findings from young subjects give additional evidence for the role of PCC/MPFC/IPC acting as hubs in the DMN. Compared with old controls, MPFC and IPC lost their roles as hubs due to the obvious disruption of causal interactions, and PCC was preserved as the only hub with significant causal relations with all other nodes.
Deshpande et al. [11] proposed combining multivariate Granger causality analysis with temporal down-sampling of fMRI time series to investigate causal brain networks and their dynamics. The method was applied to study epoch-to-epoch changes in a hand-gripping muscle fatigue experiment. Causal influences between the activated regions were analyzed by applying the directed transfer function analysis of multivariate Granger causality with the integrated epoch response as the input, to account for the effects of several relevant regions simultaneously. The authors separately modeled the early, middle, and late periods of fatigue. The results demonstrate the temporal evolution of the network and reveal that motor fatigue leads to a disconnection in the associated neural network.


*Transfer Entropy and Directed Information:* Vicente et al. [16] investigated the applicability of transfer entropy as a measure for electrophysiological data from simulations and MEG recordings in a motor task. Specifically, they demonstrated that transfer entropy improved the identification of effective connectivity for non-linear interactions, and for sensor-level MEG signals where linear approaches are hampered by signal cross-talk due to volume conduction. Utilizing transfer entropy at the source level, Wibral et al. [26] analyzed MEG data from an auditory short-term memory experiment and found that changes in the network between different task types can be detected. Areas prominently involved in the changes include the left temporal pole and cerebellum, which have previously been implicated in auditory short-term or working memory. Amblard and Michel [20] extracted Granger causality graphs using directed information, and such techniques were shown to be necessary to analyze the structure of systems with feedback in general, and neural systems specifically. Quinn et al. [10] proposed a nonlinear, robust extension of the linear Granger tools, also based on directed information. They used point process models of neural spike trains, performed parameter and model order selection with minimum description length, and applied the analysis to infer the interactions and dynamics of neural ensembles in the primary motor cortex (MI) of macaque monkeys.

*Multi-Scale Information and Multi-Scale Entropy:* There is increasing evidence that brain signals are expressed with variability of the neural network dynamics [27]. Effective characterization of this variability in complex systems can bring new insight to empirical studies. A number of tools have recently been developed, integrating information theory, nonlinear dynamics, and complex systems, to support empirical research and unravel the principles of brain dynamics [28]. In particular, approximate entropy and sample entropy were proposed to quantify the complexity of short and noisy time series, with sample entropy introduced later to correct the bias in approximate entropy. Higher values of sample entropy are associated with signals having more complexity and less regular patterns, while smaller values indicate more regularity. Note that signaling in the brain is not instantaneous, and the propagation of neural activity takes time. Utilizing multiscale entropy (MSE) is a reasonable strategy to control for the embedding delay of the brain system. This can be achieved by down-sampling the original time series by factors of 2, 4, 8, etc., which alleviates the effects of linear correlations between consecutive samples. A similar idea was previously introduced in [29], using a complexity measure based on the Shannon entropy at various scales. Some studies used the approximate and sample entropy statistics to quantify brain signal variability for both electrode measurements [30] and source dynamics [31]. In [32], to test the hypothesis that the complexity of BOLD activity is reduced with aging and is correlated with cognitive performance in the elderly, the authors employed MSE analysis and investigated appropriate parameters for the MSE calculation. Compared with younger subjects, the older group had the most significant reductions in MSE of BOLD signals in the posterior cingulate gyrus and hippocampal cortex. MSE of BOLD signals from DMN areas was found to be positively correlated with major cognitive functions including attention, short-term memory and language. The MSE approach was also applied to reveal the differences in EEG signals between normal subjects and patients with AD.
Resting-state EEG was utilized in [33], with MSE curves (scales 1-16) averaged over channels and individuals for three groups: normal population, subjects with mild cognitive impairment (MCI), and AD patients. The MSE curves of the three groups share some common features, i.e. the sample entropy reached its maximum at scales 5-7 and then gradually decreased. Severe AD patients had significantly lower sample entropy values than the normal group at scales 2-16. The maximal difference in complexity was observed at scales 6-8. Between MCI and normal subjects, the main difference in the MSE curve was the shift of the peak in sample entropy toward coarse timescales for the MCI group.
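The multiscale entropy computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the parameterization used in [32] or [33]: the series is coarse-grained by averaging non-overlapping windows (one common way to realize the down-sampling step), sample entropy uses template length *m* = 2 with the Chebyshev distance, and the tolerance is fixed at 0.15 times the standard deviation of the original series. The function names are hypothetical, and the pairwise-distance computation is quadratic in series length, so it is meant for short series only.

```python
import numpy as np

def coarse_grain(x, scale):
    """Average consecutive non-overlapping windows of length `scale`."""
    n = len(x) // scale
    return np.asarray(x[:n * scale], dtype=float).reshape(n, scale).mean(axis=1)

def sample_entropy(x, m=2, r=0.2):
    """SampEn = -ln(A / B): B and A count template pairs of length m and
    m + 1 whose Chebyshev distance is at most r (self-matches excluded)."""
    x = np.asarray(x, dtype=float)
    n = len(x)

    def count(length):
        # Use the same n - m starting points for both template lengths.
        t = np.array([x[i:i + length] for i in range(n - m)])
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
        return (np.sum(d <= r) - len(t)) / 2  # drop diagonal, count pairs once

    b, a = count(m), count(m + 1)
    return float(-np.log(a / b)) if a > 0 and b > 0 else float("inf")

def multiscale_entropy(x, scales, m=2, r_factor=0.15):
    """Sample entropy of the coarse-grained series at each scale."""
    r = r_factor * np.std(x)  # tolerance fixed from the original series
    return [sample_entropy(coarse_grain(x, s), m, r) for s in scales]

# Illustrative signals (assumptions): white noise and a regular sine wave.
rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)
sine = np.sin(2 * np.pi * np.arange(1000) / 50)

mse_noise = multiscale_entropy(noise, scales=(1, 2, 4, 8))
se_sine = sample_entropy(sine, m=2, r=0.15 * np.std(sine))
```

For uncorrelated noise the entropy values are expected to fall as the scale increases, while the regular sine wave should give a much lower value than the noise at scale 1, in line with the interpretation of sample entropy given above.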

### **4. Probabilistic modeling and Bayesian inference for neural computation, cognition, and behavior**

#### **4.1. Bayes' theorem and approximate inference**

A generic problem in science is: given the observed data D and some knowledge of the underlying data-generating mechanism, what can be said about the variable *θ*? Based on Bayes' theorem, the quantity of interest is:

$$p(\theta \mid \mathrm{D}) = \frac{p(\mathrm{D} \mid \theta)\, p(\theta)}{p(\mathrm{D})} = \frac{p(\mathrm{D} \mid \theta)\, p(\theta)}{\int p(\mathrm{D} \mid \theta)\, p(\theta)\, d\theta} .$$

That is: from a *generative model p*(D | *θ*) of the dataset and a *prior* belief *p*(*θ*) about which variable values are appropriate, we can infer the *posterior* distribution *p*(*θ* | D) of the variable in light of the observed data. When a particular observation is made, *p*(D | *θ*) is called the likelihood. The *maximum a posteriori* (MAP) estimate maximizes the posterior, *θ\** =argmax*<sup>θ</sup> p*(*θ* | D). For a flat prior, i.e. for *p*(*θ*) being a constant, the MAP solution is equivalent to the *maximum likelihood* solution, with *θ* maximizing the likelihood *p*(D | *θ*) of the model generating the observed data. The MAP estimate can incorporate our prior knowledge about the variable, but it is still a point estimate. A full Bayesian estimate gives the entire probability distribution or density of the posterior *p*(*θ* | D). For example, when the distribution is wide or even has multiple peaks, the corresponding outputs can be averaged to make a more conservative estimate instead of just using a single point estimate.

A key algorithmic challenge for Bayesian inference is that, for many models of interest, the above posterior is analytically intractable because of the integral in the denominator. We therefore resort to approximate inference, where the approaches tend to fall into one of the following two classes: 1) *Monte Carlo methods* [34] provide approximate answers whose accuracy depends on the number of generated samples. Importance sampling is a simple Monte Carlo approximation, while Markov chain Monte Carlo (MCMC) is more efficient and popular. MCMC generates each sample by making a random change to the preceding sample. We can thus think of an MCMC algorithm as being in a particular current state, which specifies a value for every variable, and generating a next state by making random changes to the current state. Special cases of MCMC include Gibbs sampling and the Metropolis-Hastings algorithm. 2) *Variational approximations* [35, 36] are a family of deterministic techniques for approximate inference on the parameters of complex statistical models. Compared with MCMC, they are much faster, especially for large models, but limited in their approximation accuracy. The mean-field approximation is the simplest example, which exploits the law of large numbers to approximate large sums of random variables by their means. Variational parameters are introduced and iteratively updated so as to minimize the KL divergence between the approximate and true probability distributions. Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood. More sophisticated methods are possible, which give tighter lower (and upper) bounds.

#### **4.2. Neuroimaging data analyses using Bayesian approaches**

Here I focus on Bayesian inference in fMRI data analysis, mainly for activation detection and hemodynamic response function (HRF) estimation, although the key concepts of Bayesian methods have been applied to structural MRI images as well [37-39]. Graphical-model-based Bayesian and dynamic Bayesian networks and their applications will be discussed in Section 5.

Bayesian inference has taken fMRI analysis research into an area that classical frequentist statistics has difficulty addressing because of some challenging issues associated with the data. For example, the fMRI response to stimuli is not instantaneous, but lagged and damped by the hemodynamic response. Estimating HRFs has gained increasing interest, since it provides not only a deep insight into the underlying dynamics of the human brain, but also a basis for making inferences about brain activation regions. How do we account for the HRF

the principles of brain dynamics [28]. In particular, approximate entropy and sample entropy were proposed to quantify the complexity of short and noisy time series, and with later correcting the bias effect in approximate entropy. Higher values of sample entropy are associated with the signals having more complexity and less regular patterns, while smaller values indicate less irregularity in their representation. Note that signaling in the brain is not instantaneous, and neural activity propagation takes time. Utilizing multiscale entropy (MSE) is a reasonable strategy to control for the embedding delay of the brain system. This can be achieved through down-sampling the original time series by factors 2, 4, 8, etc., which, would alleviate the effects of linear correlations between consecutive samples. A similar idea was previously introduced in [29], using a complexity measure based on the Shannon entropy at various scales. Some studies used the approximate and sample entropy statistics to quantify the brain signal variability for both the electrode measurements [30] and source dynamics [31]. In [32], in order to test the hypothesis that complexity of BOLD activity is reduced with aging and is correlated with cognitive performance in the elderly, the authors employed the MSE analysis, and investigated appropriate parameters for MSE calculation. Compared with younger subjects, the older group had the most significant reductions in MSE of BOLD signals in posterior cingulate gyrus and hippocampal cortex. MSE of BOLD signals from DMN areas were found to be positively correlated with major cognitive functions including attention, short-term memory and language, etc. The MSE approach was also applied to reveal the differences in the EEG signals, between normal subjects and patients with AD. 
The resting-state EEG was utilized in [33] with MSE curves (scales 1-16) averaged over channels and individuals for three groups: normal population, subjects with mild cognitive impairment (MCI), and AD patients. The three groups have some common features for the MSE curves, i.e. the sample entropy reached its maximum at scales 5-7 and then gradually decreased. Severe AD patients had a significantly lower level of sample entropy values than that of the normal group at scale 2-16. The maximal difference in the complexity was observed at scales 6-8. Between MCI and normal subjects, the main difference in the MSE curve was the shift of

190 Functional Brain Mapping and the Endeavor to Understand the Working Brain

the peak in sample entropy toward coarse timescales for the MCI group.

**cognition, and behavior**

*<sup>p</sup>*(*<sup>θ</sup>* <sup>|</sup> <sup>D</sup>) <sup>=</sup> *<sup>p</sup>*(<sup>D</sup> <sup>|</sup> *<sup>θ</sup>*) *<sup>p</sup>*(*θ*)

**4.1. Bayes' theorem and approximate inference**

on Bayes' theorem, our interest is the quantity:

*<sup>p</sup>*(D) <sup>=</sup> *<sup>p</sup>*(<sup>D</sup> <sup>|</sup> *<sup>θ</sup>*) *<sup>p</sup>*(*θ*) *∫*

*<sup>θ</sup> <sup>p</sup>*(<sup>D</sup> <sup>|</sup> *<sup>θ</sup>*) *<sup>p</sup>*(*θ*)*d<sup>θ</sup>* .

**4. Probabilistic modeling and Bayesian inference for neural computation,**

A generic problem in science is: given the observed data D and some knowledge of the underlying data generating mechanism, can you tell something about the variable *θ*? Based

That is: from a *generative model p*(D | *θ*) of dataset and a *prior* belief *p*(*θ*) about which variable values are appropriate, we can infer the *posterior* distribution *p*(*θ* | D) of the variable in light A key algorithm challenge for Bayesian inference is for many models of interest, analytical tractability of the above posterior is elusive due to the integral in the denominator. We therefore resort to approximation inference, where the approaches tend to fall into one of following two classes: 1) *Monte Carlo methods* [34] provide approximate answers with accuracy depending on the number of generated samples. Importance sampling is a simple Monte Carlo approxima‐ tion while Markov chain Monte Carlo (MCMC) is more efficient and popular. MCMC gener‐ ates each sample by making a random change to the preceding sample. So we can think of an MCMC algorithm as being in a particular current state specifying a value for every variable and generating a next state by making random changes to the current state. Special cases of MCMC include Gibbs sampling and the Metropolis-Hasting algorithm. 2) *Variational approxi‐ mations* [35, 36] are a series of deterministic techniques that make approximate inference for the parameters in complex statistical models. Compared with MCMC, they are much faster, especially for large models, but limited in their approximation accuracy. The mean-field approximation is a simplest example, which exploits the law of large numbers to approximate large sums of random variable by their means. Variational parameters are introduced and iteratively updated so as to minimize the KL divergence between the approximate and true probability distributions. Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood. More sophisticated methods are possible, which give tighter lower (and upper) bounds.
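To make the sampling idea concrete, here is a minimal Python sketch of a random-walk Metropolis-Hastings sampler. The coin-flip model, the Beta(2, 2) prior, the data (7 successes in 10 trials), and all function names are illustrative assumptions rather than anything taken from the chapter; the conjugate model is chosen only because its exact posterior, Beta(9, 5), is known and can serve as a check.

```python
import math
import random

PRIOR_A, PRIOR_B = 2.0, 2.0  # hypothetical Beta prior parameters


def log_posterior(theta, heads, tails):
    """Unnormalized log posterior: Bernoulli likelihood times Beta prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return ((heads + PRIOR_A - 1) * math.log(theta)
            + (tails + PRIOR_B - 1) * math.log(1.0 - theta))


def metropolis_hastings(heads, tails, n_samples=20000, step=0.25, seed=0):
    """Each new state is a random change to the current state."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric proposal
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


heads, tails = 7, 3
samples = metropolis_hastings(heads, tails)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
ml_estimate = heads / (heads + tails)
map_estimate = (heads + PRIOR_A - 1) / (heads + tails + PRIOR_A + PRIOR_B - 2)

print(f"maximum likelihood: {ml_estimate:.3f}")
print(f"MAP estimate:       {map_estimate:.3f}")
print(f"posterior mean:     {posterior_mean:.3f}")
```

Because the proposal is symmetric, the acceptance rule only needs the ratio of unnormalized posteriors, so the intractable denominator *p*(D) never has to be computed.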

#### **4.2. Neuroimaging data analyses using Bayesian approaches**

Here I focus on Bayesian inference in fMRI data analysis, mainly for activation detection and hemodynamic response function (HRF) estimation, although the key concepts of Bayesian methods have been applied to structural MRI images as well [37-39]. Graphical-model-based Bayesian networks and dynamic Bayesian networks, together with their applications, are discussed in Section 5.

Bayesian inference has taken fMRI analysis research into areas that classical frequentist statistics has difficulty addressing, because of several challenging issues associated with the data. For example, the fMRI response to stimuli is not instantaneous, but lagged and damped by the hemodynamic response. Estimating HRFs has gained increasing interest, since it provides not only deep insight into the underlying dynamics of the human brain, but also a basis for making inferences about brain activation regions. How do we account for HRF properties such as nonlinearities and variability over different brain regions? fMRI is a 4-dimensional signal with spatial and temporal noise correlations [40, 41]. How do we incorporate these correlations into the data analysis while also considering the clustered pattern of activation? Moreover, group-level statistical inference on fMRI time series is usually needed to answer imaging-based scientific questions. How do we make valid, sensitive, and robust estimates of activation effects in populations of subjects? In fMRI analysis, we typically take the acquired data plus a generative model and extract pertinent information about the brain, i.e. we make inferences on the model and its parameters. Bayesian statistics requires a prior probabilistic belief about the model parameters to be specified. Such models are typically HRF models, spatial models, and hierarchical multi-subject models, which respectively address the three challenges listed above.


*HRF models* can incorporate biophysical or regularization priors for flexible HRF modeling across brain voxels and over subjects. Several similar Bayesian approaches in the literature use parametric HRFs with parameters describing features such as time-to-peak and undershoot size [42, 43]. Priors placed on these HRF parameters can ensure biological plausibility and result in increased sensitivity. An early example of more advanced HRF modeling is [44], which performs Bayesian inference on a biologically informed generative model. Regularization priors are introduced because such models have too many parameters to be inferred stably without regularization. Bayesian regularization places priors on the HRF parameters to encode the prior belief that the HRF is temporally smooth, without strong assumptions about the shape of the response function. Such priors are therefore suitable for exploratory approaches or for possibly abnormal HRFs. Regularization priors can also be realized through semi-parametric Bayesian HRF modeling [45-47]. In semi-parametric approaches, the HRF does not have a fixed parametric form but can take any shape, with one parameter describing the HRF size at each time point.

*Spatial models* for regularization using spatial Markov random field (MRF) priors to tackle spatial correlation in fMRI were proposed in [38, 48, 49], with MCMC numerical integration used for inference. To overcome the large computational cost of MCMC inference for spatial models, Variational Bayesian approaches were developed [50, 51] that avoid time-consuming numerical integration. Variational Bayes approximates the true posterior distribution with a posterior that factorizes over subsets of the model parameters, which yields update equations for the approximate posterior distributions in a much more efficient way than techniques such as MCMC. MRF-based work has recently been extended to more flexible spatial Gaussian process priors, allowing for the modeling of spatial non-stationarities [52] and the combination of spatial and non-spatial prior information [53]. The hyperparameters of the spatial priors can be estimated via Bayesian inference together with the rest of the model, which is a key advantage of fully Bayesian methods. Other spatial models include mixture models representing the active and non-active voxels [54-56] and a Bayesian wavelet approach [57]. The popular mixture modeling, however, can be hampered by the presence of structured noise artifacts (e.g. stimulus-correlated motion, spontaneous networks of activity) that violate the distributional assumptions. More sophisticated modeling of structured noise may be needed to render the distributional assumptions valid. Recent developments in nonparametric Bayes can also be used to handle the mixture modeling, though a massive number of model parameters need to be estimated. Infinite mixture models based on Dirichlet process priors [58] effectively involve an infinite number of distributions. An application of such methods to activation regions in fMRI is given in [59], using a spatial mixture model.


*Hierarchical models for group inference* were first proposed in [60]; they fit naturally into the Bayesian framework via a cascade of conditional probabilities that handles activation effects over multiple subjects. In classical fMRI analysis, group-level inferences are usually made using the results of separate first-level analyses to decrease the computational cost. This is the so-called summary statistics approach. The widely used frequentist group analysis in [61] employed parameter estimates from the general linear model regression as summary statistics, which, however, is only optimal under certain conditions because it requires balanced designs. In contrast, Woolrich et al. [55] used Bayesian inference to incorporate the summary statistics without such restrictions, passing up information on both the effect sizes from the lower levels and their variances.
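The benefit of passing variances up to the group level can be illustrated with a deliberately simplified sketch. It assumes a two-level normal model with known first-level variances, a fixed between-subject variance, and a flat prior on the group mean; this is a textbook simplification for illustration only, not the method of [55] or [61], and the function name and numbers are hypothetical.

```python
def group_posterior(effects, variances, between_var):
    """Posterior mean and variance of the group effect under a flat prior.

    Each subject estimate is modeled as Normal(group_mean,
    first_level_variance + between_var), so subjects are weighted by
    their total precision.
    """
    weights = [1.0 / (v + between_var) for v in variances]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    return mean, 1.0 / total


# Hypothetical first-level summary statistics for four subjects;
# the last subject has a very noisy estimate.
effects = [1.0, 1.2, 0.8, 3.0]
variances = [0.1, 0.1, 0.1, 2.0]

weighted_mean, weighted_var = group_posterior(effects, variances, between_var=0.05)
simple_mean = sum(effects) / len(effects)

print(f"unweighted mean of effects: {simple_mean:.3f}")
print(f"precision-weighted mean:    {weighted_mean:.3f}")
print(f"posterior variance:         {weighted_var:.3f}")
```

When only the effect sizes are passed up, the noisy subject pulls the group estimate as strongly as any other; with the variances available, its influence is reduced in proportion to its uncertainty.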

#### **4.3. Bayesian brain: Cognition, perception, uncertainty, behavior and neural representations**

The neuroscience principle that the nervous system of animals and humans is adapted to the statistical properties of the environment is reflected across all organizational levels, from the activity of single neurons to networks and behavior [62]. A critical aim of the nervous system is to estimate the state of the world from incomplete and noisy data. During this process, a challenging issue that brains must handle is uncertainty. For example, when we perceive the physical world, make a decision, and take an action, there is uncertainty associated with the sensory system, the motor apparatus, one's own knowledge, and the world itself. Probability has played a central role in modeling perception and cognition. Specifically, the Bayesian framework of statistical estimation provides a systematic way of dealing with these uncertainties for optimal estimation. Comparing optimal and actual behavior leads to a better understanding of how the nervous system works. Bayesian models have been used to explain results in perception, cognition, behavior, and neural coding in diverse forms [63-67], differing in their assumptions about the world variables and how they relate to each other. However, the key idea shared by all these Bayesian models is that different sources of information can be integrated to estimate the relevant variables. Thus the Bayesian approach unifies an enormous range of otherwise apparently disparate behavior within one coherent framework.

A key aim of cognitive science is to reverse-engineer the mind. Cognition modeling based on the probabilistic method begins by identifying ideal solutions to inductive problems, and then uses algorithms to model the mental processes that approximate these solutions. Neural processes are viewed as mechanisms for implementing these algorithms. Probabilistic models of cognition pursue a top-down strategy, which begins with abstract principles that allow agents to solve problems posed by the world (i.e. the functions that minds perform) and then aims to reduce these principles to psychological and neural processes. This analysis allows greater flexibility in exploring the representations and inductive biases underlying human cognition. In contrast, connectionist models usually follow a bottom-up approach that starts with a characterization of a neural mechanism and explores what macro-level functional phenomena might emerge. Given a formal characterization of an inductive problem, a probabilistic model specifies the hypotheses under investigation, the relation between these hypotheses and observable data, and the prior probability of each hypothesis. By assuming different prior distributions over the hypotheses, different inductive biases can be captured. Although the link between probabilistic inference and neural computation/function is drawing the attention of modelers from different backgrounds, little is known about how these structured representations can be implemented in neural systems for high-level cognition.


Ample results in perception have shown that the nervous system represents its uncertainty about the true state of the world probabilistically, and that such representations are utilized in two related cognitive areas: information fusion and perceptual decision-making. To fuse information from different sources about the same object, inferences about the object should rely on these sources in proportion to their reliability, as demonstrated in multisensory integration [68, 69], where the sources are different sensory modalities, and in the combination of information coming from the senses with information stored in memory [70, 71]. Within the Bayesian framework, the organism calculates probability distributions over parameters describing the state of the world, with the computation based on sensory information and knowledge accrued from experience. Although the particular sensory information and prior knowledge are specific to the task, the computation follows the same probability rules. Psychological evidence at the behavioral level that animals and humans represent uncertainty during perceptual processes has motivated research into the neural underpinnings of such probabilistic representations. That is, how do neurons compute with sensory uncertainty information or even full probability distributions? One scheme is probabilistic population coding [72], which makes use of the likelihood function encoded in neural population activity (as described below). Beyond perception, the neural implementation of cognitive probabilistic models remains largely unexplored [64, 73].
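As a worked instance of such fusion, consider the standard textbook case of two cues, written here under assumptions that are not spelled out in the text: each cue provides a measurement *x*<sub>1</sub>, *x*<sub>2</sub> of the same stimulus *s* with independent Gaussian noise of variance *σ*<sub>1</sub><sup>2</sup>, *σ*<sub>2</sub><sup>2</sup>, and the prior over *s* is flat. The posterior over *s* is then Gaussian with mean and variance

$$\hat{s} = \frac{x_1/\sigma_1^2 + x_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2}, \qquad \frac{1}{\sigma_{12}^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}.$$

The more reliable cue (smaller variance) receives the larger weight, and the fused estimate is never less certain than either cue alone.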

*Neural/Neuronal Models of Probabilistic Computation (Probabilistic Population Coding)*: Perception modeling has the potential to constrain the neural implementation of perceptual computation. In order to form a neural model from a behavioral model, one first needs to define the relevant level of neural variables. A common candidate is the level of spike counts in sensory and decision-making neurons. For example, an oriented stimulus *s* might elicit a set of spike counts **r**=(*r*1, …, *rn*) in a population of orientation-tuned cells in primary visual cortex. There is trial-to-trial variability in the population activity, which can be described by a distribution *p*(**r**∣*s*). The connection between **r** and *s* is that the latter (the scalar stimulus estimate in a behavioral model) is the value maximizing the neural likelihood function, L(*s*)=*p*(**r**∣*s*) [74]. The likelihood function L(*s*) has a width, *σ*, reflecting the observer's uncertainty about the stimulus. The variable **r** is high-dimensional, with sufficient degrees of freedom to encode *σ* on a trial-by-trial basis. With neural likelihood functions, Bayesian models of behavior can be mapped to neural operations. This scheme has been successfully applied to cue combination [72], decision-making [75], etc. Some alternative approaches for encoding likelihood functions or probability distributions using neurons have also been proposed in the literature [65, 66, 76, 77].
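A small Python sketch can illustrate how a likelihood function and its width might be read out from spike counts. It assumes independent Poisson neurons with Gaussian tuning curves, which is one common modeling choice and not necessarily the one used in [72] or [74]; the preferred orientations, tuning width, gains, and function names are all hypothetical.

```python
import math
import random

PREFERRED = [-80 + 10 * i for i in range(17)]  # hypothetical preferred orientations (deg)
TUNING_WIDTH = 20.0                            # hypothetical tuning-curve width (deg)
GRID = list(range(-60, 61))                    # candidate stimulus values (deg)


def mean_counts(s, gain):
    """Expected spike count of each neuron: Gaussian tuning plus a baseline."""
    return [gain * (math.exp(-0.5 * ((s - p) / TUNING_WIDTH) ** 2) + 0.05)
            for p in PREFERRED]


def log_likelihood(r, s, gain):
    """log p(r | s) for independent Poisson neurons, up to terms constant in s."""
    return sum(ri * math.log(fi) - fi for ri, fi in zip(r, mean_counts(s, gain)))


def decode(r, gain):
    """Return the maximum-likelihood stimulus and the width of L(s) on the grid."""
    logs = [log_likelihood(r, s, gain) for s in GRID]
    peak = max(logs)
    like = [math.exp(v - peak) for v in logs]
    total = sum(like)
    probs = [v / total for v in like]
    s_hat = GRID[logs.index(peak)]
    mean = sum(p * s for p, s in zip(probs, GRID))
    sigma = math.sqrt(sum(p * (s - mean) ** 2 for p, s in zip(probs, GRID)))
    return s_hat, sigma


def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(1)
s_true = 10
for gain in (10.0, 40.0):
    r = [poisson_sample(f, rng) for f in mean_counts(s_true, gain)]
    s_hat, sigma = decode(r, gain)
    print(f"gain={gain:4.0f}  s_hat={s_hat:4d}  likelihood width={sigma:.2f} deg")
```

In this sketch a higher population gain produces a narrower likelihood, so the same vector of spike counts carries both the stimulus estimate and the uncertainty about it.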

### **5. Graphical models, Bayesian and dynamic Bayesian networks**

#### **5.1. Mathematical description and solution**


Graphical models, at the intersection of probability theory and graph theory, provide a natural tool for handling the uncertainty and complexity that frequently occur in applied mathematics, engineering, and scientific domains involving computation. Many classical multivariate probabilistic techniques are special cases of general graphical models, such as mixture models, factor analysis, hidden Markov models, Kalman filters, and Ising models [35, 78, 79]. A graph consists of *nodes* connected by *links* (also called *arcs* or *edges*). The nodes in probabilistic graphical models represent random variables, and the links or arcs express probabilistic relationships between these variables. The absence of an arc represents a conditional independence assumption. This provides a compact representation of the joint probability distribution over all of the random variables, which can be decomposed into a product of factors, each depending on a subset of the variables. One category of graphical models is *Markov Random Fields* (MRFs), also known as *undirected graphical models*, in which the links do not have arrows and thus carry no directional significance. In an MRF, for example, two sets of nodes *A* and *B* are conditionally independent given a third set, *C*, if every path between a node in *A* and a node in *B* passes through a node in *C*. The other major class is *Bayesian Networks* or *Belief Networks* (BNs), also known as *directed graphical models*, in which the links carry arrows indicating a particular directionality in the notion of independence. Despite this added complexity, directed models have several advantages over undirected models; the most important is that they can express causal relationships between random variables, whereas undirected models are more suitable for expressing soft constraints between random variables.

In Bayesian Networks, if there is an arrow from node *X* to node *Y*, *X* is said to be a *parent* of *Y*. Each node *Xi* is associated with a conditional probability distribution (CPD) P(*Xi* | *Parents*(*Xi*)), which quantifies the effect of the parents on the node. If the variables are discrete, the CPD is represented as a conditional probability table (CPT) that lists the probability of each value of the child node for each combination of its parents' values. The network can be viewed either as a representation of the joint probability distribution (JPD) or as an encoding of a collection of conditional independence statements. Writing the joint distribution as P(*x*1, …, *xn*), we have

$$\mathbf{P}(x_1, \dots, x_n) = \prod_{i=1}^{n} \mathbf{P}\big(x_i \mid \text{parents}(X_i)\big) \tag{7}$$

where *parents*(*Xi*) denotes the values of *Parents*(*Xi*) appearing in *x*1, …, *xn*. The CPTs supply the conditional probability factors on the right-hand side of Eq. (7). In general, given *n* binary nodes, the full joint distribution would require O(2<sup>*n*</sup>) space to represent, but because of the independence assumptions encoded in the graph, the factored form requires only O(*n*2<sup>*k*</sup>) space, where *k* is the maximum fan-in of a node. Fewer parameters make learning easier.
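As an illustration of Eq. (7) and of the space argument, the sketch below evaluates the joint probability of a small network as a product of CPT entries and compares the number of free parameters in the factored form with that of the full joint table. The three-node network, its CPT values, and all function names are assumptions made for this example only.

```python
from itertools import product

# Hypothetical three-node network A -> C <- B with binary variables.
parents = {"A": (), "B": (), "C": ("A", "B")}

# cpt[node][parent_values] = P(node = 1 | parents = parent_values)  (made-up values)
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}


def joint_probability(assignment):
    """Eq. (7): product over nodes of P(x_i | parents(X_i))."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


def marginal(node, value):
    """Brute-force marginal: sum the joint over all other variables."""
    names = list(parents)
    total = 0.0
    for values in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[node] == value:
            total += joint_probability(assignment)
    return total


def parameter_counts(structure):
    """Free parameters for binary nodes: (factored form, full joint table)."""
    factored = sum(2 ** len(pa) for pa in structure.values())
    full = 2 ** len(structure) - 1
    return factored, full


print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.3 * 0.4 * 0.7
print(marginal("C", 1))
print(parameter_counts(parents))

# A hypothetical 20-node chain X0 -> X1 -> ... -> X19 (fan-in k = 1).
chain20 = {i: ((i - 1,) if i > 0 else ()) for i in range(20)}
print(parameter_counts(chain20))
```

For the three-node network the saving is negligible, but for the 20-node chain the factored form needs 39 parameters against more than a million for the full table, which is the gap between O(*n*2<sup>*k*</sup>) and O(2<sup>*n*</sup>).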


Note that Bayesian networks do not necessarily imply Bayesian statistics. In fact, it is common to use frequentist methods to estimate the parameters of the CPDs. Bayesian networks are so called because they use Bayes' rule for probabilistic inference. Nevertheless, Bayes nets are a useful representation for hierarchical Bayesian models, which form the foundation of applied Bayesian statistics. Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding overfitting.

Dynamic Bayesian Networks (DBNs) are directed graphical models of stochastic processes; they generalize hidden Markov models (HMMs) and linear dynamical systems (LDSs). DBNs represent the hidden (and observed) state in terms of state variables, which can have complex interdependencies. The simplest DBN is an HMM, with one discrete hidden node and one discrete or continuous observed node per time slice. An LDS has the same topology as an HMM, but all the nodes are assumed to have linear-Gaussian distributions. The Kalman filter performs online filtering in this model.
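To make the notion of online filtering in the simplest DBN concrete, the following sketch implements the forward (filtering) recursion of an HMM with discrete observations. The two-state model, its probabilities, and the name `hmm_filter` are hypothetical choices for illustration.

```python
def hmm_filter(prior, transition, emission, observations):
    """Forward (filtering) recursion: P(hidden_t | obs_1..t) for each t.

    prior[i]         = P(h_1 = i)
    transition[i][j] = P(h_{t+1} = j | h_t = i)
    emission[i][o]   = P(obs = o | h = i)
    """
    n = len(prior)
    belief = list(prior)
    filtered = []
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: propagate the belief through the transition model.
            belief = [sum(belief[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        # Update: weight by the observation likelihood, then normalize (Bayes' rule).
        unnorm = [belief[j] * emission[j][obs] for j in range(n)]
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
        filtered.append(belief)
    return filtered


# Hypothetical two-state example: state 0 = "rest", state 1 = "active".
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]  # columns: observation 0 (low), 1 (high)
beliefs = hmm_filter(prior, transition, emission, [1, 1, 0])
for belief in beliefs:
    print([round(b, 3) for b in belief])
```

The Kalman filter follows the same predict-and-update pattern, with the sums replaced by the closed-form Gaussian updates of a linear-Gaussian model.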

A graphical model specifies a complete JPD over all the variables, so any inference query can be answered by marginalization, i.e. summing out the irrelevant variables. However, the JPD has size O(2<sup>*n*</sup>), where *n* is the number of nodes and each node is assumed to have 2 states, so summing over the JPD takes exponential time. More efficient methods are thus desirable, including variable elimination [80], dynamic programming [81], and approximation algorithms [34, 35] such as Monte Carlo and variational methods. For learning, a BN has two components that need to be specified: the graph topology (structure) and the parameters (the CPD of each node). Both can be learned from data, though learning the structure is much harder than learning the parameters. Likewise, learning when some of the nodes are hidden, or when data are missing, is much harder than when everything is observed. This gives rise to four cases and their respective algorithms: 1) known structure, full observability: Maximum Likelihood Estimation; 2) known structure, partial observability: Expectation Maximization (EM) algorithm; 3) unknown structure, full observability: search through model space; 4) unknown structure, partial observability: EM combined with search through model space.
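The first of the four cases (known structure, full observability) reduces to counting for discrete variables: the maximum-likelihood estimate of each CPT entry is a relative frequency. A minimal sketch follows, assuming binary nodes and samples stored as dictionaries; the function name `fit_cpts` and the toy data are hypothetical.

```python
from collections import Counter


def fit_cpts(parents, samples):
    """Maximum-likelihood CPTs for binary nodes: relative frequencies per parent configuration.

    Returns cpt[node][parent_values] = estimated P(node = 1 | parents = parent_values).
    Parent configurations never seen in the data are left out (no smoothing).
    """
    cpt = {}
    for node, pa in parents.items():
        totals, ones = Counter(), Counter()
        for sample in samples:
            key = tuple(sample[p] for p in pa)
            totals[key] += 1
            ones[key] += sample[node]
        cpt[node] = {key: ones[key] / totals[key] for key in totals}
    return cpt


# Known structure A -> B and eight fully observed samples (made-up data).
parents = {"A": (), "B": ("A",)}
samples = [
    {"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 0, "B": 0},
    {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 1, "B": 1},
]
print(fit_cpts(parents, samples))
# {'A': {(): 0.5}, 'B': {(0,): 0.25, (1,): 0.75}}
```

With hidden nodes or missing values these counts are no longer available directly, which is why the partially observed cases call for EM, where expected counts take their place.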

#### **5.2. Applications and validity in neuroimaging and aging research**

*Functional MRI:* Bayesian networks (BNs) were used in [82] to learn the structure of effective connectivity involved in an fMRI experiment. The approach is exploratory, does not require an a priori hypothesized model, and was validated using synthetic data and fMRI data collected in silent word reading and counting Stroop tasks. However, BNs provide a single snapshot of the effective connectivity of the entire experiment and thus are not suitable for accurately inferring the temporal characteristics of connectivity. Dynamic Bayesian networks (DBNs) were then proposed [83] to learn the structure of effective brain connectivity in an exploratory way. A Markov chain was employed to model fMRI time series for the discovery of temporal interactions among brain regions. DBNs yield more accurate and informative brain connectivity than earlier methods, since the temporal characteristics of the time series are explicitly accounted for. The functional structures captured on two fMRI datasets are consistent with previous findings in the literature and more accurate than those identified by BNs. Li et al. [84] aimed to extrapolate BN results from one subject to an entire population while addressing inter-subject, within-group variability. The authors explored two group analysis approaches in fMRI using DBNs: constructing a group network based on the assumption of a common structure across individuals, and identifying significant structural features by examining individually trained DBNs. The methods were validated on subjects performing a motor task at three progressive levels of difficulty, and statistically significant, biologically plausible connectivity was detected.


*Structural MRI:* Detecting interactions among brain regions from structural MRI presents a major challenge in computational neuroanatomy. Instead of traditional univariate analysis for brain morphometry, a network analysis based on a BN representation of variables was investigated in [85] to take into account interactions among brain structures in explaining a clinical outcome. Results on a cross-sectional study of mild cognitive impairment (MCI) demonstrated nonlinear and complex multivariate associations among morphological changes in the left hippocampus, the right thalamus, and the presence of MCI. This indicates that the BN has the potential to predict the presence of MCI from structural MRI. Chen et al. [86] proposed using DBNs to represent evolving inter-regional dependencies and identify longitudinal morphological changes in the human brain. The main advantage of DBN modeling is that it can represent complicated interactions among temporal processes. The approach was validated by analyzing a simulated atrophy study: only a small number of samples were needed to detect the ground-truth temporal model. The method was also applied to a longitudinal study of normal aging and MCI, the Baltimore Longitudinal Study of Aging. It was shown that interactions among regional volume-change rates for the MCI group were different from those for the normal aging group.

*Further Development of Sparse BNs and Time-Varying DBNs:* There have been several recent developments in the area of BNs and DBNs. A sparse BN for effective connectivity modeling was investigated in [87], with a novel formulation for the structure learning of BNs. An L1-norm penalty term imposes sparsity, and another penalty ensures that the learned networks satisfy the required property of BNs (i.e. that the graph is directed and acyclic). Both theoretical analysis and experiments on moderate and large benchmark networks demonstrate that the approach has enhanced learning accuracy and scalability compared with existing algorithms. The authors also applied the proposed method to brain images of 42 Alzheimer's disease (AD) patients and 67 normal controls (NC); the revealed effective connectivity of AD was shown to differ from that of NC, for example, in the global-scale effective connectivity, in the intra-lobe, inter-lobe, and inter-hemispheric effective connectivity distributions, and in the effective connectivity corresponding to specific brain regions. Graphical model results are often based on static networks, i.e. networks assumed to have invariant topology. In certain situations, it is desirable to understand and quantitatively model the dynamic topological and functional properties of biological or brain networks. This yields time- or condition-specific networks that are time-varying or non-stationary. To capture the dynamic causal influences between covariates, time-varying dynamic Bayesian networks (TV-DBNs) were proposed [88]. They model the varying directed dependency structures underlying non-stationary biological/neural time series. A kernel-reweighted L1-regularized auto-regressive procedure was employed, with desirable properties including computational efficiency and asymptotic consistency. Application of TV-DBNs to simulated data and to brain EEG signals recorded in response to visual stimuli shows that the technique can identify networks that rewire over time as the system's dynamics change.
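The idea behind a kernel-reweighted L1-regularized auto-regressive estimate can be illustrated on a deliberately simplified problem. The sketch below is not the procedure of [88], which estimates multivariate networks; it assumes a single scalar series with one time-varying coefficient, for which the weighted lasso has a closed-form soft-thresholding solution. The function names, the Gaussian kernel, and all parameter values are assumptions for this example.

```python
import math
import random


def soft_threshold(z, threshold):
    """Soft-thresholding operator, the closed-form solution of a scalar lasso problem."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


def time_varying_ar1(x, bandwidth, lam):
    """Estimate a(t) in x[t+1] = a(t) * x[t] + noise at every transition t.

    For each target time t* this minimizes
        sum_t w_t(t*) * (x[t+1] - a * x[t])**2 + lam * |a|
    with Gaussian kernel weights w_t(t*) normalized to sum to one.
    """
    n = len(x) - 1  # number of observed transitions
    estimates = []
    for t_star in range(n):
        w = [math.exp(-0.5 * ((t - t_star) / bandwidth) ** 2) for t in range(n)]
        total = sum(w)
        w = [wi / total for wi in w]
        sxy = sum(w[t] * x[t] * x[t + 1] for t in range(n))
        sxx = sum(w[t] * x[t] * x[t] for t in range(n))
        estimates.append(soft_threshold(sxy, lam / 2.0) / sxx if sxx > 0 else 0.0)
    return estimates


# Simulated series whose coupling flips sign halfway through (made-up parameters).
rng = random.Random(0)
x = [0.0]
for t in range(200):
    a_true = 0.8 if t < 100 else -0.8
    x.append(a_true * x[-1] + rng.gauss(0.0, 1.0))

a_hat = time_varying_ar1(x, bandwidth=15.0, lam=0.05)
early = sum(a_hat[10:80]) / 70
late = sum(a_hat[120:190]) / 70
print(early, late)
```

The kernel weights localize the fit in time, so the estimate can follow the change in coupling, while the L1 penalty shrinks weak couplings to exactly zero. In a multivariate setting the same two ingredients yield a sparse, time-indexed set of directed edges.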


### **6. Dynamical brain system**

#### **6.1. Attractors and brain dynamics**

Computational neuroscience describes the network dynamics of neurons and synapses with models that reproduce emergent properties or predict observed neurophysiology (e.g. single- and multiple-cell recordings, EEG, MEG, fMRI) and associated behavior [27]. Attractor theory [89] is a powerful theoretical framework that can capture the neural computations inherent in cognitive functions such as attention, memory, and decision making. It is based on mathematical models formulated at the level of neuronal spiking and synaptic activity. An attractor of a dynamical system is a subset of the state space toward which orbits originating from typical initial conditions evolve over time. It is common for a dynamical system to have more than one attractor. For each such attractor, its *basin of attraction* is the set of initial conditions that give rise to long-time behavior approaching that attractor. Reduced depth in the basins of attraction of prefrontal cortical networks, together with the effects of noise, could result in cognitive symptoms such as poor short-term memory and attention. The hypothesis is that reduced depth in the basins of attraction would make short-term memory unstable: the continuing firing of the neurons that implement short-term memory would sometimes cease, and the system, under the influence of noise, would fall out of the short-term memory state back into spontaneous firing. Top-down attention requires a short-term memory to hold the object of attention in mind; this memory is the source of the top-down attentional bias that influences competition in other networks receiving incoming signals. Therefore, disruption of short-term memory is also predicted to impair the stability of attention.
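The definitions of attractor and basin of attraction can be made concrete with a one-dimensional toy system. The sketch below assumes the dynamics dx/dt = x - x³, which are chosen purely for illustration and are not a model of any neural circuit; the system has two attractors, at x = -1 and x = +1, separated by an unstable fixed point at x = 0.

```python
def settle(x0, dt=0.01, steps=5000):
    """Integrate dx/dt = x - x**3 with forward Euler and return the final state."""
    x = x0
    for _ in range(steps):
        x += dt * (x - x ** 3)
    return x


def attractor_of(x0, tol=1e-3):
    """Label the attractor reached from x0: -1.0 or 1.0, or None if neither is reached."""
    x = settle(x0)
    for fixed_point in (-1.0, 1.0):
        if abs(x - fixed_point) < tol:
            return fixed_point
    return None


for x0 in (-2.0, -0.3, 0.2, 1.5):
    print(x0, "->", attractor_of(x0))
```

Every negative initial condition ends at -1 and every positive one at +1, so the two basins of attraction are the negative and positive half-lines. The point x = 0 lies on the boundary between them and belongs to neither basin.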

#### **6.2. Attractor dynamics in aging**

The stochastic dynamical theory of brain function described above has implications for aging research. In the following, we describe several factors that change with aging, their effects on attractor dynamics, and the associated hypotheses about cognitive aging [90]. The stochastic dynamical approach can also provide a way to test combinations of pharmacological treatments, which may together help to minimize the cognitive symptoms of aging.

*NMDA Receptor Hypofunction*: NMDA receptor functionality tends to decrease with aging [91]. This would act to reduce the depth of the basins of attraction, both by reducing the firing rate of the neurons in the active attractor and by decreasing the strength of the potentiated synaptic connections that support each attractor. The reduced depth in the basins of attraction could have several effects on cognition in aging. First, the stability of short-term memory networks would be impaired, which may cause difficulty in holding items in short-term memory for long. Second, top-down attention would be impaired. Third, the recall of information from episodic memory systems in the temporal lobe would be impaired [92]. Lastly, any reduction in the firing rate of the pyramidal cells caused by NMDA receptor hypofunction would itself be likely to impair new learning involving long-term potentiation (LTP).

*Dopamine*: D1 receptor blockade in the prefrontal cortex can impair short-term memory [93]. Part of the reason may be that D1 receptor blockade can decrease NMDA receptor-activated ion channel conductances. Hence part of the role of dopamine in the prefrontal cortex in short-term memory can be accounted for by a decreased depth in the basins of attraction of prefrontal attractor networks [94]. The decreased depth would be caused both by the decreased firing rate of the neurons and by the reduced efficacy of the modified synapses, since their ion channels would be less conductive. Dopaminergic function in the prefrontal cortex may decline with aging [95], which could contribute to the reduced short-term memory and attention in aging.

*Impaired Synaptic Modification*: Impaired long-lasting associative synaptic modification may also contribute to the cognitive changes in aging, as LTP is more difficult to achieve in older animals and decays more quickly [91, 96]. This would tend to make the synaptic strengths that support an attractor weaker, and weaker still over time, and would thus directly reduce the depth of the attractor basins. This would impact episodic memory, the memory for particular past episodes. The reduction of synaptic strength over time could also affect short-term memory, which requires that the synapses supporting a short-term memory attractor be modified in the first place using LTP, before the attractor is used [97].

*Cholinergic Function*: Acetylcholine in the neocortex originates largely from the cholinergic neurons of the basal magnocellular forebrain nuclei of Meynert. The correlation of clinical dementia ratings with reductions in a number of cortical cholinergic markers, such as choline acetyltransferase, muscarinic and nicotinic acetylcholine receptor binding, and levels of acetylcholine, supports the cholinergic hypothesis of memory dysfunction in senescence and AD [98]. The cholinergic system could also alter cerebral cortex function in ways that can be illuminated by stochastic neurodynamics [99]. Enhancing cholinergic function would likely help to reduce the instability of the attractor networks involved in short-term memory and attention that may occur in aging.
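The common thread in the hypotheses above is that shallower basins of attraction make a noisy system more likely to leave its current attractor. The following sketch illustrates this with a one-dimensional double-well caricature, not with the spiking attractor networks of the cited work. The potential, the noise level, the escape criterion, and all parameter values are assumptions; x = +1 stands in for the short-term memory state.

```python
import math
import random


def escape_fraction(depth, noise=0.3, trials=200, steps=1000, dt=0.05, seed=0):
    """Fraction of trials in which the state leaves the 'memory' attractor at x = +1.

    Dynamics: dx = -U'(x) dt + noise * sqrt(dt) * xi, with the double-well potential
    U(x) = depth * (x**4 / 4 - x**2 / 2), whose barrier height is depth / 4.
    """
    rng = random.Random(seed)
    escapes = 0
    for _trial in range(trials):
        x = 1.0
        for _step in range(steps):
            drift = depth * (x - x ** 3)  # equals -U'(x)
            x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            if x < -0.5:  # the state has crossed into the other basin
                escapes += 1
                break
    return escapes / trials


deep = escape_fraction(depth=1.0)
shallow = escape_fraction(depth=0.2)
print("deep basin:", deep, "shallow basin:", shallow)
```

With the same noise, the shallow-basin system leaves the memory state in a much larger fraction of trials than the deep-basin one. In this caricature, any factor that lowers `depth` (for example reduced firing rates or weaker potentiated synapses) plays the role of the age-related changes described above.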

### **7. Conclusions**


Brain structure and activity can be described at various levels of resolution. Recent developments in biotechnology have given us the ability to measure and record population neuronal activity with more precision and accuracy than ever before, allowing researchers to perform detailed analyses that may have been impossible just a few years ago. Brain imaging techniques, such as EEG, MEG, and structural/functional MRI, open macroscopic windows on processes in the working brain. These methods yield high-dimensional data sets that are organized in space and time [100]. This creates a pressing need for analysis methods that extract interpretable signals and information from these large data sets, harvesting the full richness of the multi-modality measurements of the multi-scale brain. One future direction on the computational side is to develop high-dimensional analysis methods for mining and modeling neuroscience data, and thus to assess and interpret properties of the joint data set combining imaging and behavior/stimulus measurements. The objective is to further our understanding of how the neural structures of humans and other animals develop, age, and create systems able to accomplish basic and complex behavioral tasks.

### **Acknowledgements**

Preparation of this chapter is supported in part by a grant from the National Institute on Aging, K25AG033725.

### **Author details**

Michelle Yongmei Wang\*

Address all correspondence to: ymw@illinois.edu

Departments of Statistics, Psychology, and Bioengineering, Beckman Institute, University of Illinois at Urbana-Champaign, U.S.A.

### **References**

[8] Granger, C. Investigating causal relations by econometric models and cross-spectral methods, *Econometrica,* (1969), 37: 424-438.

[9] Seth, A. K. A MATLAB toolbox for Granger causal connectivity analysis, *J Neurosci Methods,* (2010), 186: 262-273.

[10] Quinn, C. J., Coleman, T. P., Kiyavash, N., & Hatsopoulos, N. G. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings, *J Comput Neurosci,* (2011), 30: 17-44.

[11] Deshpande, G., LaConte, S., James, G. A., Peltier, S., & Hu, X. Multivariate Granger causality analysis of fMRI data, *Hum Brain Mapp,* (2009), 30: 1361-1373.

[12] Akaike, H. A new look at the statistical model identification, *IEEE Transactions on Automatic Control,* (1974), 19: 716-723.

[13] Schwarz, G. Estimating the dimension of a model, *Annals of Statistics,* (1978), 5: 461-464.

[14] Kaminski, M., Ding, M., Truccolo, W. A., & Bressler, S. L. Evaluating causal relations in neural systems: Granger causality, directed transfer function and statistical assessment of significance, *Biological Cybernetics,* (2001), 85: 145-157.

[15] Schreiber, T. Measuring information transfer, *Phys Rev Lett,* (2000), 85: 461-464.

[16] Vicente, R., Wibral, M., Lindner, M., & Pipa, G. Transfer entropy: a model-free measure of effective connectivity for the neurosciences, *J Comput Neurosci,* (2011), 30: 45-67.

[17] Barnett, L., Barrett, A. B., & Seth, A. K. Granger causality and transfer entropy are equivalent for Gaussian variables, *Phys Rev Lett,* (2009), 103: 238701.

[18] Marko, H. The bidirectional communication theory: a generalization of information theory, *IEEE Transactions on Communications,* (1973), 21: 1345-1351.

[19] Massey, G. Causality, feedback and directed information, in *Proceedings of International Symposium on Information Theory and Its Applications*, (1990), pp. 27-30.

[20] Amblard, P. O., & Michel, O. J. On directed information theory and Granger causality graphs, *J Comput Neurosci,* (2011), 30: 7-16.

[21] Seghouane, A. K., & Amari, S. Identification of directed influence: Granger causality, Kullback-Leibler divergence, and complexity, *Neural Comput,* (2012), 24: 1722-1739.

[22] Kramer, G. *Directed Information for Channels with Feedback*, Ph.D. Thesis, University of Manitoba, Canada, (1998).

[23] Liu, Y., & Aviyente, S. The relationship between transfer entropy and directed information, in *IEEE Statistical Signal Processing Workshop*, (2012), pp. 73-76.

[1] Purves, D., Augustine, G. J., Fitzpatrick, D., Hall, W. C., LaMantia, A.-S., & White, L.

[3] Logothetis, N. K., Pauls, J., Augath, M., Trinath, T., & Oeltermann, A. Neurophysio‐ logical investigation of the basis of the fMRI signal, *Nature,* (2001), 412: 150-157.

[5] Friston, K. J. Functional and effective connectivity: a review, *Brain Connectivity,*

[6] Sporns, O. *Discovering the Human Connectome*, Massachusetts Institute of Technology,

[7] Shannon, C. E. A mathematical theory of communication, *Bell System Technical Jour‐*

[4] Sporns, O. *Networks of the Brain*, Massachusetts Institute of Technology, (2011).

and create systems able to accomplish basic and complex behavioral tasks.

200 Functional Brain Mapping and the Endeavor to Understand the Working Brain

**Acknowledgements**

K25AG033725.

**Author details**

**References**

Michelle Yongmei Wang\*

Address all correspondence to: ymw@illinois.edu

E. *Neuroscience*, Sinauer Associates, Inc., (2008).

[2] Ramachandran, V. S. *Encyclopedia of the Human Brain*, (2002), 3.

Illinois at Urbana-Champaign, U.S.A.

(2011), 1: 13-36.

*nal,* (1948), 27: 379-423.

(2012).


[24] Li, X., Coyle, D., Maguire, L., Watson, D. R., & McGinnity, T. M. Gray matter concentration and effective connectivity changes in Alzheimer's disease: a longitudinal structural MRI study, *Diagnostic Neuroradiology,* (2011), 53: 733-748.

[25] Miao, X., Wu, X., Li, R., Chen, K., & Yao, L. Altered connectivity pattern of hubs in default-mode network with Alzheimer's disease: an Granger causality modeling approach, *PLoS One,* (2011), 6: e25546.

[26] Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., & Kaiser, J. Transfer entropy in magnetoencephalographic data: quantifying information flow in cortical and cerebellar networks, *Prog Biophys Mol Biol,* (2011), 105: 80-97.

[27] Rabinovich, M. I., Friston, K. J., & Varona, P. *Principles of Brain Dynamics*, Massachusetts Institute of Technology, (2012).

[28] Vakorin, V. A., Ross, B., Krakovska, O., Bardouille, T., Cheyne, D., & McIntosh, A. R. Complexity analysis of source activity underlying the neuromagnetic somatosensory steady-state response, *Neuroimage,* (2010), 51: 83-90.

[29] Zhang, Y.-C. Complexity and 1/f noise. A phase space approach, *Journal de Physique I France,* (1991), 1: 971-977.

[30] Abasolo, D., Hornero, R., Espino, P., Alvarez, D., & Poza, J. Entropy analysis of the EEG background activity in Alzheimer's disease patients, *Physiological Measurement,* (2006), 27: 241-253.

[31] Misic, B., Mills, T., Taylor, M. J., & McIntosh, A. R. Brain noise is task-dependent and region-specific, *Journal of Neurophysiology,* (2010), 104: 2667-2676.

[32] Yang, A. C., Huang, C.-C., Yeh, H.-L., Liu, M.-E., Hong, C.-J., & Tu, P.-C. Complexity of spontaneous BOLD activity in default mode network is correlated with cognitive function in normal male elderly: a multiscale entropy analysis, *Neurobiology of Aging,* (2013), 34: 428-438.

[33] Park, J. H., Kim, S., Kim, C. H., Cichocki, A., & Kim, K. Multiscale entropy analysis of EEG from patients under different pathological conditions, *Fractals,* (2007), 15: 399-404.

[34] Robert, C., & Casella, G. *Monte Carlo Statistical Methods*, Berlin: Springer-Verlag, (2004).

[35] Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. An introduction to variational methods for graphical models, *Machine Learning,* (1999), 37: 183-233.

[36] Ormerod, J. T., & Wand, M. P. Explaining variational approximations, *The American Statistician,* (2010), 64: 140-153.

[37] Wang, Y. *Statistical Shape Analysis for Image Segmentation and Physical Model-Based Non-Rigid Registration*, Ph.D. Thesis, Department of Electrical Engineering, Yale University, (1999).

[38] Woolrich, M. W., Jbabdi, S., Patenaude, B., Chappell, M., Makni, S., & Behrens, T. Bayesian analysis of neuroimaging data in FSL, *Neuroimage,* (2009), 45: S173-S186.

[39] Wang, Y., & Staib, L. H. Boundary finding with prior shape and smoothness models, *IEEE Trans. on Pattern Analysis and Machine Intelligence,* (2000), 22: 738-743.

[40] Wang, Y. M., & Xia, J. Unified framework for robust estimation of brain networks from fMRI using temporal and spatial correlation analyses, *IEEE Transactions on Medical Imaging,* (2009), 28: 1296-1307.

[41] Wang, Y. M. Modeling and nonlinear analysis in fMRI via statistical learning, in *Advanced Image Processing in Magnetic Resonance Imaging*, Landini, L., Positano, V., & Santarelli, M. F., Eds., Marcel Dekker International Publisher, (2005), pp. 565-586.

[42] Genovese, C. A Bayesian time-course model for functional magnetic resonance imaging data (with discussion), *Journal of the American Statistical Association,* (2000), 95: 691-703.

[43] Gossl, C., Fahrmeir, L., & Auer, D. P. Bayesian modeling of the hemodynamic response function in BOLD fMRI, *Neuroimage,* (2001), 14: 140-148.

[44] Friston, K. J. Bayesian estimation of dynamical systems: an application to fMRI, *Neuroimage,* (2002), 16: 513-530.

[45] Ciuciu, P., Poline, J. B., Marrelec, G., Idier, J., Pallier, C., & Benali, H. Unsupervised robust nonparametric estimation of the hemodynamic response function for any fMRI experiment, *IEEE Transactions on Medical Imaging,* (2003), 22: 1235-1251.

[46] Goutte, C., Nielsen, F. A., & Hansen, L. K. Modeling the haemodynamic response in fMRI using smooth FIR filters, *IEEE Transactions on Medical Imaging,* (2000), 19: 1188-1201.

[47] Marrelec, G., Benali, H., Ciuciu, P., Pelegrini-Issac, M., & Poline, J. B. Robust Bayesian estimation of the hemodynamic response function in event-related BOLD fMRI using basic physiological information, *Human Brain Mapping,* (2003), 15: 1-25.

[48] Gossl, C., Auer, D. P., & Fahrmeir, L. Bayesian spatiotemporal inference in functional magnetic resonance imaging, *Biometrics,* (2001), 57: 554-562.

[49] Xia, J., Liang, F., & Wang, Y. M. FMRI analysis through Bayesian variable selection with a spatial prior, in *IEEE International Symposium on Biomedical Imaging*, (2009), pp. 714-717.

[50] Penny, W. D., Trujillo-Barreto, N. J., & Friston, K. J. Bayesian fMRI time series analysis with spatial priors, *Neuroimage,* (2005), 24: 350-362.

[51] Woolrich, M., Behrens, T., & Smith, S. Constrained linear basis sets for HRF modeling using variational Bayes, *Neuroimage,* (2004), 21: 1748-1761.


[52] Harrison, L. M., Penny, W. D., Ashburner, J., Trujillo-Barreto, N., & Friston, K. J. Diffusion-based spatial priors for imaging, *Neuroimage,* (2007), 38: 677-695.

[53] Groves, A. R., Chappell, M. A., & Woolrich, M. W. Combined spatial and non-spatial prior for inference on MRI time-series, *Neuroimage,* (2009), 45: 795-809.

[54] Hartvig, N. V., & Jensen, J. L. Spatial mixture modeling of fMRI data, *Hum Brain Mapp,* (2000), 11: 233-248.

[55] Woolrich, M., Behrens, T., Beckmann, C., & Smith, S. Mixture models with adaptive spatial regularization for segmentation with an application to fMRI data, *IEEE Transactions on Medical Imaging,* (2005), 24: 1-11.

[56] Xia, J., Liang, F., & Wang, Y. M. On clustering fMRI using Potts and mixture regression models, in *IEEE Engineering in Medicine and Biology Society Conference*, (2009), pp. 4795-4798.

[57] Flandin, G., & Penny, W. D. Bayesian fMRI data analysis with sparse spatial basis function priors, *Neuroimage,* (2007), 34: 1108-1125.

[58] Ferguson, T. A Bayesian analysis of some nonparametric problems, *Annals of Statistics,* (1973), 1: 209-230.

[59] Kim, S., & Smyth, P. Hierarchical Dirichlet Processes with random effects, in *Neural Information Processing Systems*, (2006), pp. 697-704.

[60] Friston, K. J., Penny, W., Phillips, C., Kiebel, S., Hinton, G., & Ashburner, J. Classical and Bayesian inference in neuroimaging: theory, *Neuroimage,* (2002), 16: 465-483.

[61] Holmes, A., & Friston, K. Generalisability, random effects & population inference, in *Fourth International Conference on Functional Mapping of the Human Brain: Neuroimage*, (1998), pp. S754.

[62] Geisler, W. S., & Diehl, R. L. Bayesian natural selection and the evolution of perceptual systems, *Philos Trans R Soc Lond B Biol Sci,* (2002), 357: 419-448.

[63] Ma, W. J. Organizing probabilistic models of perception, *Trends Cogn Sci,* (2012), 16: 511-518.

[64] Griffiths, T. L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. Probabilistic models of cognition: exploring representations and inductive biases, *Trends Cogn Sci,* (2010), 14: 357-364.

[65] Fiser, J., Berkes, P., Orban, G., & Lengyel, M. Statistically optimal perception and learning: from behavior to neural representations, *Trends Cogn Sci,* (2010), 14: 119-130.

[66] Vilares, I., & Kording, K. Bayesian models: the structure of the world, uncertainty, behavior, and the brain, *Ann N Y Acad Sci,* (2011), 1224: 22-39.

[67] Knill, D. C., & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation, *Trends Neurosci,* (2004), 27: 712-719.

[68] Atkins, J. E., Fiser, J., & Jacobs, R. A. Experience-dependent visual cue integration based on consistencies between visual and haptic percepts, *Vision Res,* (2001), 41: 449-461.

[69] Ernst, M. O., & Banks, M. S. Humans integrate visual and haptic information in a statistically optimal fashion, *Nature,* (2002), 415: 429-433.

[70] Weiss, Y., Simoncelli, E. P., & Adelson, E. H. Motion illusions as optimal percepts, *Nat Neurosci,* (2002), 5: 598-604.

[71] Kording, K. P., & Wolpert, D. M. Bayesian integration in sensorimotor learning, *Nature,* (2004), 427: 244-247.

[72] Ma, W. J., Beck, J. M., Latham, P. E., & Pouget, A. Bayesian inference with probabilistic population codes, *Nat Neurosci,* (2006), 9: 1432-1438.

[73] Shi, L., Griffiths, T. L., Feldman, N. H., & Sanborn, A. N. Exemplar models as a mechanism for performing Bayesian inference, *Psychon Bull Rev,* (2010), 17: 443-464.

[74] Sanger, T. D. Probability density estimation for the interpretation of neural population codes, *J Neurophysiol,* (1996), 76: 2790-2793.

[75] Huys, Q. J. M., Zemel, R. S., Natarajan, R., & Dayan, P. Fast population coding, *Neural Computation,* (2007), 19: 404-441.

[76] Deneve, S. Bayesian spiking neurons I: inference, *Neural Comput,* (2008), 20: 91-117.

[77] Jazayeri, M., & Movshon, J. A. Optimal representation of sensory information by neural populations, *Nat Neurosci,* (2006), 9: 690-696.

[78] Murphy, K. P. *Dynamic Bayesian Networks: Representation, Inference and Learning*, Ph.D. Thesis, Department of Computer Science, University of California, Berkeley, (2002).

[79] Bishop, C. M. *Pattern Recognition and Machine Learning*, Springer Science + Business Media, LLC, (2006).

[80] Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. Factor graphs and the sum-product algorithm, *IEEE Transactions on Information Theory,* (2001), 47: 498-519.

[81] Peot, M. A., & Shachter, R. D. Fusion and propagation with multiple observations in belief networks, *Artificial Intelligence,* (1991), 48: 299-318.

[82] Zheng, X., & Rajapakse, J. C. Learning functional structure from fMR images, *Neuroimage,* (2006), 31: 1601-1613.

[83] Rajapakse, J. C., & Zhou, J. Learning effective brain connectivity with dynamic Bayesian networks, *Neuroimage,* (2007), 37: 749-760.


[84] Li, J., Wang, Z. J., & McKeown, M. J. A multi-subject, dynamic Bayesian networks (DBNs) framework for brain effective connectivity, in *IEEE International Conference on Acoustics, Speech and Signal Processing*, (2007), pp. I-429 – I-432.

[85] Chen, R., & Herskovits, E. H. Network analysis of mild cognitive impairment, *Neuroimage,* (2006), 29: 1252-1259.

[86] Chen, R., Resnick, S. M., Davatzikos, C., & Herskovits, E. H. Dynamic Bayesian network modeling for longitudinal brain morphometry, *Neuroimage,* (2012), 59: 2330-2338.

[87] Huang, S., Li, J., Ye, J., Fleisher, A., Chen, K., & Wu, T. Brain effective connectivity modeling for Alzheimer's disease study by sparse Bayesian network, in *The Seventeenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining*, (2011), pp. 931-939.

[88] Song, L., Kolar, M., & Xing, E. P. Time-varying dynamic Bayesian networks, in *Proceeding of the 23rd Neural Information Processing Systems*, (2009), pp. 1732-1740.

[89] Brunel, N., & Wang, X. J. Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition, *J Comput Neurosci,* (2001), 11: 63-85.

[90] Rolls, E. T., Deco, G., & Loh, M. *A Stochastic Neurodynamics Approach to the Changes in Cognition and Memory in Aging*, (2010).

[91] Kelly, K. M., Nadon, N. L., Morrison, J. H., Thibault, O., Barnes, C. A., & Blalock, E. M. The neurobiology of aging, *Epilepsy Res,* (2006), 68 (Suppl 1): S5-20.

[92] Dere, E., Easton, A., Nadel, L., & Huston, J. P. *Handbook of Episodic Memory*, Elsevier, Amsterdam, (2008).

[93] Goldman-Rakic, P. S. The physiological approach: functional architecture of working memory and disordered cognition in schizophrenia, *Biol Psychiatry,* (1999), 46: 650-661.

[94] Loh, M., Rolls, E. T., & Deco, G. Statistical fluctuations in attractor networks related to schizophrenia, *Pharmacopsychiatry,* (2007), 40: S78-S84.

[95] Sikstrom, S. Computational perspectives on neuromodulation of aging, *Acta Neurochir Suppl,* (2007), 97: 513-518.

[96] Burke, S. N., & Barnes, C. A. Neural plasticity in the ageing brain, *Nat Rev Neurosci,* (2006), 7: 30-40.

[97] Kesner, R. P., & Rolls, E. T. Role of long-term synaptic modification in short-term memory, *Hippocampus,* (2001), 11: 240-250.

[98] Schliebs, R., & Arendt, T. The significance of the cholinergic system in the brain during aging and in Alzheimer's disease, *J Neural Transm,* (2006), 113: 1625-1644.

[99] Rolls, E. T., & Deco, G. *The Noisy Brain: Stochastic Dynamics as a Principle of Brain Function*, Oxford University Press, (2012).

[100] Wang, M. Y., Zhou, C., & Xia, J. Statistical analysis for recovery of structure and function from brain images, in *Biomedical Engineering, Trends, Researches and Technologies*, Komorowska, M. A. & Olsztynska-Janus, S., Eds., (2011), pp. 169-196.

**Chapter 11**

**The Crossmodal Influence of Odor Hedonics on Facial Attractiveness: Behavioural and fMRI Measures**

Francis McGlone, Robert A. Österbauer, Luisa M. Demattè and Charles Spence

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/56504

© 2013 McGlone et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **1. Introduction**

Facial attractiveness is a highly relevant social cue, readily assessed by human observers, and it significantly affects success in both work and social environments [1, 2]. Taking a Darwinian perspective, Perrett et al. [3] have argued that the physical structure of beautiful faces, as judged by others, provides salient signals of mate value that motivate behavior in others. Several general features have been shown to contribute to the perceived attractiveness of a face, including both facial symmetry and the extent to which an individual face conforms to an average prototype [4, 5, 6]. Additionally, faces displaying various emotional expressions (e.g., joy, anger, etc.) have been used to investigate the brain regions involved in the coding of affect [7, 8, 9, 10, 11], such as the orbitofrontal cortex (OFC), the insular cortex, and the amygdala.

At both an explicit and implicit level, humans through the ages have devised means by which to enhance facial attractiveness (a multibillion dollar cosmetic industry attests to this fact). An equally lucrative fragrance industry exploits the hedonic primacy of odors in the human brain, yet it remains unclear whether the presence of odors can modulate the perceived attractiveness of faces.

A pioneering positron emission tomography (PET) study by Nakamura and colleagues [12] demonstrated that activity in left frontal brain regions correlates with perceived facial attractiveness in humans. Furthermore, functional magnetic resonance imaging (fMRI) has been used to show that the viewing of attractive female faces by male participants activates reward circuitry in the brain, in particular, the nucleus accumbens and the OFC [13, 14].
