**Linking Neural Activity to Visual Perception: Separating Sensory and Attentional Contributions**

Jackson E.T. Smith, Nicolas Y. Masse, Chang'an A. Zhan and Erik P. Cook


http://dx.doi.org/10.5772/47270

## **1. Introduction**


For each of the five basic senses, information about the external world begins as a physical representation in the brain. This representation exists in the structure of sensory neural activity, such as the flow of ions across neural membranes and the action potentials (or spikes) that neurons produce. At some point the brain achieves a transition – from tangible electrophysiology to something more. In other words, neural activity becomes a basic sensation that we are aware of and that we can name. For example, sensations like 'slow' or 'fast', 'far' or 'near', are some of the simplest features that we can assign to a visual stimulus and are some of the basic attributes that we can perceive.

But the transition from neural activity to perception is not simple and remains largely unknown. This process is not intractable, however, and enormous effort has been made by neuroscientists to solve it. In particular, much progress has been made to reveal how small fluctuations in cortical activity are correlated with perceptual behavior. We refer to this correlation as 'behavioral sensitivity'. New observations suggest that both bottom-up sensory mechanisms (such as neural noise) and top-down processes (such as attention) have a role to play in establishing behavioral sensitivity. How do we separate these two contributions?

Figure 1 illustrates the problem of untangling the link between a visual cortical neuron's activity and a subject's perceptual behavior. In the simplest model (Figure 1A), a visual cortical neuron contributes in a bottom-up manner to downstream networks that underlie perceptual behavior. In this case, a neuron is behaviorally sensitive because its activity is directly linked to the perception of the visual stimulus. In the alternative extreme (Figure 1B), a visual cortical neuron has no direct influence on the perceptual decision, but is modulated by top-down attentional signals that affect both its activity and the perceptual behavior. In both models a neuron can theoretically exhibit the same behavioral sensitivity, but in the top-down scenario it has no role in the perceptual process. As we will discuss, both bottom-up sensory and top-down attentional mechanisms can be at play, depending on the perceptual task. To distinguish among the many possible contributions to a neuron's behavioral sensitivity (Figure 1C), it is first important to understand the properties of behavioral sensitivity and how it is measured.


**Figure 1.** *Two neural mechanisms for a cortical neuron's behavioral sensitivity.* A cortical neuron exhibits behavioral sensitivity if its activity is correlated with perceptual behavior. For both mechanisms in A&B, information about the stimulus is encoded by visual cortical neurons and used to drive perceptual behavior. The arrows do not represent this flow of stimulus information; what they show are the source and destination of trial-to-trial variability. **A**, In the bottom-up mechanism, noisy sensory activity causes the variation in perceptual behavior. In this case, there is a causal link between the variability in visual cortical neurons and the variability in perceptual behavior. **B**, In the top-down mechanism, the subject's attentional state varies from trial-to-trial, causing variable perceptual behavior. However, feedback projections also allow the attentional state to affect the firing rates of visual cortical neurons. In this case, there is a non-causal link between variability in visual cortical activity and perceptual behavior. **C**, Different sources of variation that could contribute to bottom-up sensory and top-down attentional variability in cortical neurons. Note that the bottom-up and top-down hypotheses shown here are the two possible extremes. The brain may actually implement any number of hybrid models, incorporating components from both hypotheses.


## **2. Area MT and visual motion perception as a model system**


We can use the specialization of visual cortical neurons to begin understanding how they support visual perception. This is accomplished by comparing the activity of a neuron to the responses of an observer performing a perceptual task [1]. Neurons from the Middle Temporal area of visual cortex (MT, i.e. area V5 [2]) are well suited to this purpose. In addition, the methods applied to the study of MT are generally applicable to other areas of visual cortex. MT is a part of the dorsal processing stream and it receives most of its sensory input from areas V1, V2 and V3 – while it sends output chiefly to parts of the parietal cortex [3]. In each hemisphere, area MT contains a complete topographical representation of the contralateral visual hemifield, and any one neuron receives visual information from a discrete patch of visual space, known as the neuron's receptive field (RF). MT neurons are highly selective for both the direction and speed of visual motion, and produce crisp responses to preferred stimuli that fall within their RFs [4]; they are also selective for stimulus size [5] and binocular disparity [6].

Although V1, V2, and V3 neurons can also be selective for the direction and speed of visual motion, a relatively high proportion of MT neurons have an emergent Gestalt sensitivity to the motion of objects formed from separate components. For example, when shown two superimposed sine-wave gratings that move in different directions, V1 neurons mainly respond to the motion of only one component grating or the other; however, a number of MT neurons will treat the two gratings as a single object, responding to the coherent motion of both [7, 8]. Similarly, MT neurons can detect the coherent motion of separate dots – as if the dots were painted on an invisible pane of glass that was moving – even when the coherent dots are embedded within another field of dots that move randomly and incoherently [9].

Importantly, MT has been shown to take part in the perception of coherent dot motion – as lesioning MT causes a severe deficit in a subject's ability to discriminate between opposite directions of motion [10], and microstimulating MT biases a subject to report motion in the preferred direction of the stimulation site [11]. Altogether, the robust responses and selectivity of MT neurons to visual motion, plus their involvement in motion perception, make them an excellent choice for comparison against the perceptual capabilities of a subject [1].

## **3. A neuron's stimulus sensitivity**

The classic studies of Newsome and colleagues demonstrated the power of a careful comparison between neural activity and perceptual behavior [9, 12]. Experiments were performed to carefully measure the discrimination sensitivity of MT neurons from monkey subjects performing a two-alternative forced-choice (2AFC) motion-discrimination task. The subjects had to report whether the coherent motion in a patch of randomly moving dots was in the preferred or null (preferred + 180°) direction of an isolated MT neuron. It was critical to match the direction, speed, and location of dot motion to the neuron's RF preferences. This ensured that the subject was responding to the same stimulus as the neuron. More importantly, it maximized the chance that spikes recorded from the neuron were used by the subject to perform the task.


The direction of motion was randomly drawn on every trial so that the subject would have to watch the coherent dots carefully, in order to make a correct choice. However, the strength of coherent motion was also varied from trial to trial by changing the percentage of dots that moved together. This varied the difficulty of the task and therefore the subject's performance, which provided a frame of reference. The neuron's ability to discriminate the direction of coherent dot motion at any one difficulty level could be directly compared against the performance of the subject.

**Figure 2.** *Area under the receiver operating characteristic (ROC) curve.* **A**, Hypothetical example of two spike-count distributions, from trials grouped by conditions X and Y. Spike counts range between 0 and the maximum value, cmax. **B**, The curved line, located above the dashed 'chance' line, represents the ROC curve that is constructed from the distributions in Panel A by classifying their values with the ideal observer (see Appendix). Classification performance is tested for every possible value of the classification criterion, c, which includes all possible spike counts between 0 and cmax. Thus, each value of c corresponds to a point in the ROC curve; the arrow shows how increasing values of c are mapped. The grey region is the area under the ROC curve. **C**, Behavioral sensitivity (or stimulus sensitivity) is defined as the area under the ROC curve that compares a distribution of failed-trial (or noise) spike counts (grey) versus a distribution of correct-trial (or signal) spike counts (open). The area under the ROC curve quantifies the difference between the two distributions.
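
The construction sketched in Figure 2 can be stated compactly. The following is a standard signal-detection formulation of the ideal observer's criterion sweep, written in the caption's notation (X, Y, and criterion c); the chapter's Appendix gives the full treatment.

```latex
% For a criterion c, the ideal observer reports "Y" whenever the spike
% count exceeds c. Sweeping c over all possible counts traces the curve:
\[
  \mathrm{HR}(c) = P(\,y > c\,), \qquad
  \mathrm{FA}(c) = P(\,x > c\,), \qquad
  \mathrm{ROC} = \bigl\{(\mathrm{FA}(c),\, \mathrm{HR}(c)) : 0 \le c \le c_{\max}\bigr\}.
\]
% The area under this curve equals the probability that a randomly drawn
% count y from condition Y exceeds a randomly drawn count x from
% condition X, with ties counted as half:
\[
  \mathrm{AUC} \;=\; P(y > x) \;+\; \tfrac{1}{2}\,P(y = x).
\]
```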

A receiver operating characteristic (ROC) analysis (Figure 2) was used to quantify the discrimination sensitivity of MT neurons in the 2AFC task (see Appendix). For this, two distributions of spike counts were compared against each other, the distribution of counts from trials when the coherent motion was in the neuron's preferred direction (distribution **Y** in Figure 2A) versus the distribution of counts from trials with coherent motion in the null direction (distribution **X** in Figure 2A). The resulting ROC areas (Figure 2B) described the probability that an ideal observer could tell which direction had been presented to the subject, based on the distribution of MT spike counts. This was computed separately for each level of coherent motion strength and compared directly against the subject's performance. It was found that the average MT neuron could account for the subject's discrimination sensitivity – at least under the particular conditions of the experiment [see 13].
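
To make this concrete, the sketch below computes a neurometric value (ROC area) at each coherence level from per-trial spike counts. It is a minimal illustration, not the original studies' analysis code; the Poisson-simulated counts, trial numbers, and variable names are our own assumptions.

```python
import numpy as np

def roc_area(x_counts, y_counts):
    """Area under the ROC curve comparing condition X (null direction)
    against condition Y (preferred direction): the probability that a
    random Y count exceeds a random X count, with ties counted as half."""
    x = np.asarray(x_counts)[:, None]   # X counts as a column vector
    y = np.asarray(y_counts)[None, :]   # Y counts as a row vector
    return float(np.mean(y > x) + 0.5 * np.mean(y == x))

# Illustrative (simulated) spike counts at each motion-coherence level.
rng = np.random.default_rng(0)
for coherence in [3.2, 6.4, 12.8, 25.6, 51.2]:                # percent coherence
    pref = rng.poisson(20 + 0.4 * coherence, size=60)          # preferred-direction trials
    null = rng.poisson(20 - 0.2 * coherence, size=60)          # null-direction trials
    print(f"{coherence:5.1f}% coherence: neurometric ROC area = {roc_area(null, pref):.2f}")
```

Each printed value could then be plotted alongside the subject's percent-correct performance at the same coherence, which is the neurometric-psychometric comparison described above.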


The discrimination sensitivity of MT neurons in a 2AFC task is mirrored in experiments where the subject performs a slightly different task: motion detection. In such a task, the subject reports a change in the coherence of dot motion. The sensitivity of an MT neuron is judged by how different its firing rate is before and after the motion stimulus changes. Figure 3A shows an example motion detection task in which a monkey monitors a cloud of randomly moving dots (grey). At the start of each trial, all of the dots move independently (random dot motion) for a random duration. The subject was trained to release a lever in response to a brief (50 ms) period of coherent dot motion (motion pulse). Random dot motion resumed following the motion pulse. Trials were considered a failure if the subject did not release the lever following the coherent motion pulse. Importantly, the dots were located in the RF (dashed circle) of the MT neuron under study, and the coherent motion was always in the neuron's preferred direction and speed. Again, this maximized the chance that the recorded spikes contributed to the subject's performance.

The response of an example neuron – recorded from a monkey performing the motion detection task – is shown in Figure 3B. Each spike is represented as a black tick mark, and each row of ticks is the neuron's response on one trial. Trials are sorted vertically by whether the subject was successful (correct trials, white background) or not (failed trials, grey background). In addition, correct trials are sorted by the duration between the motion pulse and the subject's response time. All tick marks are aligned to the start of the 50 ms motion pulse (dashed line). Before the motion pulse, the neuron produced a baseline number of spikes in response to the random dot motion. Following the start of the motion pulse, however, the neuron responded with a vigorous burst of spikes. The stark contrast of the neuron's responses shows that it was very sensitive to the motion pulse.

The stimulus sensitivity of this example neuron is quantified using the ROC metric (Figure 2), similar to the one used by Newsome and colleagues. First, the spikes are counted on each trial in two analysis windows (black bars); one spans the 100 ms *before* the motion pulse (b), counting spikes produced in response to the random dot motion; the other spans the 100 ms *after* the burst of spikes began, counting spikes fired in response to the motion pulse (a). The distributions of spike counts from both windows are shown in Figure 3C. The neuron's sensitivity to the motion pulse is found by comparing the distribution of spikes after the motion pulse (Figure 3C, open bars) versus the distribution of spikes before the motion pulse (Figure 3C, grey bars) using an ROC curve (refer to Figure 2). For this measure of stimulus sensitivity, a value of 0.5 would indicate no difference between spike counts before and after the motion pulse, showing that a neuron's response had no information about the visual stimulus. In comparison, values of stimulus sensitivity approaching 0 indicate more spikes before the motion pulse, while values approaching 1 indicate more spikes after the motion pulse. As expected, the stimulus sensitivity of this neuron is very high (0.92), meaning that this neuron conveyed reliable information about the occurrence of the motion pulse.
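
The window-based calculation can also be written out as the explicit criterion sweep of Figure 2B. The sketch below is a minimal version, assuming simulated Poisson counts for the two 100 ms windows; it is not the authors' code, and the firing rates are illustrative.

```python
import numpy as np

def stimulus_sensitivity(noise_counts, signal_counts):
    """Build the ROC curve by sweeping the criterion c over every possible
    spike count (as in Figure 2B), then integrate the area under it."""
    noise = np.asarray(noise_counts)
    signal = np.asarray(signal_counts)
    # Sweep c downward from the largest observed count to -1, so the
    # curve runs from (0, 0) to (1, 1) in (false-alarm, hit) coordinates.
    criteria = np.arange(max(noise.max(), signal.max()), -2, -1)
    fa = np.array([(noise > c).mean() for c in criteria])     # false-alarm rates
    hit = np.array([(signal > c).mean() for c in criteria])   # hit rates
    # Trapezoid-rule integration of the area under the ROC curve.
    return float(np.sum((fa[1:] - fa[:-1]) * (hit[1:] + hit[:-1]) / 2))

# Illustrative (simulated) counts for the two 100-ms windows of Figure 3B.
rng = np.random.default_rng(1)
before = rng.poisson(4, size=150)    # window 'b': random dot motion
after = rng.poisson(11, size=150)    # window 'a': response to the pulse
print(f"stimulus sensitivity = {stimulus_sensitivity(before, after):.2f}")
```

This criterion-sweep form and the pair-counting form used earlier give the same area; the sweep is shown here because it maps directly onto the geometry of Figure 2B.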


**Figure 3.** *Stimulus and behavioral sensitivity of an example MT neuron.* **A**, The perceptual task. A monkey directed its gaze to a fixation point (cross) and monitored a patch (grey) of randomly moving dots overlapping the neuron's receptive field (dashed circle). At a random time, the dots moved coherently for 50 ms (motion pulse) in the neuron's preferred direction and speed before reverting back to random dot motion. The trial was a success (correct trial) if the subject released a lever after the motion pulse. The trial was a failure if the subject did not respond. **B**, Raster of spike responses recorded electrophysiologically from an example MT neuron, while the animal subject performed the task in A. The analysis windows (a and b) were used to obtain the spike counts in C&D. **C**, The distribution of spike counts from before (window 'b' in panel B, grey) and after (window 'a' in panel B, open) the motion pulse used to obtain the ROC score that quantified the neuron's stimulus sensitivity. **D**, The distribution of spike counts from failed trials (grey) and correct trials (open) after the motion pulse (window 'a' in panel B) used to obtain the ROC score that quantified the neuron's behavioral sensitivity. Note that this neuron is exemplary and that few visual cortical neurons exhibit this level of behavioral sensitivity.

Whether the subject is detecting or discriminating motion, ROC analysis can be used to quantify the sensitivity of neurons to the stimulus; thus, both are referred to here as 'stimulus sensitivity'. While lesion, microstimulation, and stimulus sensitivity studies show that MT is involved in motion perception and can account for its capabilities – they do not explain how MT activity *becomes* the perception of motion. This requires the estimation of the neural link to perceptual behavior, referred to as 'behavioral sensitivity'.

## **4. A neuron's behavioral sensitivity**


The classic studies of Newsome and colleagues highlighted the large variation in the choices made by subjects and in the number of spikes fired by MT neurons. In response to statistically identical stimuli, subjects would sometimes report the wrong direction and their neurons would sometimes fire as if the opposite direction had been shown. However, this variation presented an exciting opportunity – because the ROC curve is a versatile tool and can be used to compare any two distributions of neural activity. Celebrini and Newsome [14] performed a ground-breaking analysis: they measured the correlation between the number of spikes fired by a neuron and the choice that the subject was about to make.

They began by grouping trials based on the 'preferred' or 'null' motion discrimination report made by the subject. Then they computed the ROC curve comparing the distribution of null-trial spike counts (distribution **X** in Figure 2A) versus the distribution of preferred-trial spike counts (distribution **Y** in Figure 2A). The area under this ROC curve is the probability that the ideal observer could correctly predict which direction of motion the subject would choose, using spike counts. This kind of ROC metric was named 'choice probability' when it was later used to analyze MT neurons [15]. However, we will refer to this ROC metric, and others like it, as 'behavioral sensitivity', because it measures how much the neural response predicts perceptual behavior. It is important to keep in mind that behavioral sensitivity does not measure the correlation between spike counts and perception itself – only the perceptual report, which may not always be faithful to what was actually perceived.

Similar to stimulus sensitivity, a behavioral sensitivity of 0.5 shows that there was no difference in the number of spikes fired prior to either choice (Figure 2C, left). If more spikes were fired prior to choices coinciding with the neuron's preferred direction, then behavioral sensitivity would rise towards 1, to indicate a positive correlation (Figure 2C, middle and right). On the other hand, if more spikes were fired prior to null direction choices, then behavioral sensitivity would sink towards 0, to indicate a negative correlation. On average, MT neurons had a weak but significant positive correlation with the subject's upcoming choice of motion direction [15]. Since then, behavioral sensitivities have been observed between MT spike counts and the subject's upcoming choice when discriminating coherent dot motion direction [16-18], speed [19, 20], disparity [21, 22], and cylindrical rotation [23-25]. Similar behavioral sensitivities have been observed between a subject's discrimination performance and spike counts from cortical areas V2 [26, 27] and MST [14, 28, 29].

When subjects are tested on their ability to detect a change in coherent dot motion (Figure 3A), the ROC curve is made by comparing the distribution of spike counts from failed trials (distribution **X** in Figure 2A) versus the distribution of spike counts from successful trials (distribution **Y** in Figure 2A). The area under this curve is the probability that the ideal observer can correctly predict the subject's detection performance, and so it was called 'detect probability' [13]. Again, we shall refer to this metric as behavioral sensitivity. A behavioral sensitivity above 0.5 indicates that the neuron fired more prior to successful detections – while behavioral sensitivity below 0.5 indicates that the neuron fired more prior to failures. Using behavioral sensitivity, correlations have been observed between MT spike counts and the subject's ability to detect a change in coherent motion strength [13, 30] and speed [31], while similar behavioral sensitivities have been observed between a subject's detection performance and spike counts from cortical areas V1 [32], V4 [33, 34], and VIP [13].
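
The same ROC machinery applies here; only the trial grouping changes, from stimulus condition to behavioral outcome. A brief sketch, again with illustrative simulated counts (the rates and trial numbers are assumptions, not recorded data):

```python
import numpy as np

def roc_area(x_counts, y_counts):
    """P(y > x) + 0.5 * P(y = x) for random draws x from X and y from Y."""
    x = np.asarray(x_counts)[:, None]
    y = np.asarray(y_counts)[None, :]
    return float(np.mean(y > x) + 0.5 * np.mean(y == x))

# Illustrative counts from window 'a' (after the motion pulse), grouped
# by the subject's outcome rather than by the stimulus.
rng = np.random.default_rng(2)
failed = rng.poisson(8, size=30)     # X: failed trials (lever not released)
correct = rng.poisson(12, size=90)   # Y: correct trials (lever released)
print(f"behavioral sensitivity (detect probability) = {roc_area(failed, correct):.2f}")
```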

An example of how to compute behavioral sensitivity is shown for the same example MT neuron and task as before (Figure 3). This time, spike counts are only taken from the analysis window after the motion pulse (Figure 3B, bar 'a'), but they are grouped by whether the trial was correct or failed. The distributions of spike counts on correct (open, **Y** in Figure 2A) and failed (grey, **X** in Figure 2A) trials are shown in Figure 3D. As this neuron was likely to fire more spikes on correct trials, its behavioral sensitivity was very high (0.88); thus, one could reliably predict the animal's behavior from the neural responses.

| Study | Task | Cortical Area | Behavioral Sensitivity |
| --- | --- | --- | --- |
| Gu et al. 2008 [28] | Heading discrimination | MST | 0.52 |
| Liu and Newsome 2005 [19] | Speed discrimination | MT | 0.52 |
| Sasaki and Uka 2009 [22] | Direction discrimination | MT | 0.53 |
| Cohen and Maunsell 2010 [33] | Orientation detection | V4 | 0.53 |
| Cohen and Newsome 2009 [17] | Direction discrimination | MT | 0.54 |
| Britten et al. 1996 [15] | Direction discrimination | MT | 0.55 |
| Law and Gold 2008 [18] | Direction discrimination | MT | 0.55 |
| Price and Born 2010 [20] | Speed discrimination | MT & MST | 0.55 |
| Purushothaman & Bradley 2005 [16] | Direction discrimination | MT | 0.55 |
| Nienborg and Cumming 2006 [26] | Disparity discrimination | V2 | 0.56 |
| Sasaki and Uka 2009 [22] | Disparity discrimination | MT | 0.57 |
| Bosking and Maunsell 2011 [30] | Coherence detection | MT | 0.58 |
| Smith, Zhan, and Cook 2011 [35] | Coherence detection | MT | 0.58 |
| Celebrini and Newsome 1994 [14] | Direction discrimination | MST | 0.59 |
| Uka and DeAngelis 2004 [21] | Disparity discrimination | MT | 0.59 |
| Herrington and Assad 2009 [31] | Speed detection | MT | 0.59 |
| Cook and Maunsell 2002 [13] | Coherence detection | MT | 0.60 |
| Palmer and Cheng 2007 [32] | Gabor detection | V1 | 0.61 |
| Herrington and Assad 2009 [31] | Speed detection | LIP | 0.63 |
| Dodd et al. 2001 [23] | 3D rotation discrimination | MT | 0.67 |
| Cook and Maunsell 2002 [13] | Coherence detection | VIP | 0.70 |

**Table 1.** Average behavioral sensitivity across different studies.

## **5. Properties of behavioral sensitivity**

The example neuron's behavioral sensitivity (shown in Figure 3) is unusually strong. In general, the average sensitivity of visual neurons to perceptual behavior is much weaker. Table 1 lists the population averages over a number of studies; for most, the average was under 0.6. Nevertheless, all averages were significantly greater than chance (0.5).

A typical behavioral sensitivity distribution for an example population of MT neurons is shown in Figure 4A. These neurons were recorded from two experiments, while monkeys performed either a motion detection [35] or a speed detection task [31]. These two experiments were combined because they were both detection tasks that used short, transient stimuli (~50 ms), as illustrated in Figure 3A. The mean behavioral sensitivity was weak, but significantly greater than 0.5 (mean = 0.54, two-sided t-test, p < 0.01). Therefore, behavioral sensitivity in visual neurons is a robust result, even though most neurons are only weakly sensitive to the subject's upcoming behavior.
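
A population-level test like the one reported above (mean vs. chance at 0.5) could be run as in the sketch below. The per-neuron values here are simulated stand-ins, not the published data from Figure 4A.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Simulated stand-ins for per-neuron behavioral sensitivities.
rng = np.random.default_rng(3)
behavioral = rng.normal(loc=0.54, scale=0.06, size=80)

# Two-sided, one-sample t-test of the population mean against chance (0.5).
res = ttest_1samp(behavioral, popmean=0.5)
print(f"mean = {behavioral.mean():.2f}, t = {res.statistic:.2f}, p = {res.pvalue:.3g}")
```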

A second key observation is that neurons with high stimulus sensitivities are also highly sensitive to the subject's perceptual behavior. A tempting interpretation is that the brain can determine which neurons are best able to support the subject's performance on a task, then assign them a special role in guiding the subject's behavior. To illustrate this relationship, we plot the distribution of stimulus sensitivities for the same example population of MT neurons in Figure 4B (mean stimulus sensitivity = 0.58, two-sided t-test, p < 0.01). To compare stimulus and behavioral sensitivities, we first normalized each metric within each animal subject using a Z-score, so that differences in the means across subjects could not affect our analysis. The results of plotting the normalized stimulus sensitivity versus the normalized behavioral sensitivity are shown in Figure 4C and illustrate a significant correlation (Spearman's coefficient, R = 0.50, p < 0.01). The correlation between stimulus and behavioral sensitivity is an important property of visual neurons that is often observed [13-16, 18, 20, 21, 24, 28, 30, 35-38].
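
The normalize-then-correlate step could look like the following sketch: each metric is z-scored within each monkey before pooling, so across-subject differences in the mean cannot masquerade as a correlation. The subject labels, values, and correlation strength are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr, zscore

# Hypothetical per-neuron (stimulus, behavioral) sensitivities, with a
# subject label for each neuron; the values are simulated.
rng = np.random.default_rng(4)
subjects = np.repeat(["m1", "m2", "m3", "m4"], 20)
stim = rng.normal(0.58, 0.06, size=80)
behav = 0.5 * stim + rng.normal(0.25, 0.03, size=80)

# Z-score each metric within each subject before pooling across animals.
stim_z = np.empty_like(stim)
behav_z = np.empty_like(behav)
for s in np.unique(subjects):
    idx = subjects == s
    stim_z[idx] = zscore(stim[idx])
    behav_z[idx] = zscore(behav[idx])

rho, p = spearmanr(stim_z, behav_z)
print(f"Spearman R = {rho:.2f}, p = {p:.3g}")
```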





**Figure 4.** *Stimulus and behavioral sensitivity are correlated.* **A**, The distribution of behavioral sensitivity for an example population of MT neurons, recorded during a motion detection task similar to that in Figure 3A. **B**, The distribution of stimulus sensitivity for the same population of neurons. In A&B, the height of each histogram bin shows the relative proportion of neurons with a sensitivity that falls within the bin's range. **C**, The relationship between stimulus and behavioral sensitivity for the same population of neurons. Behavioral and stimulus sensitivity were normalized for each monkey so as not to introduce any spurious correlations when the data was combined. The Spearman's correlation coefficient is R = 0.50 (p < 0.01). The best-fit, linear regression line is *y = 0.46x + 0*, where *x* is normalized stimulus sensitivity and *y* is normalized behavioral sensitivity. Data are combined from two experiments using four monkeys [31, 35].

The relationship of stimulus and behavioral sensitivity has, itself, two interesting properties. First, stimulus-sensitive neurons seem to become behaviorally sensitive once the subject is well trained to perform the perceptual task [18, 37]; in fact, behavioral sensitivity only appears in neurons that can support the subject's task strategy [21, 39]. Second, the correlation between stimulus and behavioral sensitivity tightens when attention is directed to the neuron's RF (Nicolas Masse, unpublished observation). Altogether, these observations suggest that the most informative neurons are recruited by the brain to drive perceptual behavior.

There is evidence that the recruitment of informative neurons is a dynamic process, and adjusts to changing task demands. For instance, when two different types of motion-disparity stimuli are presented under different task conditions, the same MT neuron may show signs of involvement in the subject's perception of one stimulus, but not another [25]. If the same task is used but the stimulus is presented through different modalities, only the MST neurons that respond in a similar manner to both modalities have strong behavioral sensitivities [28]. When the type of stimulus is consistent, behavioral sensitivity for the same MT neuron can vary as the direction of motion is dialed closer to or further from its preferred direction, in both motion detection [30] and discrimination [16] tasks; a similar result holds for MST neurons during heading discrimination [29]. Lastly, when the type of stimulus is consistent but the subject performs two different perceptual tasks, behavioral sensitivity of the same MT neuron may be selective for the behavior on one task, but not the other [22]. These results suggest that the brain is constantly attempting to optimize the pool of visual neurons that it uses to drive perceptual behavior.


Several further properties of behavioral sensitivity are apparent in Table 1. Two of the studies examined neurons from different cortical areas using the same perceptual task ([13] coherence detection, MT & VIP; [31] speed detection, MT & LIP). Each study found that behavioral sensitivity was stronger in the areas further along the hierarchical processing stream (LIP and VIP). Similarly, a disparity discrimination study found behavioral sensitivity in V2 neurons but not in V1 [26]. Extensive investigation of somatosensory cortex has also shown that behavioral sensitivity grows along the hierarchical stream [40, 41]. These studies suggest that the closer a neuron is to downstream decision centers, the better its behavioral sensitivity becomes.

However, even lower-level visual areas can demonstrate relatively strong behavioral sensitivity. V1 neurons had no sensitivity in a disparity-discrimination task [26], but V1 is not thought to be directly involved in disparity perception [42]. On the other hand, V1 neurons are much better suited to supporting the perception of simpler stimuli, such as Gabor patches; accordingly, behavioral sensitivity emerges in V1 neurons when the subject detects Gabors [32]. Similarly, MT neurons are thought to participate in the perception of both motion [10, 11] and disparity [43]. When the subject discriminates a stimulus that requires the perception of both factors, MT neurons become more sensitive to the upcoming behavior [23]. Thus, the behavioral sensitivity of a neuron seems to reflect the extent to which it can support the subject's perception of a given stimulus.

A critical observation is that the strength of behavioral sensitivity is strongly contextual [reviewed by 44]. When subjects are presented with ambiguous stimuli, MT neurons maintain sensitivity to the subject's upcoming behavior, in 2AFC motion direction [15] and cylindrical rotation [23] discrimination tasks. Ambiguous stimuli carry no meaningful signal; that is, both directions of motion or rotation are equally well represented. In this case, the subject can make no correct choice based on the stimulus and is forced to guess. Although the ambiguous motion direction stimuli are not altogether different from the ambiguous rotation stimuli, MT neurons have much stronger behavioral sensitivities when the subject attempts to discriminate the latter. The main factor accounting for this is that the subject is looking for two-dimensional, linear motion in the first case – and three-dimensional, cylindrical rotation in the second.

A second effect of context is that behavioral sensitivity can strengthen over time within a trial, following the onset of the stimulus [13, 21, 23, 27, 28, 30], but this is not always the case [15]. An important point to note is that behavioral sensitivity may rise even while the stimulus parameters remain constant. However, some results suggest that the duration of behavioral sensitivity is confined to the time in each trial when the neuron receives useful stimulus information [15, 21, 35].

One last contextual effect was demonstrated in a recent study of V2 neurons [27]; the subject's motivation to perform a perceptual task was varied by changing the expected reward size. Smaller rewards, and therefore less motivation, were accompanied by a decrease in behavioral sensitivity.


A final property of behavioral sensitivity is that it persists when stimulus variation is removed. In both a motion direction discrimination [15] and motion detection [13] task, stimulus variation was removed by repeating the same random dot sequence on multiple trials. In conjunction with contextual effects, these results strongly suggest that behavioral sensitivity comes from mechanisms internal to the brain.

## **6. Bottom-up sensory versus top-down attentional contributions to behavioral sensitivity**

The sensitivity of visual cortical neurons to the subject's impending perceptual behavior is a robust result. Ironically, this fundamental observation has generated some controversy. The trouble is to explain which neural mechanisms produce behavioral sensitivity. Recently, two competing theories have emerged.

The first is the bottom-up hypothesis (Figure 1A). Formulated by Newsome and colleagues [15, 36], it was built upon a foundation of earlier results from psychophysics and neurophysiology, suggesting that the value of a perceived stimulus feature is coded by the collective firing rates from a population of sensory neurons [reviewed by 1]. A population is required when the stimulus responses of individual neurons are noisy; but, by averaging together the noisy responses of a population, an accurate representation of the stimulus can be obtained and used to drive behavior. Bottom-up sources of variation (Figure 1C) may include noise in the stimulus, noisy output from earlier stages of visual processing (e.g. retina, LGN), random eye movements [45], stochastic voltage channels [46], and autogenous noise from local networks [47, 48]. The central idea is that variability in perceptual performance comes directly from the noisy cortical representations of the stimulus.

In the context of the 2AFC direction discrimination task discussed above [12], Shadlen et al. [36] built a bottom-up model using two pools of noisy MT neurons; the first pool preferred motion in one direction, and the second pool preferred motion in the opposite direction. The responses of all neurons in a pool were summed together, and the two pooled signals were used by the model to assess the direction of motion: it chose the preferred direction of whichever pooled response was stronger. Because a direct, causal connection was established between the spike counts of each MT neuron and the model's choice of direction, the model neurons had behavioral sensitivities above 0.5.
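A minimal sketch of this pooling scheme is given below, using an ambiguous (zero-signal) stimulus so that the model's choices are driven entirely by neural noise. The pool sizes, firing rates, and trial counts are illustrative assumptions, not values taken from [36].

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_pool, n_trials = 50, 4000

# Ambiguous stimulus: both pools have the same mean rate, so any asymmetry
# on a given trial comes from noise alone.
pool_pref = rng.poisson(20, size=(n_trials, n_per_pool))
pool_null = rng.poisson(20, size=(n_trials, n_per_pool))

# The model chooses the direction whose summed (pooled) response is larger.
chose_pref = pool_pref.sum(axis=1) > pool_null.sum(axis=1)

def roc_area(x, y):
    # P(sample from y > sample from x), counting ties as one half.
    x, y = np.asarray(x), np.asarray(y)
    return ((y[:, None] > x[None, :]).mean()
            + 0.5 * (y[:, None] == x[None, :]).mean())

# Behavioral sensitivity of one model neuron: its spike counts on trials
# where the model chose its preferred direction vs. the opposite choice.
neuron = pool_pref[:, 0]
cp = roc_area(neuron[~chose_pref], neuron[chose_pref])
print(f"behavioral sensitivity ~ {cp:.3f}")   # typically slightly above 0.5
```

Because each neuron's counts enter the pooled signal that drives the choice, even this purely feed-forward readout yields sensitivities a little above 0.5, diluted by the size of the pool.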

For purely bottom-up models, however, the impact of a sensory neuron on perceptual behavior should never change – nor should its behavioral sensitivity. And yet, behavioral sensitivity changes contextually depending on what the subject is looking for, how long the subject has viewed the stimulus, and the subject's motivation level. A bottom-up model cannot fully account for these observations, suggesting that high-order processes are involved in the behavioral sensitivity of neurons. The alternative top-down hypothesis (Figure 1B) was formulated to explain these results – in which signals are dispatched to sensory cortex from high-order areas of the brain.


In this model, the subject's attentional state varies from trial-to-trial, resulting in variable perceptual performance [see discussion of 15, 27, see review 44]; for example (Figure 1C), changes in spatial attention, prior expectation, motivation, or simply alertness can all affect the subject's chance of success. If the same processes alter the firing rates of sensory neurons, then sensory spike counts would have a non-causal correlation with the subject's performance. In other words, visual cortical neurons could exhibit behavioral sensitivity without actually affecting the perceptual behavior. Attention is a good example of a process that varies from trial to trial and affects both firing rates and perceptual performance [33, 34].

It is important to note that the bottom-up and top-down models, as described, are the two possible extremes at each end of a spectrum. For simplicity's sake, they have been discussed separately. But the brain could implement a hybrid model [e.g. 37].

## **7. Evidence of bottom-up and top-down sources of behavioral sensitivity**

Despite almost twenty years of study, there is no clear proof that one mechanism for behavioral sensitivity is entirely correct. The bottom-up hypothesis is attractive because it ties together a number of observations parsimoniously; neurons are weakly sensitive to the subject's behavior when they are responding to an informative stimulus, when they are able to support perception of the stimulus features, and when they are able to support the subject's task strategy. Furthermore, a neuron's behavioral sensitivity scales with its sensitivity to the stimulus (Figure 4C). These properties are accounted for if the brain pools the output from a select set of informative neurons to form perceptual decisions, while ignoring output from uninformative neurons. The action of pooling offers a reason that the average behavioral sensitivity is weak: because the impact on behavior of any one neuron is diluted in the population response. However, the size of a neural pool required to form perceptions is unknown. There is some evidence that perceptual decisions are formed using only a few, highly informative neurons [38]. But the complication of correlated activity between neurons [49, 50] may require neural populations on a scale of hundreds [36].

A further attraction of the bottom-up hypothesis is that computational models with a bottom-up structure are able to emulate the subject's behavior, neural responses, and neural behavioral sensitivity for a variety of tasks [13, 35-37]. Unfortunately, bottom-up models have difficulty explaining other properties of behavioral sensitivity without resorting to complex mechanisms. Although it would seem sensible that a subject should guide visual judgments using the most informative visual cortical neurons, it is uncertain how a purely bottom-up mechanism could recruit them. A simple way would be to change the synaptic weighting downstream of sensory areas, strengthening the connections from informative neurons and weakening the connections from uninformative neurons. This technique is able to alter the mean behavioral sensitivity of a simulated neural population [37]. Unexpectedly, however, synaptic weight changes alone cannot account for the correlation between a neuron's stimulus and behavioral sensitivities (Figure 4C).
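The sketch below illustrates the reweighting idea in the simplest possible setting: a small pool of independent model neurons whose readout weights are set proportional to hypothetical stimulus sensitivities. In this toy case the weighting does propagate into behavioral sensitivity; the complication, discussed next, is that this effect shrinks as pools grow and correlated noise takes over. All values here are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_neurons, n_trials = 60, 3000

d_prime = rng.uniform(0.2, 1.0, n_neurons)          # stand-in stimulus sensitivities
resp = rng.standard_normal((n_trials, n_neurons))   # noise responses, ambiguous stimulus

def roc_area(x, y):
    return ((y[:, None] > x[None, :]).mean()
            + 0.5 * (y[:, None] == x[None, :]).mean())

# Downstream readout: each neuron's connection strength scales with its
# stimulus sensitivity; the choice is the sign of the weighted sum.
choice = resp @ d_prime > 0
cp = np.array([roc_area(resp[~choice, i], resp[choice, i])
               for i in range(n_neurons)])

# Heavily weighted (more informative) neurons covary more with the choice,
# so their behavioral sensitivity is higher.
print(spearmanr(d_prime, cp))
```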


Correlations between the spiking activity of neurons can be the determining factor of behavioral sensitivity, especially for large pools of neurons [36]. Weak, inter-neural correlations have been observed throughout the brain [29, 49, 51-54] – with diverse implications [reviewed by 50]. Importantly, the level of correlation between neurons can inflate their behavioral sensitivities in two ways: in a bottom-up model, it reduces the independence of sensory neurons and tightens the covariance of any single one with the pooled response [36]; it can also cause neurons with no impact on perception to mimic other neurons that directly support perception [17, 55]. These effects of correlation on behavioral sensitivity increase for larger pools of neurons, while the effect of synaptic weighting decreases [36]. Thus, one cannot model the relationship between a neuron's stimulus and behavioral sensitivity (Figure 4C) with synaptic weight changes alone. However, if the correlation between two neurons is scaled by the similarity of their RF tuning and by the similarity of their stimulus sensitivity, then selectively weighting the more informative neurons can reproduce the observed relationship between stimulus and behavioral sensitivity [37]. Although an interplay between stimulus sensitivity, behavioral sensitivity, and the correlation between neurons is predicted, very little empirical verification has been published to date [24].
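The inflation of behavioral sensitivity by inter-neural correlation can be seen in a small simulation. The sketch below builds pools with a fixed pairwise correlation via a shared noise factor; the pool size and correlation values are illustrative assumptions, not fits to data.

```python
import numpy as np

rng = np.random.default_rng(3)

def correlated_pool(n_neurons, n_trials, rho, rng):
    # Equal pairwise correlation rho from one shared factor plus private noise.
    shared = rng.standard_normal((n_trials, 1))
    private = rng.standard_normal((n_trials, n_neurons))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

def roc_area(x, y):
    return ((y[:, None] > x[None, :]).mean()
            + 0.5 * (y[:, None] == x[None, :]).mean())

for rho in (0.0, 0.2):
    pool_a = correlated_pool(100, 4000, rho, rng)
    pool_b = correlated_pool(100, 4000, rho, rng)
    choice = pool_a.sum(axis=1) > pool_b.sum(axis=1)   # pooled readout
    neuron = pool_a[:, 0]                              # one neuron from pool A
    cp = roc_area(neuron[~choice], neuron[choice])
    print(f"rho = {rho:.1f}: behavioral sensitivity ~ {cp:.3f}")
```

With correlated noise, each neuron tracks the pooled signal more closely, so its measured sensitivity rises even though its individual influence on the choice is unchanged.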

Bottom-up models also fail to explain the contextual variation of behavioral sensitivity. If top-down signals are able to selectively modulate the activity of targeted sensory neurons, then the top-down hypothesis is better placed to explain contextual variation, dynamic changes in behavioral sensitivity from trial to trial, and the relationship between stimulus and behavioral sensitivity. Attention [56] can modulate the activity of select neurons as well as affect the subject's behavior; thus it is a good candidate for the source of top-down behavioral sensitivity. Trial-to-trial fluctuations in attention could equally well have been the source of behavioral sensitivity in studies that otherwise supported a bottom-up model [15, 21, 35]. Furthermore, two recent multielectrode studies were able to estimate the level of attention on each trial from the collective responses of many simultaneously recorded neurons [33, 34]; it was found that fluctuations in attention could account for the behavioral sensitivity of V4 neurons. Thus, a top-down mechanism is better able to account for some properties of behavioral sensitivity than a bottom-up mechanism.

## **8. Local field potentials and top-down mechanisms of behavioral sensitivity**

Whether behavioral sensitivity is due to top-down or bottom-up mechanisms depends entirely upon whether or not cortical neurons receive top-down signals in the first place; specifically, attentional signals that predict the subject's upcoming performance. But how can such signals be measured and separated from bottom-up visual inputs? When microelectrodes are used to record a voltage trace from the brain, they deliver far more information than just neural spiking activity. The short, discrete, high-frequency waveforms of spikes ride on top of lower-frequency changes in voltage.


Cortical local field potentials (LFPs) are defined as the 1-200 Hz frequency bandwidth of the electrode voltage trace, and they result from the collective activity of neurons within approximately 250 μm of the electrode tip [57, 58] – roughly the same spatial scale as a cortical column [59]. Because cortical tissue has no band-passing effect on LFPs [60] and because their measurement is robust to the choice of microelectrode [61], LFPs are a clean way of taking neural activity measurements that are comparable across studies. Neurons generate LFPs in many ways [see review 62], including sub-threshold membrane activity and local oscillatory interactions between excitatory and inhibitory neurons; but an important source of LFP fluctuations is synaptic potentials in the dendrites: in other words, LFPs indirectly measure the input received by the local population of neurons.

Different sources of LFP fluctuations that operate at different time scales can be partially isolated from each other by analyzing LFPs in the frequency domain [63]. The stimulus responses of visual cortical LFPs from 3 to 90 Hz have been found to resemble the behavior of spiking responses from earlier processing stages [64-66], while higher-frequency LFPs (>80 Hz) resemble the spiking responses of local neurons [67-71] – suggesting that LFPs below 80 Hz reflect synaptic activity. Upon closer inspection, it is found that LFPs from 1 to 12 Hz carry stimulus information that is independent of the stimulus information in LFPs above 40 Hz [72, 73]; but the band from 12 to 40 Hz does not carry stimulus information at all, raising the question of what its function is.
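In practice, such band-limited components can be isolated by estimating the LFP power spectrum and integrating it over each band. A minimal sketch, assuming a 1 kHz sampling rate and a white-noise trace standing in for a recorded LFP:

```python
import numpy as np
from scipy.signal import welch

fs = 1000.0                                   # assumed sampling rate (Hz)
rng = np.random.default_rng(4)
lfp = rng.standard_normal(int(2 * fs))        # stand-in for a 2 s LFP trace

freqs, psd = welch(lfp, fs=fs, nperseg=512)   # power spectral density estimate

def band_power(freqs, psd, lo, hi):
    # Integrate the spectrum over one frequency band.
    band = (freqs >= lo) & (freqs < hi)
    return np.trapz(psd[band], freqs[band])

for name, (lo, hi) in {"1-12 Hz": (1, 12), "12-40 Hz": (12, 40),
                       "40-80 Hz": (40, 80)}.items():
    print(name, band_power(freqs, psd, lo, hi))
```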

As visual neurons receive a substantial proportion of their input recurrently [74], LFPs must capture synaptic activity that results from neuromodulatory feedback. Generally, the spectral power of gamma band (40-80 Hz) LFPs is increased when attention is directed to the recording site's RF [75-77], although it may decrease as well [78]. In area MT, this increase of gamma LFP power resembles the increase in firing rates of neurons in the same vicinity; however, beta band (12-24 Hz) LFP power from the same recordings behaves differently in the face of attention – decreasing instead of increasing [79]. The decrease in low-frequency LFP power is thought to result from the same attention-related feedback that causes neurons to decorrelate from each other and reduce the variability of their responses [80]; this hypothesis is bolstered by the observation that the coherence of beta band LFPs from visual areas at different stages of hierarchical processing is strengthened by goal-directed attention [81, 82]. The relationship of beta band LFPs to top-down neuromodulatory signals – and their distinct response to them – make beta LFPs a good candidate measurement of the top-down input that arrives in sensory cortex.

The first demonstration that LFP spectral power is sensitive to the subject's upcoming perceptual behavior was made from recordings in MT by Liu and Newsome [83]. Their innovation was to use distributions of spectral power rather than spike counts to compute the behavioral sensitivity of LFPs – by using a ROC curve to compare the distribution of power from trials when the subject reported the preferred stimulus of the recording site versus the distribution of power from trials when the subject reported the null stimulus. As a result, LFP behavioral sensitivity was a function of LFP frequency. Liu and Newsome observed that LFPs above 50 Hz had sensitivities above 0.5, indicating more spectral power when the subject was about to choose the preferred stimulus, which was similar to the positive spike-count behavioral sensitivities of MT neurons. Lower frequency LFPs, within the realm of alpha (8-12 Hz) and beta (12-24 Hz), showed a distinct behavior: they had sensitivities less than 0.5, which indicated less spectral power when the subject was about to choose the preferred stimulus. Since top-down neuromodulatory signals can explain this result, it may be the first demonstration that correlations between neural activity and perceptual behavior come from a top-down mechanism. However, it was also found that a trial-to-trial shift of LFP spectral power from lower frequencies to high frequencies could explain this result, without top-down signals.
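A Liu-and-Newsome-style analysis can be sketched as follows: single-trial spectra are estimated, and at each frequency an ROC area compares the power distributions for the two choices. The trial data and choices below are random placeholders, so the sensitivities hover near 0.5; with real recordings, the pattern of sensitivity across frequency is the quantity of interest.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(5)
fs, n_trials = 1000.0, 200

trials = rng.standard_normal((n_trials, int(fs)))   # stand-in 1 s LFP per trial
chose_preferred = rng.random(n_trials) > 0.5        # placeholder choices

# One power spectrum per trial.
freqs, psd = welch(trials, fs=fs, nperseg=256, axis=-1)

def roc_area(x, y):
    return ((y[:, None] > x[None, :]).mean()
            + 0.5 * (y[:, None] == x[None, :]).mean())

# LFP behavioral sensitivity as a function of frequency: the separability of
# spectral power on preferred-choice trials versus null-choice trials.
sensitivity = np.array([
    roc_area(psd[~chose_preferred, i], psd[chose_preferred, i])
    for i in range(len(freqs))
])
print(np.round(sensitivity[:8], 3))
```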


## **9. Bottom-up and top-down processes work in sequence to produce behavioral sensitivity**

Studies of behavioral sensitivity have had difficulty validating either the bottom-up or top-down hypothesis. But certain aspects of their experimental designs may have led to ambiguity. For example, many key studies have used long-duration stimuli of 500 ms or more [13-16, 19-23, 25, 27, 30, 83]. This is problematic when trying to distinguish between a bottom-up and top-down mechanism – because it is impossible to tell when, or even whether, the neuron is responding to the stimulus or to top-down attentional signals. Recent studies, however, have begun to dissect bottom-up from top-down processes. They suggest that both mechanisms contribute to behavioral sensitivity, one after the other (illustrated in Figure 5).

**Figure 5.** *Bottom-up and top-down contributions to behavioral sensitivity are sequential.* **A**, The order of perceptual events during the presentation of a long-duration stimulus, such as in [27]. The long duration stimulus allows a neuron's behavioral sensitivity to be dominated by top-down contributions. **B**, The order of perceptual events following the presentation of a brief stimulus, as in [35]. Because the subject tends to respond before top-down contributions take effect, a neuron's behavioral sensitivity is dominated by bottom-up contributions.

Nienborg and Cumming have demonstrated two contrasting processes by developing a psychophysical, reverse-correlation technique and comparing the results with the behavioral sensitivity of V2 neurons [27, 39]. Psychophysical reverse-correlation provided an estimate of the subject's strategy when performing a disparity discrimination task. More importantly, it estimated when the subject was accumulating stimulus information. They found that subjects made the most use of stimulus information early in the trial – with later information being used progressively less, even though it was equally useful. In disagreement with a bottom-up prediction, they found that behavioral sensitivity moved in the opposite direction; it was weakest at the start of the trial, and then grew for approximately 500 ms before plateauing. A similar rise in behavioral sensitivity over time has been observed in other studies [13, 17, 20, 21, 23, 28, 30]. Together, these observations suggest an early, sensory accumulation process followed by a later process that was predictive of the subject's upcoming behavior.
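The core of psychophysical reverse-correlation can be sketched in a few lines: average the stimulus noise separately for each choice, and the difference (the psychophysical kernel) shows when in the trial the subject weighted stimulus information. Everything below is a random stand-in; real kernels, as in [27, 39], are computed from the actual frame-by-frame disparity noise.

```python
import numpy as np

rng = np.random.default_rng(6)
n_trials, n_frames = 1000, 50

stim_noise = rng.standard_normal((n_trials, n_frames))  # per-frame stimulus noise
chose_near = rng.random(n_trials) > 0.5                 # placeholder "near" reports

# Psychophysical kernel: mean noise preceding "near" choices minus mean noise
# preceding "far" choices. Its amplitude at each frame estimates how strongly
# that moment of the stimulus influenced the decision.
kernel = (stim_noise[chose_near].mean(axis=0)
          - stim_noise[~chose_near].mean(axis=0))
print(np.round(kernel[:10], 3))
```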


Rapid sensory accumulation and evaluation is necessary to explain any perceptual decision that is made before the late rise in behavioral sensitivity [84]. Accordingly, Nienborg and Cumming [27] observed behavioral sensitivity that was significantly greater than 0.5 within the first 500 ms of stimulation, when the subject made the most use of the stimulus information. But because their stimulus was long in duration and always appeared at the same time and location, it is impossible to know if the early component of behavioral sensitivity resulted from the early, sensory-accumulation process or from the late, behaviorally sensitive process (illustrated in Figure 5A).

Another recent study looked exclusively at the behavioral sensitivity of MT neurons during early sensory accumulation, while subjects performed a motion-detection, reaction-time task [35]. This was achieved by using a very short (50 ms) motion signal that occurred at an unpredictable time and location. Unlike in studies that used long stimulus presentation times, the neural sensitivity to behavior peaked approximately 100 ms after the motion signal began, during the short burst of spikes that neurons fired in response to the motion signal. Critically, both the transient neural response and the subject's perceptual report (median RT = 400 ms) occurred well before the time at which other studies established a late rise in behavioral sensitivity. Lastly, all aspects of the behavioral sensitivity time courses and of the subject's perceptual behavior were well accounted for by a purely bottom-up model. Together, these results suggest that sensory neurons have a bottom-up link to perceptual behavior during the early, sensory-accumulation process (illustrated in Figure 5B).
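Time-resolved behavioral sensitivity of the kind reported in [35] can be computed with a sliding-window ROC over the spike rasters. The sketch below uses placeholder Poisson spikes and random hit/miss outcomes; with real data, the timecourse would peak roughly 100 ms after a brief motion pulse if sensitivity is dominated by the early bottom-up response.

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials, n_bins = 300, 60                       # e.g. 60 bins of 10 ms
spikes = rng.poisson(0.5, (n_trials, n_bins))    # placeholder spike rasters
detected = rng.random(n_trials) > 0.5            # placeholder hit/miss outcomes

def roc_area(x, y):
    return ((y[:, None] > x[None, :]).mean()
            + 0.5 * (y[:, None] == x[None, :]).mean())

window = 5                                       # 50 ms sliding window
timecourse = [
    roc_area(spikes[~detected, t:t + window].sum(axis=1),
             spikes[detected, t:t + window].sum(axis=1))
    for t in range(n_bins - window + 1)
]
print(np.round(timecourse[:5], 3))
```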

The late rise in behavioral sensitivity has parallels in other studies as well. In V1 neurons, spikes fired in response to a contour are able to distinguish whether it is the target or distractor stimulus, approximately 200-600 ms after the contour appeared [85, 86]. Similarly, the aperture problem is disentangled by MT firing rates approximately 150 ms after stimulus motion begins [8]. Of special interest is the recent evidence that stimulus information arrives in MT from outside the RF through top-down channels, approximately 400-500 ms following the onset of test stimuli in a match-to-sample task [87]. In these cases, late sensory neural activity carries information that was not originally available in the initial transient response, but likely arrives from a top-down source that has solved the problem at hand. These observations further support the idea that a top-down source of behavioral sensitivity engages sensory neurons following the initial transient response. While attention-like signals could fill the role of a top-down source following the start of the stimulus, they would have to allow for other attentional effects that are present earlier in the trial [33, 34, 52, 88].



## **10. Conclusion**

To study the neural correlates of sensory perception in visual cortex, it is first necessary to understand how neural activity becomes selective for the upcoming perceptual behavior of a subject. This is referred to as a neuron's behavioral sensitivity. There has been debate in the literature about whether the brain links sensory neural activity to perceptual behavior using a bottom-up or top-down mechanism. New results suggest that both mechanisms are active, but in a sequential fashion. The initial, transient responses of sensory neurons have a direct, bottom-up impact upon the subject's behavior. Later responses reflect top-down signals that are linked to high-order processes, and attention meets the criteria necessary to drive top-down behavioral sensitivity. Although potentially difficult, new studies are needed to carefully distinguish and compare the contribution of both processes to the behavioral sensitivity of visual cortical neurons.

## **Author details**

Jackson E. T. Smith and Erik P. Cook *Department of Physiology, McGill University, Montreal, Québec* 

Nicolas Y. Masse *Department of Neurobiology, University of Chicago, Chicago, Illinois* 

Chang'an A. Zhan *School of Biomedical Engineering, Southern Medical University, Guangzhou, China* 

## **Acknowledgement**

The authors wish to thank John Assad and Todd Herrington for the speed-pulse data used in Figure 4. The authors also acknowledge funding support from the Canadian Institutes of Health Research, The Natural Sciences and Engineering Research Council of Canada, and The EJLB Foundation.

## **Appendix: Area under the receiver operating characteristic (ROC) curve**

The method used to quantify the discrimination sensitivity of MT neurons in a 2AFC task [9, 12] has since become a common technique of behavioral neurophysiology, and forms the basis of all major results reported in this study. The key question is whether the neurons fired more spikes in one condition than in another. Traditional parametric methods for answering this question place restrictive assumptions on the statistics of neural activity; for example, neurons do not always resemble a Poisson process [89]. Receiver operating characteristic (ROC) curves provide an unbiased, non-parametric way of quantifying the difference between any two distributions [90] – in this case, the number of spikes fired by a neuron on one set of trials versus another. Figure 2 illustrates how a ROC curve is used to quantify the difference between two distributions of spike counts (Figure 2A).

Faced with the problem of classifying a randomly sampled spike count as being from either distribution **X** (filled) or **Y** (open), the strategy adopted by the ideal observer is to choose a criterion level of *c* and assign any spike count less than *c* to **X**, and any spike count above *c* to **Y**. All possible values of *c* between 0 and the maximum spike count (*c*max) must be tested to find the optimal criterion that makes correct classifications the most often.

To do so, a ROC curve is built by plotting the probability that spike counts sampled from **X** are greater than *c* against the probability that spike counts sampled from **Y** are greater than *c* (Figure 2B). When *c* = 0, all spike counts are greater than *c*, thus the beginning point of the ROC curve is always (1,1). As *c* is increased (Figure 2B, arrow) the performance of the ideal observer using that criterion level is plotted. When *c* hits *c*max, no spike count is greater than *c* and the end point of the curve is (0,0).

The area under the ROC curve (Figure 2B, grey shading) is the probability that the ideal observer will correctly classify any given spike count from either distribution, and ranges between 0 and 1 accordingly; this probability is 0.75 for distributions **X** and **Y** from Figure 2A. Therefore, when **X** and **Y** are completely distinct from each other, the ideal observer correctly classifies 100% of all spike counts (area = 1, Figure 2C, right). On the other hand, if there is no distinction between **X** and **Y**, then the ideal observer has a 50% chance of correct classification – a coin toss (area = 0.5, left). If **X** and **Y** from Figure 2A switch positions then the difference between them remains the same; this is reflected by the area under the ROC curve, which is an equal distance below 0.5 after the switch (0.25 = 0.5 - 0.25, solid curve) as it was before (0.75 = 0.5 + 0.25, dotted curve).
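The construction just described translates directly into code. The sketch below sweeps the criterion *c* over every possible count, traces the curve from (1, 1) to (0, 0), and integrates; the two Poisson distributions are illustrative stand-ins for the spike counts of Figure 2A.

```python
import numpy as np

def roc_area(x_counts, y_counts):
    """Area under the ROC curve separating two spike-count distributions."""
    x, y = np.asarray(x_counts), np.asarray(y_counts)
    cmax = max(x.max(), y.max())
    # Sweep the criterion; starting below zero puts the first point at (1, 1),
    # and c = cmax puts the last point at (0, 0).
    criteria = np.arange(-1, cmax + 1)
    p_x = np.array([(x > c).mean() for c in criteria])  # abscissa of the curve
    p_y = np.array([(y > c).mean() for c in criteria])  # ordinate of the curve
    # p_x runs from 1 down to 0; reverse so the integration axis increases.
    return np.trapz(p_y[::-1], p_x[::-1])

rng = np.random.default_rng(8)
x = rng.poisson(4, 500)   # distribution X (lower counts)
y = rng.poisson(6, 500)   # distribution Y (higher counts)
print(f"area = {roc_area(x, y):.3f}")   # above 0.5 because Y sits above X
```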

## **Abbreviations**

2AFC – two-alternative forced choice

LFP – local field potential

LGN – lateral geniculate nucleus

LIP – lateral intraparietal area of visual cortex

MST – medial superior temporal area of visual cortex

MT (i.e. V5) – middle temporal area of visual cortex

RF – receptive field

ROC – receiver operating characteristic

RT – reaction time

VIP – ventral intraparietal area of visual cortex



## **11. References**


[1] Parker, A.J. and W.T. Newsome, *Sense and the single neuron: probing the physiology of perception.* Annual Review of Neuroscience, 1998. 21(1): p. 227-277.

[2] Van Essen, D.C., J.H.R. Maunsell, and J.L. Bixby, *The middle temporal visual area in the macaque: myeloarchitecture, connections, functional properties and topographic organization.* Journal of Comparative Neurology, 1981. 199(3): p. 293-326.

[3] Felleman, D.J. and D.C. Van Essen, *Distributed hierarchical processing in the primate cerebral cortex.* Cereb Cortex, 1991. 1(1): p. 1-47.

[4] Maunsell, J.H. and D.C. Van Essen, *Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation.* J Neurophysiol, 1983. 49(5): p. 1127-47.

[5] Allman, J., F. Miezin, and E. McGuinness, *Direction- and velocity-specific responses from beyond the classical receptive field in the middle temporal visual area (MT).* Perception, 1985. 14(2): p. 105-26.

[6] DeAngelis, G.C. and W.T. Newsome, *Organization of disparity-selective neurons in macaque area MT.* J Neurosci, 1999. 19(4): p. 1398-415.

[7] Movshon, J.A. and W.T. Newsome, *Visual response properties of striate cortical neurons projecting to area MT in macaque monkeys.* J Neurosci, 1996. 16(23): p. 7733-41.

[8] Pack, C.C. and R.T. Born, *Temporal dynamics of a neural solution to the aperture problem in visual area MT of macaque brain.* Nature, 2001. 409(6823): p. 1040-2.

[9] Britten, K.H., et al., *The analysis of visual motion: a comparison of neuronal and psychophysical performance.* Journal of Neuroscience, 1992. 12(12): p. 4745-4765.

[10] Newsome, W.T. and E.B. Pare, *A selective impairment of motion perception following lesions of the middle temporal visual area (MT).* J Neurosci, 1988. 8(6): p. 2201-11.

[11] Salzman, C.D., et al., *Microstimulation in visual area MT: effects on direction discrimination performance.* J Neurosci, 1992. 12(6): p. 2331-55.

[12] Newsome, W.T., K.H. Britten, and J.A. Movshon, *Neuronal correlates of a perceptual decision.* Nature, 1989. 341(6237): p. 52-4.

[13] Cook, E.P. and J.H.R. Maunsell, *Dynamics of neuronal responses in macaque MT and VIP during motion detection.* Nature Neuroscience, 2002. 5(10): p. 985-994.

[14] Celebrini, S. and W.T. Newsome, *Neuronal and psychophysical sensitivity to motion signals in extrastriate area MST of the macaque monkey.* Journal of Neuroscience, 1994. 14(7): p. 4109-4124.

[15] Britten, K.H., et al., *A relationship between behavioral choice and the visual responses of neurons in macaque MT.* Visual Neuroscience, 1996. 13(1): p. 87-100.

[16] Purushothaman, G. and D.C. Bradley, *Neural population code for fine perceptual decisions in area MT.* Nature Neuroscience, 2005. 8(1): p. 99-106.

[17] Cohen, M.R. and W.T. Newsome, *Estimates of the contribution of single neurons to perception depend on timescale and noise correlation.* Journal of Neuroscience, 2009. 29(20): p. 6635-6648.

[18] Law, C.T. and J.I. Gold, *Neural correlates of perceptual learning in a sensory-motor, but not a sensory, cortical area.* Nature Neuroscience, 2008. 11(4): p. 505-513.

[19] Liu, J. and W.T. Newsome, *Correlation between speed perception and neural activity in the middle temporal visual area.* J Neurosci, 2005. 25(3): p. 711-22.

[20] Price, N.S.C. and R.T. Born, *Timescales of sensory- and decision-related activity in the middle temporal and medial superior temporal areas.* Journal of Neuroscience, 2010. 30(42): p. 14036-14045.

[21] Uka, T. and G.C. DeAngelis, *Contribution of area MT to stereoscopic depth perception: choice-related response modulations reflect task strategy.* Neuron, 2004. 42(2): p. 297-310.

[22] Sasaki, R. and T. Uka, *Dynamic readout of behaviorally relevant signals from area MT during task switching.* Neuron, 2009. 62(1): p. 147-157.

[23] Dodd, J.V., et al., *Perceptually bistable three-dimensional figures evoke high choice probabilities in cortical area MT.* J Neurosci, 2001. 21(13): p. 4809-21.

[24] Parker, A.J., K. Krug, and B.G. Cumming, *Neuronal activity and its links with the perception of multi-stable figures.* Philosophical Transactions of the Royal Society of London Series B-Biological Sciences, 2002. 357(1424): p. 1053-1062.

[25] Krug, K., B.G. Cumming, and A.J. Parker, *Comparing perceptual signals of single V5/MT neurons in two binocular depth tasks.* Journal of Neurophysiology, 2004. 92(3): p. 1586-1596.

[26] Nienborg, H. and B.G. Cumming, *Macaque V2 neurons, but not V1 neurons, show choice-related activity.* Journal of Neuroscience, 2006. 26(37): p. 9567-9578.

[27] Nienborg, H. and B.G. Cumming, *Decision-related activity in sensory neurons reflects more than a neuron's causal effect.* Nature, 2009. 459(7243): p. 89-92.

[28] Gu, Y., D.E. Angelaki, and G.C. DeAngelis, *Neural correlates of multisensory cue integration in macaque MSTd.* Nat Neurosci, 2008. 11(10): p. 1201-10.

[29] Gu, Y., et al., *Perceptual learning reduces interneuronal correlations in macaque visual cortex.* Neuron, 2011. 71(4): p. 750-761.

[30] Bosking, W.H. and J.H. Maunsell, *Effects of stimulus direction on the correlation between behavior and single units in area MT during a motion detection task.* J Neurosci, 2011. 31(22): p. 8230-8.

[31] Herrington, T.M. and J.A. Assad, *Neural activity in the middle temporal area and lateral intraparietal area during endogenously cued shifts of attention.* Journal of Neuroscience, 2009. 29(45): p. 14160-14176.

[32] Palmer, C., S.Y. Cheng, and E. Seidemann, *Linking neuronal and behavioral performance in a reaction-time visual detection task.* Journal of Neuroscience, 2007. 27(30): p. 8122-8137.

[33] Cohen, M.R. and J.H.R. Maunsell, *A neuronal population measure of attention predicts behavioral performance on individual trials.* Journal of Neuroscience, 2010. 30(45): p. 15241-15253.

[34] Cohen, M. and J. Maunsell, *Using neuronal populations to study the mechanisms underlying spatial and feature attention.* Neuron, 2011. 70(6): p. 1192-1204.

[35] Smith, J.E., C.A. Zhan, and E.P. Cook, *The functional link between area MT neural fluctuations and detection of a brief motion stimulus.* J Neurosci, 2011. 31(38): p. 13458-68.

[36] Shadlen, M.N., et al., *A computational analysis of the relationship between neuronal and behavioral responses to visual motion.* Journal of Neuroscience, 1996. 16(4): p. 1486-1510.

[37] Law, C.T. and J.I. Gold, *Reinforcement learning can account for associative and perceptual learning on a visual-decision task.* Nature Neuroscience, 2009. 12(5): p. 655-663.

[38] Ghose, G.M. and I.T. Harrison, *Temporal precision of neuronal information in a rapid perceptual judgment.* Journal of Neurophysiology, 2009. 101(3): p. 1480-1493.

[39] Nienborg, H. and B.G. Cumming, *Psychophysically measured task strategy for disparity discrimination is reflected in V2 neurons.* Nature Neuroscience, 2007. 10(12): p. 1608-1614.

[40] de Lafuente, V. and R. Romo, *Neuronal correlates of subjective sensory experience.* Nature Neuroscience, 2005. 8(12): p. 1698-1703.

[41] Hernandez, A., et al., *Decoding a perceptual decision process across cortex.* Neuron, 2010. 66(2): p. 300-314.

Linking Neural Activity to Visual Perception: Separating Sensory and Attentional Contributions 183

[61] Nelson, M.J. and P. Pouget, *Do electrode properties create a problem in interpreting local field* 

[62] Logothetis, N.K., *The underpinnings of the BOLD functional magnetic resonance imaging* 

[63] Siegel, M., T.H. Donner, and A.K. Engel, *Spectral fingerprints of large-scale neuronal* 

[64] Monosov, I.E., J.C. Trageser, and K.G. Thompson, *Measurements of simultaneously recorded spiking activity and local field potentials suggest that spatial selection emerges in the* 

[65] Viswanathan, A. and R.D. Freeman, *Neurometabolic coupling in cerebral cortex reflects synaptic more than spiking activity.* Nature Neuroscience, 2007. 10(10): p. 1308-1312. [66] Khawaja, F.A., J.M.G. Tsui, and C.C. Pack, *Pattern Motion Selectivity of Spiking Outputs and Local Field Potentials in Macaque Visual Cortex.* Journal of Neuroscience, 2009. 29(43):

[67] Ray, S. and J.H.R. Maunsell, *Different Origins of Gamma Rhythm and High-Gamma Activity* 

[68] Ray, S., et al., *Neural Correlates of High-Gamma Oscillations (60-200 Hz) in Macaque Local Field Potentials and Their Potential Implications in Electrocorticography.* Journal of

[69] Rasch, M.J., et al., *Inferring spike trains from local field potentials.* Journal of

[70] Ray, S., et al., *Effect of stimulus intensity on the spike-local field potential relationship in the secondary somatosensory cortex.* Journal of Neuroscience, 2008. 28(29): p. 7334-7343. [71] Whittingstall, K. and N.K. Logothetis, *Frequency-band coupling in surface EEG reflects* 

[72] Belitski, A., et al., *Low-frequency local field potentials and spikes in primary visual cortex convey independent visual information.* Journal of Neuroscience, 2008. 28(22): p. 5696-5709. [73] Belitski, A., et al., *Sensory information in local field potentials and spikes from visual and auditory cortices: time scales and frequency bands.* J Comput Neurosci, 2010. 29(3): p. 533-45. [74] Sillito, A.M., J. Cudeiro, and H.E. Jones, *Always returning: feedback and sensory processing* 

[75] Womelsdorf, T., et al., *Gamma-band synchronization in visual cortex predicts speed of change* 

[76] Rotermund, D., et al., *Attention Improves Object Representation in Visual Cortical Field* 

[77] Fries, P., et al., *Modulation of oscillatory neuronal synchronization by selective visual* 

[78] Chalk, M., et al., *Attention Reduces Stimulus-Driven Gamma Frequency Oscillations and* 

[79] Khayat, P.S., R. Niebergall, and J.C. Martinez-Trujillo, *Frequency-Dependent Attentional Modulation of Local Field Potential Signals in Macaque Area MT.* Journal of Neuroscience,

*spiking activity in monkey visual cortex.* Neuron, 2009. 64(2): p. 281-9.

*in visual cortex and thalamus.* Trends Neurosci, 2006. 29(6): p. 307-16.

*Potentials.* Journal of Neuroscience, 2009. 29(32): p. 10120-10130.

*Spike Field Coherence in V1.* Neuron, 2010. 66(1): p. 114-125.

*potential recordings?* J Neurophysiol, 2010. 103(5): p. 2315-7.

*signal.* Journal of Neuroscience, 2003. 23(10): p. 3963-3971.

*interactions.* Nat Rev Neurosci, 2012.

p. 13702-13709.

*frontal eye field.* Neuron, 2008. 57(4): p. 614-625.

*in Macaque Visual Cortex.* Plos Biology, 2011. 9(4).

Neuroscience, 2008. 28(45): p. 11526-11536.

Neurophysiology, 2008. 99(3): p. 1461-1476.

*detection.* Nature, 2006. 439(7077): p. 733-736.

*attention.* Science, 2001. 291: p. 1560-1563.

2010. 30(20): p. 7037-7048.


[61] Nelson, M.J. and P. Pouget, *Do electrode properties create a problem in interpreting local field potential recordings?* J Neurophysiol, 2010. 103(5): p. 2315-7.

182 Visual Cortex – Current Status and Perspectives

66(2): p. 300-314.

[41] Hernandez, A., et al., *Decoding a Perceptual Decision Process across Cortex.* Neuron, 2010.

[42] Cumming, B.G. and A.J. Parker, *Responses of primary visual cortical neurons to binocular* 

[43] DeAngelis, G.C., B.G. Cumming, and W.T. Newsome, *Cortical area MT and the perception* 

[44] Krug, K., *A common neuronal code for perceptual visual cortex? Comparing choice and processes in attentional correlates in V5/MT.* Philosophical Transactions of the Royal

[45] Herrington, T.M., et al., *The Effect of Microsaccades on the Correlation between Neural Activity and Behavior in Middle Temporal, Ventral Intraparietal, and Lateral Intraparietal* 

[46] White, J.A., J.T. Rubinstein, and A.R. Kay, *Channel noise in neurons.* Trends in

[47] Sanchez-Vives, M.V. and D.A. McCormick, *Cellular and network mechanisms of rhythmic* 

[48] Timofeev, I., et al., *Origin of slow cortical oscillations in deafferented cortical slabs.* Cereb

[49] Bair, W., E. Zohary, and W.T. Newsome, *Correlated firing in macaque visual area MT: Time scales and relationship to behavior.* Journal of Neuroscience, 2001. 21(5): p. 1676-1697. [50] Averbeck, B.B., P.E. Latham, and A. Pouget, *Neural correlations, population coding and* 

[51] Vaadia, E., et al., *Dynamics of Neuronal Interactions in Monkey Cortex in Relation to* 

[52] Cohen, M.R. and J.H.R. Maunsell, *Attention improves performance primarily by reducing* 

[53] Roelfsema, P.R., V.A. Lamme, and H. Spekreijse, *Synchrony and covariation of firing rates in the primary visual cortex during contour grouping.* Nat Neurosci, 2004. 7(9): p. 982-991. [54] Romo, R., et al., *Correlated neuronal discharges that increase coding efficiency during* 

[55] Nienborg, H. and B. Cumming, *Correlations between the activity of sensory neurons and behavior: how much do they tell us about a neuron's causality?* Current Opinion in

[56] Knudsen, E.I., *Fundamental components of attention.* Annual Review of Neuroscience,

[57] Katzner, S., et al., *Local Origin of Field Potentials in Visual Cortex.* Neuron, 2009. 61(1): p.

[58] Xing, D.J., C.I. Yeh, and R.M. Shapley, *Spatial Spread of the Local Field Potential and its Laminar Variation in Visual Cortex.* Journal of Neuroscience, 2009. 29(37): p. 11540-11549. [59] Mountcastle, V.B., *The columnar organization of the neocortex.* Brain, 1997. 120: p. 701-722. [60] Logothetis, N.K., C. Kayser, and A. Oeltermann, *In vivo measurement of cortical impedance spectrum in monkeys: Implications for signal propagation.* Neuron, 2007. 55(5): p. 809-823.

Society of London Series B-Biological Sciences, 2004. 359(1446): p. 929-941.

*disparity without depth perception.* Nature, 1997. 389(6648): p. 280-3.

*of stereoscopic depth.* Nature, 1998. 394(6694): p. 677-80.

*Areas.* Journal of Neuroscience, 2009. 29(18): p. 5793-5805.

*recurrent activity in neocortex.* Nat Neurosci, 2000. 3(10): p. 1027-34.

*computation.* Nature Reviews Neuroscience, 2006. 7(5): p. 358-366.

*interneuronal correlations.* Nat Neurosci, 2009. 12(12): p. 1594-1600.

*Behavioral Events.* Nature, 1995. 373(6514): p. 515-518.

*perceptual discrimination.* Neuron, 2003. 38(4): p. 649-657.

Neurobiology, 2010. 20(3): p. 376-381.

2007. 30: p. 57-78.

35-41.

Neurosciences, 2000. 23(3): p. 131-137.

Cortex, 2000. 10(12): p. 1185-99.


[80] Harris, K.D. and A. Thiele, *Cortical state and attention.* Nature Reviews Neuroscience, 2011. 12(9): p. 509-523.

**Chapter 0**

**Chapter 8**


## **Bio-Inspired Architecture for Clustering into Natural and Non-Natural Facial Expressions**

Claudio Castellanos Sánchez, Manuel Hernández Hernández and Pedro Luis Sánchez Orellana

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48508

## **1. Introduction**


Over the past years there has been an increasing surge of interest in automated facial expression analysis, but there are several situations in which the analysis cannot be carried out successfully with the classical approaches because of variations in the environmental conditions (such as illumination changes).

One way to cope with these variations in the environmental conditions is to look at existing systems that efficiently locate, extract and analyse facial information. An example of such a system is the human brain.

According to several researchers [2, 27, 36], emotion processing takes place in the amygdala, an area located within the medial temporal lobes of the brain. Experiments carried out by [35] demonstrated that the face recognition process can be achieved with simple facial features (eyebrows and eyes vs eyebrows alone) of threatening and friendly faces. These experiments confirm that emotion processing also involves other areas, such as the visual cortex, e.g. V1 [32]. Later, [17] demonstrated that the fast, automatic, and parallel processing of unattended emotional faces provides important insights into the specific and dissociated neural pathways of emotion and face perception.

Although different brain areas are involved in the analysis of facial expressions, there is no consensus about how this process is carried out.

Clues from the psychological point of view, such as those proposed by Ekman and Friesen [9], point out the importance of symmetry in this process: facial asymmetry (with the left side of the face stronger than the right) is apparent only in deliberate, non-spontaneous (non-natural) expressions. According to this work, symmetrical expressions show a true or natural emotion, whereas asymmetrical expressions show a false or non-natural one.

In this research, we take advantage of this psychological point of view to interpret the results produced by symmetrical and asymmetrical face measures when we model the visual areas of the human brain. Furthermore, we are inspired by the early visual areas for processing the visual stream. The interaction between the neurons of the visual cortex then leads us to introduce our proposed architecture, based on the detection of asymmetrical facial expressions and their clustering into natural/non-natural expressions.

© 2012 Castellanos Sánchez et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



In this chapter, we present an architecture composed of five stages that clusters the facial expressions in a sequence of images into natural and non-natural ones.

The first stage processes a sequence of images to detect the eye and mouth corners, inspired by the sensitivity of the cone cells of the eye. With these corners we build an anthropometrical grid of six vertical bands.

The second stage is a bio-inspired treatment comprising three steps:

• Contrast and orientation detection, inspired by the simple neurons of V1 (primary visual cortex), which are modelled by Gabor-like filters.
• Integration and smoothing of contours, following the behaviour of the complex neurons of V1.
• The coherent, integrated activity of the neurons of MT (middle temporal area), which inspires us to detect motion through a temporal interaction of our consecutive complex neurons.


The third stage computes the ratio between the quantity of active neurons and the maximum quantity of neurons in each of the six bands built in the first stage, a process inspired by the integration of the visual stream in IT (the inferotemporal area).

The fourth stage compares each symmetrical band with an empirical threshold *κ* obtained in our experiments.

The final stage clusters the expressions into natural/non-natural ones. This stage feeds the results of the fourth stage into a bio-inspired neural network model.

In the following sections, we present an overview of the state of the art in facial expression research. We then present our proposed architecture for analysing the asymmetrical responses of faces in a sequence of images. We also describe the experiments, discuss the results and finally comment on future work.

## **2. State of the art**

The analysis of facial expressions constitutes a critical and complex part of our non-verbal social interactions. Over the past years there has therefore been an increasing surge of interest in creating tools that analyse facial expressions automatically. This analysis is a complex task: on the one hand, the facial expression structure of human beings differs from one person to another; on the other hand, faces are analysed from sequences of images captured in different environments, with variations in lighting, pose and scale that make the extraction of their structure difficult. Traditionally, the analysis of facial expressions has focused on identifying emotions; however, facial deformation and motion can also originate from factors other than emotion (such as verbal communication or fatigue). Emotions can be evaluated from two main perspectives: the psychological, physiological and neurophysiological points of view versus the computational point of view.

### **2.1. Psychological and physiological perspective**


From the psychological and physiological perspective, emotions can be analysed by considering the symmetry of the expressions. According to Ekman and Friesen [9], the veracity (naturalness) of an emotion can be evaluated through the analysis of symmetry. In the brain, information about emotions is processed in the amygdala [2, 27, 36], an area located within the medial temporal lobes. Using fMRI, some researchers [1, 13, 32] have discovered that processing in this area is modulated by the significance of faces, particularly fearful expressions, and also by cues such as gaze direction. Visual search paradigms have provided evidence of the enhanced capture of attention by threatening faces. Experiments carried out by [35] demonstrated that the recognition process can be achieved with simple facial features (expressions of eyebrows and eyes together vs only eyebrows) of threatening and friendly faces. These experiments confirm that emotion processing also involves other areas, such as the visual cortex, e.g. V1 [32].

### **2.2. Neurophysiological perspective**

According to neurophysiological studies [15, 16], the task of facial expression recognition also involves some regions of the visual cortex (the ventral and dorsal pathways). The neurons in these areas respond mainly to gesticulations and identity. In the case of gesticulations, neurons in the superior temporal sulcus process the dynamics of the face; in the case of identity, neurons in the inferior temporal gyrus analyse invariant facial features. In both cases the processing is modulated by attentional mechanisms [37]. Later, [17] demonstrated that the fast, automatic, and parallel processing of unattended emotional faces provides important insights into the specific and dissociated neural pathways of emotion and face perception.

Figure 1 shows the main pathways for processing the visual stream in the human cortex. The dorsal pathway analyses motion and localisation information, while the ventral pathway analyses information about form and colour. Both pathways have the same source, V1. The visual cortex is the most recent specialised layer in mammalian evolution; the limbic system (not shown in the figure) is the first and most primitive layer in mammals, and within it the amygdala is situated in the temporal lobes, covered by the visual cortex.

**Figure 1.** The visual pathway. The information comes in from the retina, is next integrated in the LGN, treated in V1, and finally processed in two different pathways: dorsal (MT, MST, etc.) and ventral (V4, IT, etc.). The amygdalae are almond-shaped groups of nuclei located deep within the medial temporal lobes that perform a primary role in the processing and memory of emotional reactions.

### **2.3. Computational perspective**

From the computational point of view, facial expression recognition implies the classification of facial motion and facial structure deformation into abstract representations based entirely on visual information. To address this analysis, two main tasks must be performed: face detection and facial feature extraction. For the first task, techniques have been developed to deal with the extraction and normalization of the face under variations in pose [10, 28] and illumination [3, 12]. However, according to [11], the main effort has been focused on facial feature extraction, where techniques such as appearance-based models [6, 18, 19] work with significant variations in the acquired face images without normalization.

Basically, facial feature extraction has been approached by two main streams [11, 23]: facial deformation extraction models and facial motion extraction models. Motion extraction approaches focus on the facial changes produced by facial expressions, whereas deformation-based methods contrast the visual information against face models to extract features produced by expressions rather than by age-related deformations such as wrinkles. The main difference between these two methods is that deformation-based methods can be applied to single images as well as image sequences, whereas motion methods need a sequence of images. In both cases the facial features may be processed holistically (the face is analysed as a whole) or locally (focusing on features from specific areas). In the case of deformation methods, the extraction of features relies on shape and texture changes over a period of time. Holistic approaches either use the whole face [7, 25] or partial information about regions of the face, such as the mouth or eyes [14, 21, 31]. For the motion extraction methods, the features are extracted by analysing motion vectors, obtained either holistically [8, 29] or locally [26, 34, 39].


## **3. Proposed bio-inspired architecture**

A general diagram of the proposed methodology is shown in figure 2. In the beginning, a sequence of images or video is captured in RGB format. Next, a processing step based on the green and red colours is performed to detect the eye and mouth corners. In this stage, we generate a 6 × 5 grid based on the coordinates of the eye and mouth corners. In this chapter we chose only the six vertical bands for our experiments; the other combinations for analysing the anthropometrical grid are the subject of our future analysis.

**Figure 2.** General architecture of the proposed bio-inspired model. After video acquisition, a pre-processing step generates an anthropometrical grid. Next, a treatment inspired by V1 determines the active neurons. Then, we compare the symmetrical region pairs (ears, eyes and nose) to extract the vector composed of the ratios of the quantity of active neurons. Finally, the detected asymmetries are evaluated with a SOM to cluster into natural or non-natural facial expressions.

Our bio-inspired treatment obtains the simple and complex neuron responses following the function of the basic neurons of the primary visual cortex (V1). Next, a temporal integration of the complex neuron responses simulates the neurons of the middle temporal cortex (MT).

Then, we extract the active neurons of each vertical band and compute the ratio of the quantity of active neurons. Next, each vertical band is compared with its symmetrical band.

Finally, we detect the asymmetries for each pair of vertical bands and cluster them into three classes by applying self-organizing maps (SOM).

In the next subsections we give a brief explanation of the steps of our approach; throughout, we assume that the face has been located correctly.

### **3.1. Description of input image sequences**

In our proposed approach, the images of each sequence are in RGB colour, because we use colour during the pre-processing to detect the eye and mouth corners; the process is described below. We suggest capturing at 25 frames per second and 1024 × 768 pixels; high sampling rates and high resolution are desirable.

### **3.2. Pre-processing**


In this stage we perform three steps: (1) eye and mouth detection, (2) building the grid, and (3) using only six regions, the six vertical bands (6-RVB), for the analysis. These steps are briefly explained in the following paragraphs.

In the first step, let *I* = {*I*1, *I*2, *I*3,... *It*} be a face image sequence in which the eye and mouth corners are detected using the red and green colour bands, mimicking the sensitivity of the cone cells of our eyes. We work on the rectangle that contains the face; next, we define the point corresponding to the centre of the face (*xc*, *yc*) and two parameters (width and height) for the size of the face. We use these anthropometric measures to define three regions of interest that probably contain the mouth and the eyes (left and right). For each region we find the coordinates of the detected corners of both eyes and mouth; these coordinates are used in the next step.

In the second step, we use the coordinates of the eye and mouth corners to build an anthropometrical grid on the face. This processing yields twenty-five small regions; we also split the central part into two regions, as can be seen in figure 2, obtaining thirty small regions (the anthropometrical segmentation step, drawn with dotted black lines).

Finally, in this stage we use only the six symmetrical columns (*R*1, *R*2, *R*3, *R*4, *R*5, *R*6) of the proposed grid to analyse the expressions of the face.
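As an illustration, the following sketch shows one way the six vertical band edges could be derived from the detected corner x-coordinates; the text does not give the exact construction rules, so the function name, its arguments and the midline split are our assumptions.

```python
def six_vertical_bands(face_w, eye_lx, mouth_lx, mouth_rx, eye_rx):
    """Hypothetical reconstruction: x-edges of the six vertical bands R1..R6.
    The eye and mouth corner x-coordinates bound the inner bands, and the
    central band is split in two at the face midline, as in figure 2."""
    centre = (mouth_lx + mouth_rx) // 2           # assumed midline split
    edges = sorted({0, eye_lx, mouth_lx, centre, mouth_rx, eye_rx, face_w})
    return list(zip(edges[:-1], edges[1:]))       # [(x0, x1), ..., (x5, x6)]
```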

### **3.3. Bio-inspired treatment**

Facial expressions carry much more information than can be extracted by traditional facial expression analysis. We therefore propose, from a bio-inspired point of view, a methodology to analyse the symmetry of facial expressions. The objective is to achieve both a perspective, grounded in brain mechanisms, on the interpretation of facial motion relevance, and a methodology that takes advantage of the brain's tolerance to illumination changes. This can be divided into three main processing steps:

• Inspired by the simple neurons of V1 (primary visual cortex), modelled by Gabor-like filters, we extract the active simple neurons in eight different orientations and two different phases.
• Two simple neurons with different phases are merged to generate a complex neuron.
• A temporal integration between two consecutive complex neurons models the MT (middle temporal area) neurons. The neurons whose response exceeds a threshold are considered active neurons.


In the first proposed step, considering the processing done by the visual cortex, the first neurons that receive the stimuli from the eyes are the simple cells. Physiological evidence [5, 38] shows that neuronal populations in the primary visual cortex (V1) of mammals exhibit contrast normalization: neurons that respond strongly to simple visual stimuli, such as sinusoidal gratings, respond less well to the same stimuli when they are presented as part of a more complex stimulus that also excites other, neighbouring neurons. These neurons show a preference for specific orientations, which can be modelled computationally by oriented Gabor-like filters [4, 22]:

$$S_{\theta,\varphi}(x,y,t) = \sum_{t=0}^{t} I(x,y,t) \ast G_{\theta}(\hat{x},\hat{y},\varphi) \tag{1}$$

where *S*<sub>θ,ϕ</sub>(*x*, *y*, *t*) are the simple cell responses, obtained as the convolution between a pool of Gabor functions *G*<sub>θ</sub>(*x̂*, *ŷ*, *ϕ*) (in our case with 8 orientations and 2 phases) and image *t* of the sequence; (*x̂*, *ŷ*) are the rotated coordinates and (*x*, *y*) is a position in the image. The Gabor function *G*<sub>θ</sub>(*x̂*, *ŷ*, *ϕ*) is defined as:

$$G_{\theta}(\hat{x},\hat{y},\varphi) = \exp\left(-\frac{\hat{x}^2 + \gamma^2 \hat{y}^2}{2\sigma^2}\right)\cos\left(2\pi\frac{\hat{x}}{\lambda} + \varphi\right) \tag{2}$$

where *x̂* = *x* cos *θ* + *y* sin *θ*, *ŷ* = −*x* sin *θ* + *y* cos *θ*, *λ* is the wavelength, *θ* the orientation, *ϕ* the phase, *σ* the standard deviation and *γ* the aspect ratio.

In the second step, the responses of the two different phases of the simple cells *S*<sub>θ,ϕ</sub>(*x*, *y*, *t*) are integrated in the complex cells using a non-linear model that merges the responses. The complex cell responses are then estimated by

$$C_{\theta}(x,y,t) = \sqrt{S_{\theta,\frac{\pi}{2}}(x,y,t)^2 + S_{\theta,-\frac{\pi}{2}}(x,y,t)^2} \tag{3}$$

where *C*<sub>θ</sub>(*x*, *y*, *t*) represents the complex cell responses, and *π*/2 and −*π*/2 are the symmetric and anti-symmetric phases, respectively. The eight orientations of the complex cell responses *C*<sub>θ</sub>(*x*, *y*, *t*) are then integrated to obtain a single map of active complex neurons, *C*(*x*, *y*, *t*), for image *t*.

Finally, in the third step, these neurons are connected to MT neurons, allowing a temporal processing defined as the temporal integration between *C*(*x*, *y*, *t*) and *C*(*x*, *y*, *t* − 1); the map of active complex neurons is obtained by

$$D(x,y,t) = C(x,y,t) - C(x,y,t-1) \tag{4}$$

where *D*(*x*, *y*, *t*) is the map of active complex neurons for image *t*, and *C*(*x*, *y*, *t*) and *C*(*x*, *y*, *t* − 1) correspond to the current and previous images, respectively. We use *D*(*x*, *y*, *t*) to compute the number of active neurons.
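To make equations (1)-(4) concrete, here is a minimal Python sketch of the V1-to-MT stage, assuming greyscale frames given as 2-D NumPy arrays. The function names and parameter defaults (λ, σ, γ, kernel size) are our choices rather than the chapter's, and each frame is convolved independently, omitting the cumulative temporal sum that appears in equation (1).

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, phi, lam=8.0, sigma=4.0, gamma=0.5, size=15):
    """Gabor function of equation (2) for orientation theta and phase phi."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xh = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates
    yh = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xh**2 + gamma**2 * yh**2) / (2 * sigma**2)) \
           * np.cos(2 * np.pi * xh / lam + phi)

def active_neuron_map(frame, prev_frame, n_orient=8):
    """Equations (1)-(4): simple cells -> complex cells -> temporal map D."""
    def complex_map(img):
        c = np.zeros_like(img, dtype=float)
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            s_sym = convolve2d(img, gabor_kernel(theta, np.pi / 2), mode='same')
            s_anti = convolve2d(img, gabor_kernel(theta, -np.pi / 2), mode='same')
            c += np.sqrt(s_sym**2 + s_anti**2)      # eq. (3), summed over orientations
        return c
    return complex_map(frame) - complex_map(prev_frame)  # equation (4)
```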

### **3.4. Extraction of active neurons**

In this stage, we extract the active complex neurons from the map of active complex neurons *D*(*x*, *y*, *t*). For this, we overlap the six vertical bands (6-RVB) on the map of active neurons. We then compute the quantity of active neurons (QAN) for each region (*R*1, *R*2, *R*3, *R*4, *R*5, *R*6) and find the region with the maximum QAN. The QAN is the number of active complex neurons in a region of the image.


In this way we obtain six vectors of the quantity of active neurons (QAN): for each region, one value is the QAN obtained and one is the QAN expected. The latter is the total possible QAN in the region; the former counts only the currently active neurons, so that the ratio between them is always between 0 and 1. This information is used to compare two symmetric regions in order to detect temporal asymmetries.
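A minimal sketch of this QAN ratio computation follows, assuming *D*(*x*, *y*, *t*) is a 2-D NumPy array and the band edges come from the pre-processing stage (e.g. the `six_vertical_bands` sketch above); the activity threshold is an assumed free parameter, since the chapter does not specify it.

```python
import numpy as np

def band_activity_ratios(D, band_edges, threshold):
    """Ratio of obtained QAN to expected QAN for each vertical band.
    band_edges is a list of (left, right) x-ranges for R1..R6."""
    active = np.abs(D) > threshold                # assumed activity criterion
    ratios = []
    for left, right in band_edges:
        band = active[:, left:right]
        ratios.append(band.sum() / band.size)     # always between 0 and 1
    return ratios                                  # [R1, R2, R3, R4, R5, R6]
```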

### **3.5. Asymmetry detection**

In the next stage, we detect the asymmetries for each region of the map of active complex neurons. To detect the asymmetries in each image during two seconds using the 6-RVB, we follow these steps:

• First, we compute the difference between the symmetric regions; for example, the difference between regions *R*1 and *R*6, which are the extreme regions.
• Then, we compute the average and standard deviation of each difference of regions.
• Next, we calculate a threshold (a minimum and a maximum) using the average and standard deviation.
• Then, we detect the asymmetries for each difference of regions in the image sequence, using the rule: for each difference of regions, we verify whether the difference falls outside the threshold (i.e., below the minimum or above the maximum), and we count each such outlier in the sequence.
• Finally, we compute *Out*, the number of outliers, for each pair of regions in the image sequence.


The detected asymmetries (*Out*) in the facial expression over the image sequence are weighted for each pair of symmetrical regions (*R*1 with *R*6, *R*2 with *R*5, and *R*3 with *R*4). For each database, we propose an empirical threshold *κ* that we apply to all image sequences in the same database. This *κ* accounts for several aspects, such as the sampling frequency (we used two seconds in our experiments), personal conditions, head motion and luminance-capture conditions.
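The steps above can be condensed into a short sketch; here `ratios_a` and `ratios_b` are the per-frame activity ratios of one symmetric pair (e.g. *R*1 and *R*6) over the two-second window, and the use of exactly one standard deviation for the threshold is our assumption.

```python
import numpy as np

def count_out(ratios_a, ratios_b):
    """Out for one symmetric pair: frames whose band difference falls
    outside mean +/- one standard deviation of all the differences
    (the width of the band is an assumption of this sketch)."""
    diff = np.asarray(ratios_a) - np.asarray(ratios_b)
    lo, hi = diff.mean() - diff.std(), diff.mean() + diff.std()
    return int(np.sum((diff < lo) | (diff > hi)))   # number of outliers
```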

### **3.6. Clustering in natural/non-natural expressions**

For each database we obtain a *κ* adapted to the number of images in each sequence. We then use equation 5 to cluster the facial expressions into natural/non-natural expressions:

$$C_{\kappa} = \frac{C_A}{C_E}\,\frac{2F}{T\kappa} \tag{5}$$

where *C<sub>A</sub>*/*C<sub>E</sub>* is the ratio of the quantity of active neurons (QAN) in a region, *T* is the total number of images in a typical sequence of the database used to compute *κ*, *F* is the capture frequency of the sequences in this database, and *κ* is the index. With the *C<sub>κ</sub>* values we cluster into natural and non-natural expressions.

These values are sent to a 1D SOM that runs for 5000 epochs, with *η* decreasing from 1.000 to 0.0001, and the results were analysed statistically.
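Putting equation 5 and the clustering step together, a minimal sketch might look as follows; the exponential decay of *η* from 1.0 to 0.0001 follows the text, while the number of SOM units and the neighbourhood function are our assumptions.

```python
import numpy as np

def c_kappa(qan_ratio, F, T, kappa):
    """Equation (5): C_kappa = (C_A / C_E) * 2F / (T * kappa)."""
    return qan_ratio * 2.0 * F / (T * kappa)

def som_1d(samples, n_units=2, epochs=5000, eta_start=1.0, eta_end=1e-4):
    """Minimal 1-D SOM over scalar C_kappa values; the learning rate decays
    exponentially from eta_start to eta_end over the epochs."""
    rng = np.random.default_rng(0)
    w = rng.uniform(min(samples), max(samples), n_units)
    for e in range(epochs):
        eta = eta_start * (eta_end / eta_start) ** (e / (epochs - 1))
        x = samples[rng.integers(len(samples))]
        bmu = int(np.argmin(np.abs(w - x)))          # best-matching unit
        for j in range(n_units):                      # assumed neighbourhood
            w[j] += eta * np.exp(-abs(j - bmu)) * (x - w[j])
    return w                                          # cluster prototypes
```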

## **4. Experimental results**


We take the six columns of the proposed face grid. The symmetrical regions are: *R*1 with *R*6 (ears), *R*2 with *R*5 (eyes), and *R*3 with *R*4 (nose).

The tests of our approach used only the colour sequences of images for the CK+ database, while for FG-Net and LTI-HIT we used all sequences of images. The last database was split per question of each interview (one sequence of images per question).

Figure 3 shows a sequence of images from the CK+ database for a non-natural expression. There are 11 asymmetries in both the ears and nose regions and 9 asymmetries in the eyes region; the weighted sum is 10 = 11 ∗ 0.3 + 11 ∗ 0.2 + 9 ∗ 0.5. Because this value is higher than the fixed threshold *κ* = 7, the sequence is classified as non-natural. We ran all experiments over two seconds, independently of the sampling frequency (see Table 1), and fixed *κ* according to Table 1 for all our available databases. In our experiments this threshold tolerates the asymmetries generated by illumination, environment, and personal conditions.

**Figure 3.** Asymmetry in vertical regions for a sequence of images in CK+. The horizontal red (black) line represents the data; the horizontal green (gray) line, the difference between symmetrical regions; the separation between the horizontal dashed lines is the symmetrical tolerance; and the vertical dotted lines show the asymmetries (shown only in the left graphics).
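The weighted decision can be reproduced in a few lines. The pair weights below (ears 0.2, eyes 0.5, nose 0.3) are not stated explicitly in the text; they are inferred as the only assignment consistent with both worked examples (this one and the FG-Net example in Section 4.2):

```python
# Pair weights inferred from the two worked examples in the text:
# ears (R1,R6) -> 0.2, eyes (R2,R5) -> 0.5, nose (R3,R4) -> 0.3.
WEIGHTS = {"ears": 0.2, "eyes": 0.5, "nose": 0.3}

def classify_sequence(out_counts, kappa):
    """Weighted sum of per-pair asymmetry counts against the database threshold."""
    score = sum(WEIGHTS[pair] * n for pair, n in out_counts.items())
    return score, "non-natural" if score > kappa else "natural"

# Figure 3 example: 10.0 > kappa = 7 -> non-natural.
print(classify_sequence({"ears": 11, "eyes": 9, "nose": 11}, kappa=7))
# Figure 4 example: 29.3 > kappa = 18 -> non-natural.
print(classify_sequence({"ears": 22, "eyes": 33, "nose": 28}, kappa=18))
```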

We tested the Cohn-Kanade (CK+) database (colour images only) [20], FG-Net [33], and LTI-HIT, as shown in Table 2. We confirm that FG-Net is the most natural database. Our classification error was 12% on average.

## **4.1. Description of databases**

We used three databases to test our proposed approach: FG-Net, CK+, and LTI-HIT.

### *4.1.1. Cohn-Kanade (CK+) database*

The CK database was released in 2000 and extended in 2010 as CK+ (the same CK database plus 107 new sequences, 33 of which are in colour) by the groups of Cohn and Kanade [20]. It consists of 123 persons with 7 different expressions per person: anger, contempt, disgust, fear, happiness, sadness, and surprise. The image sequences vary in duration (10 to 60 frames) and run from the onset (a neutral frame) to the peak formation of the facial expression. The database was captured at 30 frames per second with a resolution of 640 × 480 pixels. The final frame of each image sequence was coded using FACS (the Facial Action Coding System), which describes a person's expression in terms of action units (AUs) [20]. Participants were instructed by an experimenter to perform a series of 23 facial displays, including single action units and specified emotion-like expressions of joy, surprise, anger, disgust, fear, and sadness. Each display began and ended with a neutral face.

### *4.1.2. FG-Net database*

In 2006, the Facial Expressions and Emotions database (FG-Net) was created at the Munich Technical University. It is an image database containing face images of a number of persons performing the six basic emotions defined by Ekman & Friesen: anger, fear, surprise, disgust, sadness, and happiness, plus a neutral expression for each person. The database consists of 19 persons with 21 sequences per person, for a total of 399 sequences. The sequences of images were captured at 25 frames per second with a resolution of 640 × 480 pixels [33]. The database was designed to observe people reacting as naturally as possible: instead of telling the person to play a role, real emotions were evoked by playing video clips or still images after a short introductory phase. This is in contrast to the asymmetry of luminance of the face, which includes not only asymmetry introduced by lighting, but also asymmetry introduced by the face itself.

### *4.1.3. LTI-HIT database*

Finally, in 2010, the Children's Hospital of Tamaulipas created the LTI-HIT database. This database is taken from interviews of 52 persons (11 men and 41 women); 5 of them were recorded in one type of environment and 47 in the other. Each interview has between 27 and 33 different questions, and the participants were between 18 and 66 years old. Each sequence of images was captured at 30 frames per second with a resolution of 720 × 480 pixels and a duration of 1.6 to 2.5 minutes per interview. The interviews were conducted in natural conditions.

## **4.2. Remarks**


The three databases differ greatly in their acquisition, illumination, and capture conditions. To weight these databases, we propose the *κ* index according to Table 1. In natural conditions, facial changes on the left side are only about 2% greater than overall right-side changes [30]. We establish a level from 1 to 5 for the personal conditions (hair-style, skin colour, and tolerable static face asymmetry) and, on average, we choose level 3 to model all personal conditions (PC) for our available databases.

The head motion (HM) also differs considerably between the three databases. In CK+ all persons are strictly controlled, and the original sequences were cut to show only the evident expression. The FG-Net sequences, in contrast, show activity before and after an evident expression. Finally, the LTI-HIT sequences were not staged around a main expression: they show the natural reactions of the interviewed persons. We therefore assign the HM variable the values 1, 3, and 4 for the three databases, respectively.

The luminance and capture conditions (LC) also differ: CK+ was created in a studio, FG-Net in a controlled environment, and LTI-HIT in an uncontrolled environment. We therefore assign the LC variable the values 2, 3, and 4, respectively.

| Database (DB) | PC | HM | LC | *κ* index |
|---------------|----|----|----|-----------|
| CK+           | 3  | 1  | 2  | 6         |
| FG-Net        | 3  | 3  | 3  | 9         |
| LTI-HIT       | 3  | 4  | 4  | 11        |

**Table 1.** *κ* index values for each of the three databases. PC, personal conditions (facial expression); HM, head movement; LC, luminance conditions. Each feature lies between 0 and 5.



| DB | Natural (DA) | Natural (SOM) | Non-natural (DA) | Non-natural (SOM) |
|---------|--------|--------|--------|--------|
| CK+     | 42.42% | 27.27% | 33.33% | 63.64% |
| FG-Net  | 33.60% | 56.65% | 39.68% | 29.63% |
| LTI-HIT | 34.55% | 35.37% | 38.70% | 52.51% |

**Table 2.** Natural/non-natural percentages according to the *κ* index. The quantity of active neurons is processed in two ways: statistical analysis of active neurons (DA) and self-organizing maps (SOM). The percentages correspond to the results obtained, with a 12% ± 8% error with respect to two expert responses.

Figure 4 shows another test example for a sequence of images from the FG-Net database. The vertical dotted lines show the asymmetries (shown only for the left side). The horizontal red lines are the quantity ratios, the horizontal green lines are the differences between symmetrical regions, and the horizontal blue lines are the tolerance thresholds obtained from this sequence of images. In this case, the boy is suspected of deceitful clues because all the graphics show a high value (22, 33, and 28 for the three pairs of vertical bands) and the weighted sum, 29.3 = 22 ∗ 0.2 + 33 ∗ 0.5 + 28 ∗ 0.3, is higher than the *κ* = 18.0 index.

**Figure 4.** Results for face asymmetry detection in a sequence of images from FG-Net. In this case, the boy is suspected of deceitful clues.

Table 2 summarizes the ratios obtained for natural and non-natural facial expressions with the statistical analysis and the SOM. The SOM percentages show FG-Net with a greater proportion of natural facial expressions than the other two databases, while CK+ has the greatest proportion of non-natural facial expressions, followed by LTI-HIT.

For the statistical analysis we obtain the opposite results. This difference is due to the concentration and dispersion of the values: the simple statistical analysis separates all responses based on the average and standard deviation, while the SOM finds centroids guided by the density and dispersion.

**Figure 5.** Asymmetry based on the quantity index. The first DA and SOM blocks correspond to natural facial expressions; the last two DA and SOM blocks correspond to non-natural facial expressions.

## **5. Conclusion**


We have presented a bio-inspired approach to processing facial gestures for clustering into natural and non-natural facial expressions. It takes advantage of the early primary visual areas of the human brain for feature extraction. Furthermore, it simulates the integration and discrimination performed by the higher visual areas, considering the symmetrical and asymmetrical measures within the respective time span. The process ends in a bio-inspired neural network for clustering into natural and non-natural facial expressions.

The asymmetry between symmetrical regions of the face, analysed over two seconds, allows us to classify a facial expression as natural or non-natural. The quantity of asymmetries and its relation to the different symmetrical regions has been applied as a new biometric measure [24] (where precision of measurement is necessary); here we apply it to the classification of natural versus non-natural expressions (where such precision is not necessary).

In our proposed bio-inspired architecture, after face detection, an anthropomorphic grid is proposed. For each region, we obtain the ratio of the quantity of active neurons (QAN) derived from the simple and complex neurons modelled by our Gabor-like filters.

The proposed model can work with any database. We tested the FG-Net and Cohn-Kanade databases: more than 63% of the CK+ video sequences contain asymmetries. Since this database was built with simulated (non-natural) expressions, this confirms that our results correctly model the capture, illumination, and personal conditions.


The different conditions of each database do not allow a simple characterization. We therefore proposed an experimental index, *κ*, to model these conditions (personal conditions, head motion, and luminance-capture). In future work, we will determine this index automatically for each database and, more precisely, for each sequence of images within a database.

Using the index *κ*, we have shown that our approach is independent of the database; this independence was verified in our experiments.

The preliminary steps of our approach demonstrate the feasibility of detecting suspected persons in nervous or altered situations. This is the starting point of our methodology for detecting deceitful clues in facial expressions.

## **Author details**

#### Claudio Castellanos Sánchez

*Laboratory of Information Technology, Cinvestav-Tamaulipas and División de Estudios de Posgrado e Investigación, Instituto Tecnológico de Ciudad Victoria, Mexico*

#### Manuel Hernández Hernández

*Departamento de Ingeniería en Sistemas Computacionales, Instituto Tecnológico Superior de Tantoyuca, Mexico*

#### Pedro Luis Sánchez Orellana

*División de Estudios de Posgrado e Investigación, Instituto Tecnológico de Ciudad Victoria, Mexico*

## **6. References**


[1] Adolphs, R., Tranel, D. & Damasio, A. R. [1998]. The human amygdala in social judgment, *Nature* 393(6684): 470–474.

[2] Adolphs, R., Tranel, D., Damasio, H. & Damasio, A. R. [1995]. Fear and the human amygdala, *The Journal of Neuroscience* 15(9): 5879–5891.

[3] Belhumeur, P. N., Hespanha, J. & Kriegman, D. J. [1997]. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 19(7): 711–720.

[4] Castellanos-Sánchez, C. [2005]. *Neuromimetic connectionist model for embedded visual perception of motion*, PhD thesis, Université Henri Poincaré (Nancy I), Nancy, France. Bibliothèque des Sciences et Techniques.

[5] Clatworthy, P., Chirimuuta, M., Lauritzen, J. & Tolhurst, D. [2003]. Coding of the contrasts in natural images by populations of neurons in primary visual cortex (V1), *Vision Research* 43(18): 1983–2001.

[6] del Solar, J. R. & Quinteros, J. [2008]. Illumination compensation and normalization in eigenspace-based face recognition: A comparative study of different pre-processing approaches, *Pattern Recognition Letters* 29(14): 1966–1979.

[7] Devi, B. J., Veeranjaneyulu, N. & Kishore, K. [2010]. A novel face recognition system based on combining eigenfaces with fisher faces using wavelets, *Procedia Computer Science* 2(0): 44–51.

[8] Dornaika, F., Lazkano, E. & Sierra, B. [2011]. Improving dynamic facial expression recognition with feature subset selection, *Pattern Recognition Letters* 32(5): 740–748.

[9] Ekman, P. [1980]. Asymmetry in facial expression, *Science* 209(4458): 833–834.

[10] Essa, I. A. & Pentland, A. P. [1997]. Coding, analysis, interpretation, and recognition of facial expressions, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 19: 757–763.

[11] Fasel, B. & Luettin, J. [2003]. Automatic facial expression analysis: a survey, *Pattern Recognition* 36(1): 259–275.

[12] Fellenz, W., Taylor, J., Tsapatsoulis, N. & Kollias, S. [1999]. Comparing template-based, feature-based and supervised classification of facial expressions from static images, *Proceedings of Circuits, Systems, Communications and Computers*, pp. 5331–5336.

[13] Gosselin, F., Spezio, M. L., Tranel, D. & Adolphs, R. [2011]. Asymmetrical use of eye information from faces following unilateral amygdala damage, *Social Cognitive and Affective Neuroscience* 6(3): 330–337.

[14] Gu, W., Xiang, C., Venkatesh, Y., Huang, D. & Lin, H. [2012]. Facial expression recognition using radial encoding of local Gabor features and classifier synthesis, *Pattern Recognition* 45(1): 80–91.

[15] Hasselmo, M. E., Rolls, E. T. & Baylis, G. C. [1989]. The role of expression and identity in the face-selective responses of neurons in the temporal visual cortex of the monkey, *Behavioural Brain Research* 32(3): 203–218.

[16] Haxby, J. V., Hoffman, E. A. & Gobbini, M. I. [2000]. The distributed human neural system for face perception, *Trends in Cognitive Sciences* 4(6): 223–233.

[17] Hung, Y., Smith, M. L., Bayle, D. J., Mills, T., Cheyne, D. & Taylor, M. J. [2010]. Unattended emotional faces elicit early lateralized amygdala-frontal and fusiform activations, *NeuroImage* 50(2): 727–733.

[18] Kim, H.-C., Kim, D., Bang, S.-Y. & Lee, S.-Y. [2004]. Face recognition using the second-order mixture-of-eigenfaces method, *Pattern Recognition* 37(2): 337–349.

[19] Lanitis, A., Taylor, C. J. & Cootes, T. F. [1997]. Automatic interpretation and coding of face images using flexible models, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 19: 743–756.

[20] Lucey, P., Cohn, J., Kanade, T., Saragih, J., Ambadar, Z. & Matthews, I. [2010]. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, *Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on*, pp. 94–101.

[21] Maalej, A., Amor, B. B., Daoudi, M., Srivastava, A. & Berretti, S. [2011]. Shape analysis of local facial patches for 3D facial expression recognition, *Pattern Recognition* 44(8): 1581–1589.

[22] Marcelja, S. [1980]. Mathematical description of the responses of simple cortical cells, *Journal of the Optical Society of America A* 70(11): 1297–1300.

[23] Mitra, S. & Acharya, T. [2007]. Gesture recognition: A survey, *IEEE Transactions on Systems, Man and Cybernetics - Part C* 37(3): 311–324.

[24] Mitra, S. & Savvides, M. [2005]. Analyzing asymmetry biometric in the frequency domain for face recognition, *IEEE ICASSP* II: 953–956.

[25] Nair, B. M., Foytik, J., Tompkins, R., Diskin, Y., Aspiras, T. & Asari, V. [2011]. Multi-pose face recognition and tracking system, *Procedia Computer Science* 6(0): 381–386.

[26] Park, S. & Kim, D. [2009]. Subtle facial expression recognition using motion magnification, *Pattern Recognition Letters* 30(7): 708–716.

[27] Phillips, M. L., Young, A. W., Senior, C., Brammer, M., Andrews, C., Calder, A. J., Bullmore, E. T., Perrett, D. I., Rowland, D., Williams, S. C. R., Gray, J. A. & David, A. S. [1997]. A specific neural substrate for perceiving facial expressions of disgust, *Nature* 389(6650): 174–194.

[28] Rowley, H. A., Baluja, S. & Kanade, T. [1998]. Neural network-based face detection, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 20: 23–38.

[29] Sánchez, A., Ruiz, J., Moreno, A., Montemayor, A., Hernández, J. & Pantrigo, J. [2011]. Differential optical flow applied to automatic facial expression recognition, *Neurocomputing* 74(8): 1272–1282. Selected papers from the 3rd International Work-Conference on the Interplay between Natural and Artificial Computation (IWINAC 2009).

[30] Schmidt, K. L., Liu, Y. & Cohn, J. F. [2006]. The role of structural facial asymmetry in asymmetry of peak facial expressions, *Laterality* 11(6): 540–561.

[31] Venkatesh, Y., Kassim, A. A. & Murthy, O. R. [2009]. A novel approach to classification of facial expressions from 3D-mesh datasets using modified PCA, *Pattern Recognition Letters* 30(12): 1128–1137.

[32] Vuilleumier, P. & Pourtois, G. [2007]. Distributed and interactive brain mechanisms during emotion face perception: Evidence from functional neuroimaging, *Neuropsychologia* 45(1): 174–194.

[33] Wallhoff, F. [2006]. Facial expressions and emotion database. URL: *http://www.mmk.ei.tum.de/˜waf/fgnet/feedtum.html*

[34] Wang, T.-H. & Lien, J.-J. J. [2009]. Facial expression recognition system based on rigid and non-rigid motion separation and 3D pose estimation, *Pattern Recognition* 42(5): 962–977.

[35] Weymar, M., Law, A., Ohman, A. & Hamm, A. O. [2011]. The face is more than its parts: Brain dynamics of enhanced spatial attention to schematic threat, *NeuroImage* 58(3): 946–954.

[36] Whalen, P. J., Rauch, S. L., Etcoff, N. L., McInerney, S. C., Lee, M. B. & Jenike, M. A. [1998]. Masked presentations of emotional facial expressions modulate amygdala activity without explicit knowledge, *The Journal of Neuroscience* 18(1): 411–418.

[37] Williams, M. A., McGlone, F., Abbott, D. F. & Mattingley, J. B. [2005]. Differential amygdala responses to happy and fearful facial expressions depend on selective attention, *NeuroImage* 24(2): 417–425.

[38] Willmore, B. D., Bulstrode, H. & Tolhurst, D. J. [2012]. Contrast normalization contributes to a biologically-plausible model of receptive-field development in primary visual cortex (V1), *Vision Research* 54(0): 49–60.

[39] Xiang, T., Leung, M. & Cho, S. [2008]. Expression recognition using fuzzy spatio-temporal modeling, *Pattern Recognition* 41(1): 204–216.




## **Vision as a Fundamentally Statistical Machine**

Zhiyong Yang


Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/50165

## **1. Introduction**

As a vital driving force of systems neuroscience, visual neuroscience had its conceptual framework established more than 40 years ago, based on Hubel and Wiesel's groundbreaking work on the receptive-field properties of visual neurons (Hubel & Wiesel, 1977). This framework was subsequently strengthened by David Marr's influential book (Marr, 2010). In this paradigm, visual neurons are conceived to perform bottom-up, image-based processing to build a series of symbolic representations of visual stimuli. This paradigm, however, is deeply misleading, since the generative sources in the three-dimensional (3**D**) physical world of any stimulus, to which visual animals must respond successfully, cannot be determined by image-based processing (due to the inverse optics problem). This is perhaps the reason why "Now, thirty years later, the main problems that occupied Marr remain fundamental open problems in the study of perception" (Marr, 2010), as assessed by two prominent vision scientists and Marr's close associates.

During the last 30 years, dramatic progress in computing hardware, digital imaging, statistical modeling, and visual neuroscience has prompted researchers to re-examine the computations and representations (see above) for natural vision examined in Marr's book. A range of new ideas have been proposed, many of which are summarized in books (Knill & Richards, 1996; Rao et al., 2002; Purves & Lotto, 2003; Doya et al., 2007; Trommershauser et al., 2011) and reviews (Simoncelli & Olshausen, 2001; Yuille & Kersten, 2006; Geisler, 2008; Friston, 2010). The unifying theme is that vision and visual system structure and function must be understood in statistical terms. How this feat can be achieved, however, is not at all clear.

Since humans and other visual animals must respond successfully to visual stimuli whose generative sources cannot be determined in any direct way, the visual system can only generate percepts according to the probability distributions (**PDs**) of visual variables underlying the stimuli. The information pertinent to the generation of these PDs, namely, the statistics of natural visual environments, must have been incorporated into the visual circuitry by successful behavior in the world over evolutionary and developmental time.


During the last two decades, this statistical concept of vision has been successful in explaining aspects of vision that would be difficult to understand otherwise (see references cited above). In this chapter, I will describe several recent studies that relate the statistics of 2D and 3D natural visual scenes to visual percepts of brightness, saliency, and 3D space.


In the second section of this chapter, I will discuss how the PDs of luminance in specific contexts in natural scenes, referred to as the context-mediated PDs in natural scenes, predict brightness, the perception elicited by the luminance of a visual target. Our results show that brightness generated on this statistical basis accounts for a range of observations whose causes have long been debated without consensus. In the third section, I will present a simple, elegant model of the context-mediated PDs in natural scenes and a measure of visual saliency derived from these PDs. Our results show that this measure of visual saliency is a good predictor of human gaze in free-viewing of both static and dynamic natural scenes. In the fourth section, I will present the statistics of 3D natural scenes and their relationship to human visual space. Our results show that human visual space is not a direct mapping of the 3D physical space but rather is generated probabilistically. Finally, I will discuss the implications of these and other results for our understanding of the response properties of visual neurons, the intricate visual circuitries, the large-scale cortical organizations, the operational dynamics of the visual system, and natural vision.

## **2. The statistical structure of natural light patterns determines perceived light intensity**

## **2.1. Introduction**

In this section, I present evidence that the context-mediated PDs of luminance in natural scenes predict brightness, the perception elicited by the luminance of a visual target. A central puzzle in understanding how such percepts are generated by the visual system is that brightness does not correspond in any simple way to luminance. Thus, the same amount of light arising from a given region in a scene can elicit dramatically different brightness percepts when presented in different contexts (Fig. 1) (Kingdom, 2011). For example, in Fig. 1 (a), the central square (T) in the left panel appears brighter than the same target in the right panel. This is the standard simultaneous brightness contrast effect.

A variety of explanations have been suggested since the basis for such phenomena was first debated by Helmholtz, Hering, Mach, and others (Gilchrist et al., 1999; Purves et al., 2004; Kingdom, 2011). Although lateral inhibition in early visual processing has often been proposed to account for these "illusions", this mechanism cannot explain instances in which similar overall contexts produce different brightness effects (compare Fig. 1 (a) with Figs. 1 (b) and (e); see also Fig. 1 (c)). This failure has led to several more recent suggestions, including complex filtering (Blakeslee & McCourt, 2004), the idea that brightness depends on detecting edges and junctions that promote the grouping of various luminances into interpretable spatial arrangements (Adelson, 2000; Anderson & Winawer, 2005), and the proposal that brightness is "re-synthesized" from 3D scene properties "inferred" from the stimulus (Wishart et al., 1997).

**Figure 1.** The influence of spatial patterns of luminance on the apparent brightness of a target (the targets [T] in each stimulus are equiluminant, and are indicated in the insets on the right). (a), Standard simultaneous brightness contrast effect. The central square in the dark surround (left panel) appears brighter than the equiluminant square in the light surround (right panel). (b), White's illusion. Although the gray rectangles in the left panel are all equiluminant, the ones surrounded by the generally lighter context on the left appear brighter than those surrounded by the generally darker context on the right. When, however, the luminance of the target rectangles is the lowest (middle panel) or highest (right panel) value in the stimulus, the targets in the generally lighter context (on the left in the middle and right panels) appear less bright than the ones in the generally darker context (called the "inverted White's effect"). (c), Wertheimer-Benary illusion. The triangle embedded in the arm of the black cross appears brighter than the one that abuts the corner of the cross. The slightly different brightness of the equiluminant triangles is maintained whether the presentation is upside down (middle panel), or reflected along the diagonal (right panel). (d), The intertwined cross illusion. The target on the left appears substantially brighter than the equiluminant target on the right. (e), The inverted T-illusion. The inverted T-shape on the left appears somewhat brighter than the equiluminant target on the right (modified from Yang & Purves, 2004).

## **2.2. Context-mediated PDs of luminance in natural scenes**

To examine whether the statistics of natural light patterns predict the perceptual phenomena shown in Fig. 1, we obtained the relevant PDs of luminance in natural scenes by sampling a database of natural scenes (van Hateren & van der Schaaf, 1998) with target-surround configurations that had the same local geometry as the stimuli in Fig. 1. As a first step, these configurations were superimposed on the images to find light patterns in which the luminance values of both the surround and target regions were approximately homogeneous; for those configurations in which the surround comprised more than one region of the same luminance (see Fig. 1), we also required that the relevant sampled regions meet this criterion. The sampling configurations were moved in steps of one pixel to screen the full image. The mean luminance values of the target and the surrounding regions in the samples were then calculated, and their occurrences tallied.
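A simplified version of this sampling procedure, for a plain square target-in-surround configuration, might look as follows. The homogeneity criterion (a coefficient of variation below a tolerance) and the toy input image are our assumptions, not the published implementation:

```python
import numpy as np

def sample_target_surround(image, t=8, s=24, tol=0.05, step=1):
    """Slide a square target (t x t) centered in a square surround (s x s)
    over a luminance image, keeping locations where target and surround are
    each approximately homogeneous (std/mean below tol, our assumption).
    Returns an array of (target_mean, surround_mean) pairs.
    """
    pairs = []
    off = (s - t) // 2
    for y in range(0, image.shape[0] - s, step):
        for x in range(0, image.shape[1] - s, step):
            patch = image[y:y + s, x:x + s]
            target = patch[off:off + t, off:off + t]
            mask = np.ones_like(patch, dtype=bool)
            mask[off:off + t, off:off + t] = False   # surround = patch minus target
            surround = patch[mask]
            if (target.std() < tol * target.mean()
                    and surround.std() < tol * surround.mean()):
                pairs.append((target.mean(), surround.mean()))
    return np.array(pairs)

# Nearly homogeneous toy image standing in for a calibrated natural image
# from the van Hateren database.
img = 50.0 + np.random.default_rng(2).uniform(-1.0, 1.0, (128, 128))
print(sample_target_surround(img).shape)
```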

## **2.3. Brightness signifies context-mediated PDs of luminance in natural scenes**

Natural environments comprise objects of different sizes at various distances that are related to each other and the observer in a variety of ways (Yang & Purves, 2003a,b). When the light arising from objects is projected onto an image plane, these complex relationships are transformed into 2D patterns of light intensity with highly structured statistics. Thus, the PD of the luminance of, say, the central target in a standard simultaneous brightness contrast stimulus (Fig. 2 (a)) depends on the surrounding luminance values (Fig. 2 (b)).


Fig. 2 (c) illustrates the supposition that, for any context, the visual system generates the brightness of a target according to the value of its luminance in the probability distribution function (PDF, the integral of the PD) of the possible target luminance experienced in that context (Yang & Purves, 2004). This value is referred to subsequently as the percentile of the target luminance among all possible luminance values that co-occur with the contextual luminance pattern in the natural environment. In formal terms, this supposition means that the visual system generates brightness percepts according to the relationship Brightness = *α f*(*P*) + *β*, where *α* and *β* are constants and *f*(*P*) is a monotonically increasing function of the PDF, *P*.
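The percentile computation itself is straightforward; the sketch below ranks a target luminance within an empirical sample of luminances that co-occur with a given context. The two context distributions are invented stand-ins, not the published PDs:

```python
import numpy as np

def brightness_percentile(target_lum, cooccurring_lums):
    """Percentile rank of a target luminance within the empirical distribution
    of target luminances that co-occur with a given surround context."""
    lums = np.sort(np.asarray(cooccurring_lums))
    return 100.0 * np.searchsorted(lums, target_lum, side="right") / lums.size

# Invented stand-in distributions: target luminances co-occurring with dark
# vs. light surrounds in natural scenes.
rng = np.random.default_rng(3)
dark_context = rng.gamma(2.0, 10.0, 10000)
light_context = rng.gamma(4.0, 10.0, 10000)
# The same 40 cd/m^2 target ranks higher in the dark context, i.e. it should
# look brighter there, as in the simultaneous brightness contrast effect.
print(brightness_percentile(40.0, dark_context))
print(brightness_percentile(40.0, light_context))
```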

**Figure 2.** Brightness percepts signify context-mediated PDs of luminance in natural scenes. (a), The brightness elicited by a given target luminance in any context depends on the frequency of occurrence of that luminance relative to all the possible target luminance values experienced in that context in natural environments. This concept is illustrated here using the standard simultaneous brightness contrast stimulus in Fig. 1 (a). The series of squares with different luminance values indicate all the possible occurrences of luminance in the target (T) in the two different contexts; the symbol indicates the relationship of a particular occurrence of luminance to all possible occurrences of target luminance values experienced in the two contexts in natural environments. (b), This statistical relationship can be derived from the PD of target luminance values co-occurring with the luminance values and pattern of the two contexts of interest. The red and blue curves indicate the PDs of the luminance of the targets in (a), obtained by sampling the natural image database. The size of the sampling configuration was 1°×1°. (c), The brightness elicited by the luminance of the targets in (a) is based on the percentile of that luminance in the PDFs for the two different contexts, which are indicated by the icons (modified from Yang & Purves, 2004).

By definition, then, the percentile of target luminance for the lowest luminance value within any contextual light pattern is 0% and corresponds to the perception of maximum darkness; the percentile for the highest luminance within any contextual pattern is 100% and corresponds to the maximum perceivable brightness. In any given context, a higher luminance will always have a higher percentile, and will always elicit a perception of greater brightness compared to any luminance that has a lower percentile. Since the relation Brightness = *α f*(*P*) + *β* is not based on a particular luminance within the context in question, but rather on the entire PD of possible luminance values experienced in that context, the context-dependent relationship between brightness and luminance is highly nonlinear (see Fig. 2 (c)). In consequence, the same physical difference between two luminance values will often signify different percentile differences, and thus perceived differences in brightness. Furthermore, because the percentiles change more rapidly as the target luminance approaches the luminance of the surround, one would expect greater changes of brightness, an expectation that corresponds to the well-known "crispening" effect in perception.

Finally, because the same value of target luminance will often correspond to different percentiles in the PDFs of target luminance in different contexts, two targets having the same luminance can elicit different brightness percepts, the higher percentile always corresponding to a brighter percept. Thus, in the standard simultaneous brightness contrast stimulus in Fig. 1 (a), the target (T) in the left panel in Fig. 2 (a) appears brighter than the equiluminant target in the right panel.

## **2.4. White's illusion**


White's illusion (Fig. 1 (b)) presents a particular challenge for any explanation of brightness (White, 1979). The equiluminant rectangular areas surrounded by predominantly more luminant regions in the stimulus (on the left in the left panel of Fig. 1 (b); see also the area in the red frame in Fig. 3 (a)) appear brighter than areas of identical luminance surrounded by less luminant regions (on the right in the left panel of Fig. 1 (b); see also the area in the blue frame in Fig. 3 (a)). The especially perplexing characteristic of this percept is that the effect is opposite to that elicited by standard simultaneous brightness contrast stimuli (Fig. 1 (a)). Even more puzzling, the effect reverses when the luminance of the rectangular targets is either the lowest or highest value in the stimulus (middle and right panels in Fig. 1 (b)).

The explanation for White's illusion provided by the statistical framework outlined above is shown in Fig. 3. When presented separately, as in Fig. 3 (a), the components of White's stimulus elicit much the same effect as in the usual presentation. By sampling the images of natural visual environments using configurations based on these components (Fig. 3 (b)), we obtained the PDFs of the luminance of a rectangular target (T) embedded in the two different configurations of surrounding luminance in White's stimulus. As shown in Fig. 3 (c), when the target in the intermediate range of luminance values (i.e., in between the luminance values at the two crossover points) abuts two dark rectangles laterally (left panel in Fig. 3 (b)), the percentile of the target luminance (red line) is higher than the percentile when the target abuts the two light rectangles (right panel in Fig. 3 (b); blue line in Fig. 3 (c)). If, as we suppose, the percentile in the PDF of target luminance within any specific context determines the brightness perceived, the target with an intermediate luminance on the left in Fig. 3 (b) should appear brighter than the equiluminant target on the right. Finally, when all the luminance values in the stimulus are limited to a very narrow range (e.g., from 0 to 100 cd/m2 or from 1000 to 1100 cd/m2), when the sampling configurations are orientated vertically, or when the aspect ratio of the sampling configurations is changed (e.g., from 1:2 to 1:5), the PDFs derived from the database are not much different. These further results are consistent with the observations that White's stimulus elicits a similar effect when presented at a wide range of overall luminance levels, in a vertical orientation, or with different aspect ratios.


(a), The usual presentation of White's illusion; boxed areas indicate the basic components of the stimulus, which when separated elicit about the same effect as the usual presentation. (b), The sampling configurations used to obtain the PDFs of target luminance (the red and blue rectangles), given a pattern of surrounds with luminance values Lu and Lv (the size of the sampling configuration in this example was 0.6° [H] x 0.3° [V] and the unit of luminance was cd/m2). (c), The PDFs of the luminance of the targets in these contexts (red curve: Lu=145, Lv=105; blue curve: Lu=105, Lv=145). For the middle luminance values lying within the two crossover points (at ~105 and 145), the red curve is above the blue curve; as a result, the luminance configurations in (b) generate White's illusion (as indicated by the insets). For other luminance values of the target, the blue curve is above the red curve; as a result, the luminance configurations in (b) generate the inverted White's effect (modified from Yang & Purves, 2004).

**Figure 3.** Statistical explanation of White's illusion.

An aspect of White's illusion that has been particularly difficult to explain is the so-called "inverted White's effect": when the target luminance is either the lowest or the highest value in the stimulus, the effect is actually opposite the usual percept (see the middle and right panels in Fig. 1 (b)). The explanation for this further anomaly is also evident in Fig. 3 (c). When the target luminance is the lowest value in the presentation (see insets), the blue curve is above the red curve. As a result, a relatively dark target surrounded by more light area should now appear darker, as it does (see also the middle panel of Fig. 1 (b)). By the same token, when the target luminance is the highest value in the stimulus (see insets), the blue curve is also above the red curve. Accordingly, the relatively light target surrounded by more dark area should appear brighter, as it does (see also the right panel of Fig. 1 (b)). Thus the statistical structure of natural light patterns predicts not only White's illusion, but the inverted White's effect as well. Notice further that the two crossover points of the blue and red curves shift to the right when the contextual luminances increase, and to the left when they decrease; thus the inverted effect will be apparent, although altered in magnitude, for any luminance values of the surrounding areas.

## **2.5. The Wertheimer-Benary illusion**

In the Wertheimer-Benary illusion (Fig. 1(c)), the equiluminant gray triangles appear differently bright, the triangle embedded in the arm of the cross looking slightly brighter than the triangle in the corner of the cross.

The explanation of the Wertheimer-Benary illusion provided by the statistical framework outlined above is shown in Fig. 4. By sampling the images of natural environments using configurations based on the components of the stimulus (Fig. 4 (b)), we obtained the PDFs of target luminance in these contexts. As shown in Fig. 4 (c), when the triangular patch is embedded in a dark bar with its base facing a lighter area, the percentile of the luminance of the triangular patch (red line) is higher than the percentile when the triangular patch abuts a dark corner with its base facing a similar light background (blue line). Accordingly, the same gray patch should appear brighter in the former context than in the latter, as is the case. The PDFs obtained after changing the triangles to rectangles, rotating the configurations in Fig. 4 (b) by 180°, or reflecting the configurations along the diagonal of the cross (cf. middle and right panels in Fig. 1 (c)) were much the same as those shown in Fig. 4 (c). These several observations accord with the fact that the Wertheimer-Benary effect is little changed by such manipulations.

(a), The usual presentation of the Wertheimer-Benary stimulus. The separated components of the stimulus (boxed areas) elicit about the same effect as the usual presentation. (b), Configurations used to sample the database (size = 0.4°x0.4°). (c), The PDFs of target luminance, given the surrounding luminances in (b) (the unit of luminance was cd/m2). The red curve corresponds to the condition shown in the left panel in (b) (Lu=205, Lv=45), and the blue curve to the condition shown in the right panel in (b) (Lu=45, Lv=205) (modified from Yang & Purves, 2004).

**Figure 4.** Statistical explanation of the Wertheimer-Benary illusion.

The context-mediated PDs of luminance in natural scenes predict equally well other brightness phenomena shown in Fig. 1 (Yang & Purves, 2004).

## **2.6. Discussion**


### *2.6.1. The statistical nature of perception*

I showed that brightness percepts do not encode luminance as such, but rather the statistical relationship between the luminance in an area within a particular contextual light pattern and all possible occurrences of luminance in the context that have been experienced by humans in natural environments during evolution. The statistical basis for this aspect of visual perception is quite different from traditional approaches to rationalizing brightness. In the "relational approach" (Gilchrist et al., 1999), an idea that evolved from the late 19th-century debate between Helmholtz, Hering, and others, brightness percepts are "recovered" by the visual system from explicitly coded luminance contrasts and gradients. Another idea is that brightness depends on intermediate-level visual processes that detect edges, gradients and junctions, which are then grouped into specific spatial layouts (Adelson, 2000; Anderson & Winawer, 2005). Finally, the brightness elicited by a given luminance has also been considered as being "re-synthesized" by processing at several levels of the visual system that is based on inferences about the possible arrangements of surfaces in 3D, their material properties and their illumination (Wishart et al., 1997).


The common deficiency of these several ways of thinking about brightness is their failure to relate the statistics of light patterns experienced in the course of evolution to what the corresponding brightness percepts need to signify (namely, the relationship of a particular occurrence of luminance to all possible occurrences of luminance in a given context). Since light patterns on the retina are the only information the visual system receives, basing brightness percepts on the statistics of natural light patterns allows visual animals to deal optimally with all possible natural occurrences of luminance, employing the full range of perceivable brightness to represent the physical world.

### *2.6.2. Neural instantiation of context-mediated PDs of luminance in natural scenes*

What sort of neural mechanisms, then, could incorporate these statistics of natural light patterns and relate them to brightness percepts? Although the answer is not known, the present results suggest that the circuitry at all levels of the visual system instantiates the statistical structures of light patterns in natural environments. In this conception, the center-surround organization of the receptive fields of retinal ganglion cells provides the initial basis for representing the necessary statistics. A further speculation would be that neural circuitry at the level of visual cortex is organized to instantiate the statistics of luminance patterns with arbitrary target and context shapes and sizes. As a result, the neuronal response at each location would signify the percentile of the target luminance in the PDF pertinent to a given context.

## **3. Visual saliency emerging from context-mediated PDs in natural scenes**

## **3.1. Introduction**

In this section, I present a simple model of the context-mediated PDs in natural scenes and derive a measure of visual saliency from these PDs. Visual saliency is the perceptual quality that makes some items in visual scenes stand out from their immediate contexts (Itti & Koch, 2001). Visual saliency plays important roles in natural vision in that saliency can direct eye movements and facilitate object detection and scene understanding. We developed a model of the context-mediated PDs in natural scenes using a modified algorithm for independent component analysis (**ICA**) (Hyvarinen, 1999) and demonstrated that visual saliency based on the context-mediated PDs in natural scenes is a good predictor of human gaze in free-viewing both static and dynamic natural scenes (Xu et al., 2010).

## **3.2. Context-mediated PDs in natural scenes and visual saliency**


A visual feature is a random variable and co-occurs with other visual features in natural scenes with certain probabilities. We call these the context-mediated PDs in natural scenes. Here, a context refers to the natural scene patch that co-occurs with a visual target in question in the space and/or time domains. We proposed to represent the context-mediated PDs in natural scenes using independent components (**ICs**) of natural scenes. There are two reasons for this. First, it has been argued extensively that the early visual cortex represents incoming stimuli in an efficient manner (Simoncelli & Olshausen, 2001). Second, the filters of the ICs of natural scenes are very much like the receptive fields of simple cells in V1 (van Hateren & van der Schaaf, 1998).

To model the context-mediated PDs in static natural scenes, we used a center-surround configuration in which the scene patch within the circular center serves as the target and the scene patch in the annular surround as the context (Xu et al., 2010). We sampled a large number of scene patches from the McGill calibrated color image database of natural scenes (Olmos & Kingdom, 2004). Thus, each sample is a pair of a patch in the center (*Xc*) and a patch in the surrounding area (*Xs*) (Fig. 5 (a)). We developed a model of natural scenes in this configuration (Eq. (1)). In Eq. (1), *As*, *Ac*, and *Asc* are ICs. This model allows us to calculate the ICs for the context first and then the other ICs of natural scenes.

$$
\begin{bmatrix} X\_s \\ X\_c \end{bmatrix} = \begin{bmatrix} A\_s & 0 \\ A\_{sc} & A\_c \end{bmatrix} \begin{bmatrix} \mathcal{U}\_s \\ \mathcal{U}\_{sc} \end{bmatrix} \tag{1}
$$

The ICA filters (i.e., *Ws*, *Wsc*, and *Wc*) can be obtained as follows

$$
\begin{bmatrix} \mathcal{U}\_{s} \\ \mathcal{U}\_{sc} \end{bmatrix} = \begin{bmatrix} W\_{s} & 0 \\ W\_{sc} & W\_{c} \end{bmatrix} \begin{bmatrix} X\_{s} \\ X\_{c} \end{bmatrix} \tag{2}
$$

Therefore, we obtained three sets of ICs. First, the columns of *As* are the ICs for *Xs* . Second, the columns of *Asc* are the ICs for *Xc* that are paired with the ICs for *Xs* . Finally, the columns of *Ac* are the ICs for *Xc* that are not paired with any ICs for *Xs* .
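The block-triangular structure of Eqs. (1) and (2) can be checked with a few lines of numpy. In this sketch the unmixing blocks *Ws*, *Wsc*, and *Wc* are random stand-ins for the matrices a modified ICA fit would produce; the point is only the algebra relating the mixing blocks (*As*, *Asc*, *Ac*) to the unmixing blocks.

```python
import numpy as np

rng = np.random.default_rng(1)
ds, dc = 6, 4                      # toy dimensions of surround and center patches
Ws = rng.normal(size=(ds, ds))     # stand-ins for fitted unmixing blocks
Wsc = rng.normal(size=(dc, ds))
Wc = rng.normal(size=(dc, dc))

# Eq. (2): Us depends on the surround alone; Usc mixes surround and center.
def sources(Xs, Xc):
    return Ws @ Xs, Wsc @ Xs + Wc @ Xc

# Eq. (1): the mixing matrix [[As, 0], [Asc, Ac]] is the block-triangular
# inverse of the unmixing matrix [[Ws, 0], [Wsc, Wc]].
As = np.linalg.inv(Ws)             # ICs for Xs
Ac = np.linalg.inv(Wc)             # ICs for Xc not paired with any IC for Xs
Asc = -Ac @ Wsc @ As               # ICs for Xc paired with the ICs for Xs

Xs, Xc = rng.normal(size=ds), rng.normal(size=dc)
Us, Usc = sources(Xs, Xc)
assert np.allclose(As @ Us, Xs)                 # first row of Eq. (1)
assert np.allclose(Asc @ Us + Ac @ Usc, Xc)     # second row of Eq. (1)
```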

Fig. 5 (b) shows paired chromatic ICs for *Xc* and *Xs*. Fig. 5 (c) shows paired achromatic ICs for *Xc* and *Xs*. The chromatic ICs for the surround have red-green (L-M) or blue-yellow [S-(L+M)] opponency. The chromatic paired ICs for the center are extensions of the ICs for the surround. Fig. 5 (d) shows the ICs for *Xc*, including chromatic and achromatic ICs, that are not paired with any ICs for *Xs*. Fig. 5 (e) shows examples of the ICs for the center computed alone.

To obtain the context-mediated PDs in dynamic natural scenes, we used sequences of image patches in which the current frame served as the target and the three preceding frames as the context. We sampled a large number of sequences of image patches (~490,000) from a video database (Itti & Baldi, 2009) and performed the ICA according to Eq. (1). Fig. 6 (a) shows the paired chromatic spatiotemporal ICs. Fig. 6 (b) shows the paired achromatic spatiotemporal ICs. Fig. 6 (c) shows the unpaired ICs for the current frame, which are oriented bars and have red-green or blue-yellow opponency.


(a), Samples of patches of natural scenes. Each central patch in the left panel is paired with a surrounding patch in the right panel in the same raster order. (b), Examples of paired chromatic center and surround ICs. ICs in the panels are paired in the same raster order. (c), Examples of paired achromatic center and surround ICs. (d), Examples of unpaired center ICs. (e), Examples of the ICs for the center computed alone (modified from Xu et al., 2010).

**Figure 5.** ICs of color images of natural scenes.

The context-mediated PDs of natural scenes, i.e., the conditional PDs *P*(*Xc* | *Xs*), can be derived using the Bayesian formula as follows

$$P(X\_c \mid X\_s) = \frac{P(X\_c, X\_s)}{P(X\_s)} \propto \frac{P(U\_s)P(U\_{sc})}{P(U\_s)} = \prod\_i P(u\_{sc}^i) \tag{3}$$

where *u<sub>sc</sub><sup>i</sup>* is the amplitude of the *i*th unpaired IC for *Xc*. Therefore, the context-mediated PDs depend only on the unpaired ICs for *Xc*. We modeled *P*(*u<sub>sc</sub><sup>i</sup>*) as generalized Gaussian PDs.
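As a concrete illustration of this modeling step, the following sketch fits a generalized Gaussian to a sample of IC amplitudes by classical moment matching: the ratio (E|u|)² / E[u²] depends only on the shape parameter, so it can be solved for first. This is one standard estimator for such PDs; the original analysis may have used a different fitting procedure.

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def fit_generalized_gaussian(u):
    """Moment-matching fit of p(u) = beta / (2 * alpha * Gamma(1/beta)) *
    exp(-(|u| / alpha)**beta). The ratio (E|u|)**2 / E[u**2] depends only
    on beta, so beta is solved for first and alpha recovered afterwards."""
    u = np.asarray(u)
    r = np.mean(np.abs(u)) ** 2 / np.mean(u ** 2)
    ratio = lambda b: gamma(2.0 / b) ** 2 / (gamma(1.0 / b) * gamma(3.0 / b))
    beta = brentq(lambda b: ratio(b) - r, 0.2, 10.0)
    alpha = np.mean(np.abs(u)) * gamma(1.0 / beta) / gamma(2.0 / beta)
    return alpha, beta

# The Laplacian is the beta = 1 special case, so the fit should recover it:
rng = np.random.default_rng(0)
print(fit_generalized_gaussian(rng.laplace(scale=1.0, size=50000)))  # ~ (1.0, 1.0)
```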


(a), Selected 28 red/green or blue/yellow paired ICs. (b), Selected 78 paired bright/dark ICs. (c), Examples of unpaired target ICs (modified from Xu et al., 2010).

**Figure 6.** ICs of natural moving scenes.


We proposed a measure of visual saliency as

$$S = \ln P\_{\text{max}}(X\_c \mid X\_s) - \ln P(X\_c \mid X\_s) \tag{4}$$

Substituting Eq. (3) into Eq. (4), we have

$$S = \sum\_{i} \ln P\_{\text{max}}(u\_{sc}^{i}) - \sum\_{i} \ln P(u\_{sc}^{i}) \tag{5}$$

where *P*<sub>max</sub>(*Xc* | *Xs*) is the maximum probability of a target, *Xc*, that co-occurs with a context, *Xs*, in natural scenes. Thus, if the probability of the occurrence of a target is low relative to that of the most likely occurrence in the context in natural scenes, the target is salient within the context (Fig. 7).
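Because the normalizing constant of a generalized Gaussian cancels between the two terms of Eq. (5), and its density peaks at zero amplitude, the saliency measure reduces to a simple closed form. The sketch below illustrates this; the filter amplitudes and the fitted parameters alpha and beta are hypothetical values, not the fitted database statistics.

```python
import numpy as np

def saliency(u, alpha, beta):
    """Eq. (5) with p(u) proportional to exp(-(|u|/alpha)**beta): the
    normalizer cancels and the maximum probability occurs at u = 0, so
    S = sum_i (|u_i| / alpha_i) ** beta_i."""
    u, alpha, beta = map(np.asarray, (u, alpha, beta))
    return np.sum((np.abs(u) / alpha) ** beta, axis=-1)

# Hypothetical amplitudes of eight unpaired center ICs at two locations:
# improbably large responses mark the salient location.
alpha = np.full(8, 1.0)
beta = np.full(8, 0.7)      # heavy-tailed, as fitted IC amplitudes tend to be
u_odd = np.array([4.1, 0.2, 3.5, 0.1, 0.3, 2.8, 0.2, 0.1])
u_typical = np.full(8, 0.2)
print(saliency(u_odd, alpha, beta) > saliency(u_typical, alpha, beta))  # True
```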


(a), An image patch with a salient feature at the center (left), the probabilities of the ICs (middle), and the PD of the IC that has the smallest probability (right). (b), An image patch with a non-salient feature at the center (left), the probabilities of the ICs (middle), and the PD of the IC that has the smallest probability (right). In (a) and (b), the red circles indicate the probability of the amplitude of the IC of the features in the centers (modified from Xu et al., 2010).

**Figure 7.** Visual saliency based on the context-mediated PDs in natural scenes.

## **3.3. Visual saliency and human gaze in free-viewing static natural scenes**

Human gaze in free-viewing natural scenes is probably driven by visual saliency in natural scenes. To test this hypothesis, we used a dataset of human gaze collected from 20 human subjects in free-viewing 120 images (Bruce & Tsotsos, 2009). Fig. 8 shows the saliency maps based on the context-mediated PDs in natural scenes and the density maps of human gaze for six scenes. The saliency maps based on the information maximization (**AIM**) model are also shown (Bruce & Tsotsos, 2009). Evidently, the salient features and objects in these scenes predicted by the saliency maps accord with human observations, and the saliency maps predicted by our model qualitatively match the density maps of human gaze.

To quantitatively examine how well this model of visual saliency predicts human fixation, we used the receiver operating characteristic (**ROC**) metric and the Kullback–Leibler (**KL**) divergence. The ROC metric measures the area under the ROC curve. To calculate this metric, we used visual saliency as a feature to classify the locations where the saliency measures are greater than a threshold as fixations and the rest as nonfixated locations. By varying the threshold, we obtained an ROC curve and calculated the area under the curve, which indicates how well the saliency maps predict human gaze.

To avoid a central tendency in human gaze, we used the ROC measure described in (Tatler et al., 2005). We compared the saliency measures at the attended locations to the saliency measures in that scene at the locations that are attended in different scenes in the dataset, called shuffled fixations. The average area under the ROC curve is 0.6803, which means the saliency measures at fixations are significantly higher than the saliency measures at shuffled fixations. Similarly, we measured the KL divergence between two histograms of saliency measures: the histogram of saliency measures at the fixated locations in a test scene and the histogram of saliency measures at the same locations in a different scene randomly selected from the dataset (Zhang et al., 2008).
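A minimal sketch of the shuffled-fixation ROC analysis and the histogram-based KL comparison might look as follows. The rank-sum computation of the area under the curve is standard; the saliency map and fixation inputs are placeholders for the actual data.

```python
import numpy as np

def auc(pos, neg):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic:
    the probability that a randomly chosen positive scores above a randomly
    chosen negative (ties are not handled in this sketch)."""
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1.0
    n_pos, n_neg = len(pos), len(neg)
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def shuffled_auc(smap, fixations, other_scene_fixations):
    """Saliency at this scene's fixations vs. saliency, in the SAME map,
    at locations fixated in other scenes; this controls for the central
    tendency of human gaze."""
    pos = np.array([smap[r, c] for r, c in fixations])
    neg = np.array([smap[r, c] for r, c in other_scene_fixations])
    return auc(pos, neg)

def kl_divergence(p_counts, q_counts, eps=1e-10):
    """KL divergence between two normalized histograms of saliency values."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))
```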

First column: input scenes. Second column: saliency maps produced by our model. Third column: saliency maps given by the AIM model. Fourth column: density maps of human fixation. Saliency is coded in color-scale (red/blue: high/low saliency) (modified from Xu et al., 2010).

**Figure 8.** Examples of saliency maps of natural scenes.


Our model of visual saliency is a good predictor of human gaze in free-viewing static natural scenes, outperforming all other models that we tested. As shown in Table 1 (Xu et al., 2010), our model has an average KL divergence of 0.3016 and the average ROC measure is 0.6803. The average KL divergence and ROC measure for the AIM model are 0.2879 and 0.6799 respectively, which were calculated using the code provided by the authors. The results for other models in Table 1 were given in Zhang et al. (2008).

| Model | KL (SE) | ROC (SE) |
|---|---|---|
| Bruce et al. (2009) | 0.2879 (0.0048) | 0.6799 (0.0024) |
| Itti et al. (1998) | 0.1130 (0.0011) | 0.6146 (0.0008) |
| Gao et al. (2009) | 0.1535 (0.0016) | 0.6395 (0.0007) |
| Zhang et al.: DOG (2008) | 0.1723 (0.0012) | 0.6570 (0.0007) |
| Zhang et al.: ICA (2008) | 0.2097 (0.0016) | 0.6682 (0.0008) |
| Our model | 0.3016 (0.0051) | 0.6803 (0.0027) |

**Table 1.** ROC metric and KL-divergence for saliency maps of static natural scenes (SE: standard error).



## **3.4. Visual saliency and human gaze in free-viewing natural movies**

We used a database of human gaze collected from 8 subjects in free-viewing 50 videos, including indoor scenes, outdoor scenes, television clips, and video games (Itti & Baldi, 2009). Fig. 9 shows the saliency maps for selected frames in 6 videos. The 3 contextual video frames and the target frame are shown to the left and the saliency maps to the right. As predicted by the saliency maps, the moving objects in these videos appear to be salient (e.g., the character in the game video, the falling water drop, the soccer player and the ball, the moving car and the walking policeman, and the jogger and the football player). These predictions accord well with human observations.

**Figure 9.** Saliency maps of dynamic natural scenes. Examples of contextual frames (3 left columns) and target frames (4th column) in 6 video clips and saliency maps (rightmost column) (modified from Xu et al., 2010).

We calculated the KL-divergence for this dataset as described above. Humans tend to gaze at visual features that have high saliency, as shown by the KL divergence measures in Table 2 (Xu et al., 2010). The KL-divergence measure for our model is 0.3153, which is higher than the saliency metric (0.205) (Itti et al., 1998) and the surprise metric (0.241) (Itti & Baldi, 2009), but slightly lower than the AIM model (0.328) (Bruce & Tsotsos, 2009).

## **3.5. Discussion**


### *3.5.1. Distinctions from other models of visual saliency*

Our model of visual saliency is different from all other models. There are four classes of models of visual saliency. The first class does not use PDs in natural scenes but involves complex image-based computation that includes feature extraction, feature pooling, and normalization (Itti et al., 1998). The second class makes use of PDs computed from the current scene the subject is seeing (Bruce & Tsotsos, 2009). The third class is based on PDs in natural scenes that are not dependent on specific contexts (Zhang et al., 2008). Finally, there is a biologically inspired neural network model (Zhaoping & May, 2007). Our model is unique in that: 1) the PDs are computed from an ensemble of natural scenes that presumably approximate the statistics humans experienced during evolution and development; and 2) the PDs are dependent on specific contexts in natural scenes.


| Model | KL (SE) |
|---|---|
| Bruce et al. (2009) | 0.328 (0.009) |
| Itti et al. (2009) | 0.241 (0.006) |
| Zhang et al. (2009) | 0.181 |
| Itti et al. (1998) | 0.205 (0.006) |
| Our model | 0.315 (0.003) |

**Table 2.** KL-divergence for saliency maps of dynamic natural scenes (SE: standard error).

### *3.5.2. Neurons as estimators of context-mediated PDs in natural scenes*

These results support the notion that neurons in the early visual cortex act as estimators of the context-mediated PDs in natural scenes. In this way, any single neuron relates an occurrence of any visual variable to the underlying PD in natural scenes. These PDs are related to all possible stimuli in natural scenes experienced by visual animals over evolutionary and developmental time.

This hypothesis is distinct from the conventional view of neurons as feature detectors, the efficient coding hypothesis (Simoncelli & Olshausen, 2001), predictive coding (Rao & Ballard, 1999), the proposal that neurons encode logarithmic likelihood functions (Rao, 2004), and several recent V1 neuronal models that involve complex spatiotemporal structures but do not function as estimators of PDs in natural scenes (Rust et al., 2005; Chen et al., 2007).

Since the response of any single neuron encodes and decodes the PD of the visual variable in natural scenes, this concept is also different from probabilistic population codes where populations of neurons automatically encode PDs due to varying tuning among neurons and noise (Ma et al., 2005).


## **4. Statistics of 3D natural scenes and visual space**

## **4.1. Introduction**

In the last two sections, I presented evidence that aspects of human natural vision are generated on the basis of the PDs of visual variables in 2D natural scenes. However, the most fundamental task of vision is to generate visual percepts and visually guided behaviors in the 3D physical world. In this section, I present PDs in 3D natural scenes and relate them to the characteristics of human visual space.

Visual space is characterized by perceived geometrical properties such as distance, linearity, and parallelism. An appealing intuition is that these properties are the result of a direct transformation of the Euclidean characteristics of physical space (Hershenson, 1998; Loomis et al., 1996; Gillam, 1996). This assumption is, however, inconsistent with a variety of puzzling and often subtle discrepancies between the predicted consequences of any direct mapping of physical space and what people actually see. A number of examples in perceived distance, the simplest aspect of visual space, show that the apparent distance of objects bears no simple relation to their physical distance from the observer (Loomis et al., 1996; Gillam, 1996) (Fig. 10). Although a variety of explanations have been proposed, there has been little or no agreement about the basis of this phenomenology.

We tested the hypothesis that these anomalies of perceived distance are all manifestations of a probabilistic strategy for generating visual percepts in response to inevitably ambiguous visual stimuli (Knill & Richards, 1996; Purves & Lotto, 2003; Trommershauser et al., 2011). A straightforward way of examining this idea in the case of visual space is to analyze the statistical relationship between geometrical features (e.g., points, lines and surfaces) in the image plane and the corresponding physical geometry in representative visual scenes. Accordingly, we used a database of natural scene geometry acquired with a laser range scanner to test whether the otherwise puzzling phenomenology of perceived distance can be explained in statistical terms (Fig. 11).

## **4.2. A probabilistic concept of visual space**

The challenge in generating perceptions of distance (and spatial relationships more generally) is the inevitable ambiguity of visual stimuli. When any point in space is projected onto the retina, the corresponding point in projection could have been generated by an infinite number of different locations in the physical world. In consequence, the relationship between any projected image and its source is inherently ambiguous. Nevertheless, the PD of the distances of un-occluded object surfaces from the observer must have a potentially informative statistical structure. Given this inevitable ambiguity, it seems likely that highly evolved visual systems would have taken advantage of this probabilistic information in generating perceptions of physical space.

(a), Specific distance tendency. When a simple object is presented in an otherwise dark environment, observers usually judge it to be at a distance of 2-4 m, regardless of its actual distance. In these diagrams, which are not to scale, 'Phy' indicates the physical position of the object and 'Per' the perceived position. (b), Equidistance tendency. Under these same conditions, an object is usually judged to be at about the same distance from the observer as neighboring objects, even when their physical distances differ. (c), Perceived distance of objects at eye-level. The distances of nearby objects presented at eye-level tend to be overestimated, whereas the distances of further objects tend to be underestimated. (d), Perceived distance of objects on the ground. An object on the ground a few meters away tends to appear closer and slightly elevated with respect to its physical position. Moreover, the perceived location becomes increasingly elevated and relatively closer to the observer as the angle of the line of sight approaches the horizontal plane at eye-level. (e), Effects of terrain on distance perception. Under more realistic conditions, the distance of an object on a uniform ground-plane a few meters away from the observer is usually accurately perceived. When, however, the terrain is disrupted by a dip, the same object appears to be further away; conversely, when the ground-plane is disrupted by a hump, the object tends to appear closer than it is (modified from Yang & Purves, 2003a).

**Figure 10.** Anomalies in perceived distance.


The distance (in meters) of each pixel is indicated by color coding. Black areas are regions where the laser beam did not return a value.

**Figure 11.** A range image acquired by laser range scanning.

This probabilistic strategy can be formalized in terms of Bayesian inference (Knill & Richards, 1996; Trommershauser et al., 2011). In this framework, the PD of the physical sources underlying a visual stimulus, P(S|I), can be expressed as

$$P(S \mid I) = P(I \mid S)P(S)/P(I) \tag{6}$$

where S represents the parameters of physical scene geometry and I the visual image. P(S) is the PD of scene geometry in typical visual environments (the prior), P(I|S) the PD of stimulus I generated by the scene geometry S (the likelihood function), and P(I) a normalization constant.
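A discretized version of Eq. (6) makes the logic explicit. The prior below is a hypothetical stand-in with a mode near 3 m and an approximately exponential fall-off, qualitatively like the distance statistics presented below; with a flat likelihood (no distance cues), the posterior simply reproduces the prior.

```python
import numpy as np

d = np.linspace(0.5, 50.0, 1000)   # candidate distances S, in meters
prior = d**2 * np.exp(-d / 1.5)    # hypothetical P(S): mode at 3 m, near-exponential tail
prior /= prior.sum()

def posterior(likelihood):
    """Eq. (6): P(S | I) is proportional to P(I | S) P(S); P(I) is the normalizer."""
    post = likelihood * prior
    return post / post.sum()

flat_likelihood = np.ones_like(d)  # no distance cues in the stimulus
print(d[np.argmax(posterior(flat_likelihood))])  # about 3 m: the prior mode dominates
```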

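To make Eq. (6) concrete, the short sketch below evaluates it on a discretized grid of candidate distances. It is a minimal illustration only: the prior is a hypothetical analytic stand-in for the empirical PD of distances in Fig. 12 (a), not the measured data, and the grid and parameter values are assumptions chosen for clarity.

```python
import numpy as np

# Minimal numerical sketch of Eq. (6) on a grid of candidate distances S.
# The prior below is a hypothetical stand-in for the empirical PD of
# distances in natural scenes (Fig. 12 (a)), not the measured distribution.
distances = np.linspace(0.5, 50.0, 500)        # candidate distances S (m)

prior = distances * np.exp(-distances / 3.0)   # toy prior with a mode at 3 m
prior /= prior.sum()

def posterior(likelihood):
    """P(S|I) = P(I|S) P(S) / P(I), with P(I) the normalizing sum."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

# Reduced-cue stimulus: a flat likelihood carries no distance information,
# so the posterior collapses onto the prior and peaks near 3 m.
p_none = posterior(np.ones_like(distances))
print("mode with no cues:    %.2f m" % distances[np.argmax(p_none)])

# Informative stimulus: a Gaussian likelihood centered on a 10 m source
# pulls the posterior toward the physical distance, reducing the prior bias.
p_cue = posterior(np.exp(-0.5 * ((distances - 10.0) / 1.4) ** 2))
print("mode with a 10 m cue: %.2f m" % distances[np.argmax(p_cue)])
```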

If visual space is indeed determined by the PD of 3D scene geometry underlying visual stimuli, then, under reduced-cue conditions, the prior PD of distances to the observer in typical viewing environments should bias perceived distances. By the same token, the PD of the distances between locations in a scene should bias the apparent relative distances among them. Finally, when additional information pertinent to distance is present, these biases will be reduced.

## **4.3. PDs of distances in natural scenes**

The information at each pixel in the range image database is the distance, elevation, and azimuth of the corresponding location in the physical scene relative to the laser scanner (Fig. 11). These data were used to compute the PD of distances from the center of the scanner to locations in the physical scenes in the database.

The first of several statistical features apparent in the analysis is that the PD of the radial distances from the scanner to physical locations in the scenes has a maximum at about 3 m, declining approximately exponentially over greater distances (Fig. 12 (a)). This PD is scale-invariant, meaning that any scaled version of the geometry of a set of natural scenes will, in statistical terms, be much the same (Lee et al., 2001). A simple model of natural 3D scenes generates a scale-invariant PD of object distances nearly identical to that obtained from natural scenes (see legend of Fig. 12).

A second statistical feature of the analysis concerns how different physical locations in natural scenes are typically related to each other with respect to distance from the observer. The PD of the differences in the distance from the observer to any two physical locations is highly skewed, having a maximum near zero and a long tail (Fig. 12 (b)). Even for angular separations as large as 30°, the most probable difference between the distances from the image plane of two locations is minimal.

A third statistical feature is that the PD of horizontal distances from the scanner to physical locations changes relatively little with height in the scene (the height of the center of the scanner was always 1.65 m above the ground, thus approximating the eye-level of an average adult) (Fig. 12 (c)). The PD of physical distances at eye-level has a maximum at about 4.7 m and decays gradually as the distances increase. The PDs of the horizontal distances of physical locations at different heights above and below eye-level tend to have a maximum at about 3 m, and are similar in shape.

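All three statistics are simple conditional histograms over the range data. The sketch below shows one way they could be computed; the array conventions (aligned 2D arrays of distance in meters and elevation/azimuth in degrees, with missing laser returns marked NaN) and the function name are assumptions for illustration, not the authors' code.

```python
import numpy as np

def distance_statistics(dist, elev, azim, sep_deg=5.0):
    """Histogram-based estimates of the three PDs in Section 4.3.

    dist, elev, azim: aligned 2D arrays (meters, degrees, degrees) for one
    range image; pixels with no laser return are NaN. All names and array
    conventions here are assumptions for illustration.
    """
    valid = ~np.isnan(dist)

    # (1) PD of radial distances to all valid scene points (Fig. 12 (a)).
    p_dist, _ = np.histogram(dist[valid], bins=100, range=(0.0, 300.0),
                             density=True)

    # (2) PD of distance differences for pixel pairs separated horizontally
    # by ~sep_deg degrees, approximated by a fixed column offset (Fig. 12 (b)).
    deg_per_col = abs(azim[0, -1] - azim[0, 0]) / (azim.shape[1] - 1)
    k = max(1, int(round(sep_deg / deg_per_col)))
    diffs = np.abs(dist[:, k:] - dist[:, :-k])
    p_diff, _ = np.histogram(diffs[~np.isnan(diffs)], bins=100,
                             range=(0.0, 50.0), density=True)

    # (3) PD of horizontal distances for points near eye-level, i.e. near
    # the height of the scanner center, 1.65 m above the ground (Fig. 12 (c)).
    height = dist * np.sin(np.radians(elev))   # height relative to eye-level
    horiz = dist * np.cos(np.radians(elev))    # horizontal distance
    near_eye = valid & (np.abs(height) < 0.25)
    p_horiz, _ = np.histogram(horiz[near_eye], bins=100, range=(0.0, 100.0),
                              density=True)

    return p_dist, p_diff, p_horiz
```
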
## **4.4. Perceived distances in impoverished settings**

How, then, do these scale-invariant PDs of distances from the image plane in natural scenes account for the anomalies of visual space summarized in Fig. 10?

(a), The scale-invariant PD of the distances from the center of the laser scanner to all the physical locations in the database (black line). The red line represents the PD of distances derived from a simple model in which 1000 planar rectangular surfaces were uniformly placed at distances from 2.5-300 m, from 150 m left to 150 m right, and from the ground to 25 m above the ground (which was 1.65 m below the image center). The sizes of these uniformly distributed surfaces ranged from 0.2-18 m. Five hundred 512 × 512 images of this model made by a pinhole camera method showed statistical behavior similar to that derived from the range image database for a wide variety of specific values, although with different slopes and modes. The model also generated statistical behavior similar to that shown in (b) and (c) (not shown). (b), PDs of the differences in the physical distances of two locations separated by three different angles in the horizontal plane. (c), PDs of the horizontal distances of physical locations at different heights with respect to eye-level (modified from Yang & Purves, 2003a).

**Figure 12.** PDs of the physical distances from the image plane of points in the range images of natural scenes.

When little or no other information is available in a scene, observers tend to perceive objects at a distance of 2-4 m (Owens et al., 1976). In the absence of any distance cues, the likelihood function in Eq. (6) is flat; the apparent distance of a point in physical space should therefore accord with the PD of the distances of all points in typical visual scenes. As indicated in Fig. 12 (a), this distribution has a maximum probability at about 3 m. The agreement between this PD of distances in natural scenes and the relevant psychophysical evidence is thus consistent with a probabilistic explanation of the 'specific distance tendency'.

The similar apparent distance of an object to the apparent distances of its near neighbors in the retinal image (the 'equidistance tendency' (Owens et al., 1976)) also accords with the PD of the distances of locations in natural scenes. In the absence of additional information about differences in the distances of two nearby locations, the likelihood function is again more or less flat. As a result, the PD of the differences of the physical distances from the image plane to any two locations in natural scenes should strongly bias the perceived difference in their distances. Since this distribution for two locations with relatively small angular separations (the black line in Fig. 12 (b)) has a maximum near zero, any two neighboring objects should be perceived to be at about the same distance from the observer. At larger angular separations (the green line in Fig. 12 (b)), however, the probability associated with small absolute differences in the distance to the two points is lower than the corresponding probabilities for smaller separations, and the distribution is relatively flatter. Accordingly, the tendency to see neighboring points at the same distance from the observer would be expected to decrease somewhat as a function of increasing angular separation. Finally, when more specific information about the distance difference is present, this tendency should decrease. Each of these tendencies has been observed in psychophysical studies of the 'equidistance tendency'.

## **4.5. Perceived distances in more complex circumstances**

The following explanations for the phenomena illustrated in Figs. 10 (c) and (d) are somewhat more complex since, in contrast to the 'specific distance' and 'equidistance' tendencies, the relevant psychophysical observations were made under conditions that entailed some degree of contextual visual information. Thus, the relevant likelihood functions are no longer flat. Since their form is not known, we used a Gaussian to approximate the likelihood function in the following analyses.

The PD of physical distances at eye-level (the black line in Fig. 12 (c)) accounts for the perceptual anomalies in response to stimuli generated by near and far objects presented at this height (see Fig. 10 (c)). As shown in Fig. 13 (a), the distance that should be perceived on this basis is approximately a linear function of physical distance, with near distances being overestimated and far distances underestimated; the physical distance at which overestimation changes to underestimation is about 5-6 m. The effect of these statistics accords both qualitatively and quantitatively with the distances reported under these experimental conditions (Philbeck & Loomis, 1997).

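The prediction in Fig. 13 (a) can be reproduced in a few lines: multiply a prior over distance by a Gaussian likelihood centered on the physical distance and take the posterior mean (the 'local mass mean'). The sketch below substitutes a hypothetical analytic prior for the measured eye-level PD, so the numbers are illustrative only; the crossover from over- to underestimation is the point.

```python
import numpy as np

# Sketch of the Fig. 13 (a) computation: perceived distance as the mean of
# the posterior formed by multiplying a distance prior by a Gaussian
# likelihood with SD 1.4 m. The prior is a hypothetical stand-in for the
# measured eye-level PD (Fig. 12 (c), black line), with its mode near 4.7 m.
d = np.linspace(0.5, 30.0, 600)
prior = d * np.exp(-d / 4.7)
prior /= prior.sum()

def perceived(physical, sigma=1.4):
    likelihood = np.exp(-0.5 * ((d - physical) / sigma) ** 2)
    post = likelihood * prior
    post /= post.sum()
    return float(np.sum(d * post))   # local mass mean of the posterior

# Near distances are pulled up toward the prior mode (overestimated) and
# far distances pulled down (underestimated), as in Fig. 13 (a).
for x in (2.0, 5.0, 10.0, 15.0):
    print("physical %4.1f m -> perceived %5.2f m" % (x, perceived(x)))
```
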
To examine whether the perceptual observations summarized in Fig. 10 (d) can also be explained in these terms, we computed the PD of physical distances of points at different elevation angles of the laser beam relative to the horizontal plane at eye-level (Fig. 14). As shown in Fig. 14 (a), the PD of distances is more dispersed when the line of sight is directed above rather than below eye-level. The distribution shifts toward nearer distances with increasing absolute elevation angle, a tendency that is more pronounced below than above eye-level. A more detailed examination of the distribution within 30 m shows a single salient ridge below eye-level (indicated in red), extending from ~3 m near the ground to ~10 m at an elevation of -10° (Fig. 14 (b)). The distances of the average physical locations at different elevation angles of the scanning beam form a gentle curve. Below eye-level, the height of this curve is relatively near the ground for closer distances, but increases slowly as the horizontal distance from the observer increases. If the portion of the curve at heights below eye-level in Fig. 14 (c) is taken as an index of the average ground, it is apparent that the average ground is neither a horizontal plane nor a plane with constant slant, but a curved surface that is increasingly inclined toward the observer as a function of horizontal distance.

These characteristics of distance as a function of the elevation of the line of sight can thus account for the otherwise puzzling perceptual effects shown in Fig. 10 (d). The perceived location of an object on the ground without much additional information varies according to the declination of the line of sight, the object appearing closer and higher than it really is as a function of this angle. The apparent location of an object predicted by the PDs in Fig. 14 is increasingly higher and closer to the observer as the declination of the line of sight decreases, in agreement with the relevant psychophysical data (Ooi et al., 2001) (Fig. 13 (b)).

Other characteristics of visual space shown in Fig. 10 can be explained in the same way (Yang & Purves, 2003a).

(a), The perceived distances predicted from the PD of physical distances measured at eye-level. The solid line represents the local mass mean of the PD obtained by multiplying the PD in Fig. 12 (c) (black line) by a Gaussian likelihood function of distances with a standard deviation of 1.4 m. The dashed line represents the equivalence of perceived and physical distances for comparison. (b), The perceived distances of objects on the ground in the absence of other information predicted from the PD in Fig. 14 (a). The likelihood function at an angular declination $\theta$ was a Gaussian, i.e., $\propto \exp(-(\theta-\theta_0)^2/2\sigma^2)$, where $\theta_0 = \sin^{-1}(H/R)$, $\sigma = 8^\circ$, $R$ is radial distance, and $H = 1.65$ m. The prior was the distribution of distances at angular declinations within $[\theta_0 - 8^\circ, \theta_0 + 8^\circ]$. The ground in the diagram is a horizontal plane 1.65 m below eye-level (icon). The predicted perceptual locations of objects on the ground are indicated by the solid line (modified from Yang & Purves, 2003a).

**Figure 13.** The perceived distances predicted for objects located at eye-level and objects on the ground.

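For Fig. 13 (b), the same machinery runs over angular declination rather than distance. The sketch below follows the caption's recipe under stated assumptions: `pd_given_declination` is a hypothetical interface onto conditional PDs of the kind computed for Fig. 14 (see the next sketch), and the geometry assumes an object on a ground plane 1.65 m below eye-level.

```python
import numpy as np

H = 1.65                              # eye height above the ground (m)
d = np.linspace(0.5, 40.0, 800)       # candidate radial distances (m)

def perceived_ground_distance(R_physical, pd_given_declination, sigma=8.0):
    """Posterior-mean distance for an object on the ground (Fig. 13 (b)).

    pd_given_declination(theta0, half_width) is a hypothetical callable
    returning a prior over the grid `d`, pooled from the range data over
    declinations within +/- half_width degrees of theta0.
    """
    theta0 = np.degrees(np.arcsin(H / R_physical))  # true angular declination
    prior = pd_given_declination(theta0, half_width=sigma)

    # Likelihood over candidate distances, via each candidate's declination:
    # a Gaussian in declination, exp(-(theta - theta0)^2 / (2 sigma^2)).
    theta = np.degrees(np.arcsin(np.clip(H / d, -1.0, 1.0)))
    likelihood = np.exp(-0.5 * ((theta - theta0) / sigma) ** 2)

    post = likelihood * prior
    post /= post.sum()
    return float(np.sum(d * post))

```
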
(a), Contour plot of the logarithm of the PD of distances at elevation angles indicated by color coding. Red indicates a probability value of ~$10^{-2}$ and blue ~$10^{-5.5}$. (b), Blowup of (a) showing the PD of distances within 30 m in greater detail. In this case, red indicates a probability value of ~$10^{-1.5}$. (c), The average distance as a function of elevation angle, based on the data in (a). The vertical axis is the height relative to eye-level; the horizontal axis is the horizontal distance from the image plane. The curve below eye-level, if modeled as a piece-wise plane, would have a slant of about 1.5° from 3-15 m, and about 5° from 15-24 m (modified from Yang & Purves, 2003a).

**Figure 14.** PD of physical distances at different elevation angles.

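Estimating the PDs plotted in Fig. 14 amounts to binning the range pixels by elevation angle and histogramming distance within each bin. A sketch, under the same assumed array conventions as the earlier one:

```python
import numpy as np

def distance_by_elevation(dist, elev, d_bins=120, d_max=120.0):
    """Conditional PDs of radial distance per elevation-angle bin (Fig. 14).

    dist, elev: aligned 2D arrays (meters, degrees relative to eye-level),
    NaN where the laser returned nothing; these conventions are assumptions.
    Returns a (angle bins x distance bins) matrix whose logarithm
    corresponds to the contour plots in Fig. 14 (a) and (b).
    """
    valid = ~np.isnan(dist)
    angle_edges = np.arange(-20.0, 20.5, 0.5)
    pds = []
    for lo, hi in zip(angle_edges[:-1], angle_edges[1:]):
        sel = valid & (elev >= lo) & (elev < hi)
        if not sel.any():                 # no pixels at this elevation angle
            pds.append(np.zeros(d_bins))
            continue
        p, _ = np.histogram(dist[sel], bins=d_bins, range=(0.0, d_max),
                            density=True)
        pds.append(p)
    return np.array(pds), angle_edges
```
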
## **4.6. Discussion**

When projected onto the retina, 3D spatial relationships in the physical world are necessarily transformed into 2D relationships in the image plane. As a result, the physical sources underlying any geometrical configuration in the retinal image are uncertain: a multitude of different scene geometries could underlie any particular configuration in the image. This uncertain link between retinal stimuli and physical sources presents a biological dilemma, since an observer's fate depends on visually guided behavior that accords with real-world physical sources.

Given this quandary, we set out to explore the hypothesis that the uncertain relationship of images and sources is addressed by a probabilistic strategy, using the phenomenology of visual space to test this idea. If physical and perceptual space are indeed related in this way, then the characteristics of human visual space should accord with the PDs of 3D natural scene geometry. Observers would be expected to perceive objects in positions substantially and systematically different from their physical locations when countervailing empirical information is not available, or at locations predicted by the altered PDs of the possible sources of the stimulus in question when other contextual information is available. Using a database of range images, we showed that the phenomena illustrated in Fig. 10 can all be rationalized in this framework.

If visual space is indeed generated by a probabilistic strategy, then explaining the relevant perceptual phenomenology will inevitably require knowledge of the statistical properties of natural visual environments with respect to observers. Visual space generated probabilistically will necessarily be a space in which perceived distances are not a simple mapping of physical distances; on the contrary, apparent distance will always be determined by the way all the available information at that moment affects the PD of the gamut of the possible sources of any physical point in the scene.

## **5. Conclusion**

These and many other studies present a strong case supporting the concept that vision works as a fundamentally statistical machine. In this concept, even the simplest visual percept has a statistical basis, i.e., it is related to certain statistics of natural environments that support routinely successful visually guided behavior. The statistics of natural visual environments must have been incorporated into the visual circuitry by successful behavior in the world over evolutionary and developmental time.

Natural environments exhibit a range of statistics, including the statistics of 2D and 3D natural scenes in both the space and time domains. As discussed here and elsewhere (Geisler, 2008), these statistics are related to many aspects of human natural vision. Since natural environments consist of objects with various physical properties that are arranged in 3D space and move in a variety of ways, the statistics of natural objects, activities, and events, though not discussed here, are critical for understanding human object recognition and the understanding of activities and events (Yuille & Kersten, 2006; Doya et al., 2007; Friston, 2010).

What could be the neural mechanisms underlying this fundamentally statistical machine? A broad hypothesis is that the response properties of visual neurons and their connections, the organization of visual cortex, the patterns of activity elicited by visual stimuli, and visual perception are all determined by the PDs of visual stimuli. In this conception, neurons do not detect or encode features, but by virtue of their activity levels, act as estimators of the PDs of the variables underlying any given stimulus. From this perspective, the function of visual cortical circuitry is to propagate, combine, and transform these PDs. The iterated structure of the primary visual cortex in primates may thus be organized in the way it is in order to generate PDs pertinent to simpler aspects of visual stimuli. By the same token, the extrastriate visual cortical areas may serve to generate PDs pertinent to more complex aspects of visual stimuli by propagating, combining, and transforming the PDs elaborated in the V1 area. The activity patterns elicited by any visual stimulus would, in this conception, be determined by the joint PDs of the variables underlying visual stimuli, which, in turn, determine what people actually see.

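One concrete reading of "propagating, combining, and transforming" PDs is Bayesian cue combination: on a discrete grid, two PDs over the same variable are merged by pointwise multiplication and renormalization. The sketch below is a toy illustration of that operation, not a circuit model; all values are arbitrary.

```python
import numpy as np

# Toy illustration of combining two PDs over the same stimulus variable by
# pointwise multiplication and renormalization, the discrete analogue of
# multiplying likelihoods in Eq. (6). Arbitrary example values throughout.
x = np.linspace(0.0, 20.0, 400)

def normalize(p):
    return p / p.sum()

cue_a = normalize(np.exp(-0.5 * ((x - 8.0) / 2.0) ** 2))    # broad cue
cue_b = normalize(np.exp(-0.5 * ((x - 10.0) / 1.0) ** 2))   # sharper cue

combined = normalize(cue_a * cue_b)
# The result is narrower than either input and lies closer to the more
# reliable (sharper) cue, as in standard Bayesian cue combination.
print("combined mode: %.2f" % x[np.argmax(combined)])
```
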
This statistical concept of vision and of visual system structure and function is radically different from the conventional view, in which visual neurons are conceived to perform bottom-up, image-based processing (e.g., computing zero-crossings, luminance and texture gradients, stereoscopic and motion correspondence, and grouping) to build a series of symbolic representations of visual stimuli (e.g., the primal sketch, 2½-D sketch, and 3D representation) (Marr, 2010). Since the statistics of natural scenes, which are, as argued above, fundamental to the generation of natural vision and visually guided behaviors, are not contained in any current stimulus the viewing animal is seeing, image-based feature extraction and representation construction in the current stimulus *per se* will not generate percepts that allow routinely successful behaviors. The results presented here and many others support this statistical concept of vision and visual system structure and function, and several recent reviews also point to this new concept (Knill & Richards, 1996; Rao et al., 2002; Purves & Lotto, 2003; Doya et al., 2007; Trommershauser et al., 2011; Simoncelli & Olshausen, 2001; Yuille & Kersten, 2006; Geisler, 2008; Friston, 2010), but much is left to the next generation of neuroscientists.

## **Author details**

Zhiyong Yang

*Brain and Behavior Discovery Institute and Department of Ophthalmology, Georgia Health Sciences University, USA*

## **6. References**

Adelson, E. H. (2000). Lightness perception and lightness illusions. In: *The New Cognitive Neurosciences (2nd Ed.)*, M. Gazzaniga (Ed.), pp. 339-351, MIT Press, ISBN 0262071959, Cambridge, MA, USA.

Anderson, B. L. & Winawer, J. (2005). Image segmentation and lightness perception. *Nature,* Vol. 434, pp. 79-83.

Blakeslee, B. & McCourt, M. E. (2004). A unified theory of brightness contrast and assimilation incorporating oriented multiscale spatial filtering and contrast normalization. *Vision Research,* Vol. 44, No. 21, pp. 2483-2503.

Bruce, N. D. & Tsotsos, J. K. (2009). Saliency, attention, and visual search: an information theoretic approach. *Journal of Vision,* Vol. 9, No. 5, pp. 1-24.

Chen, X.; Han, F.; Poo, M. M. & Dan, Y. (2007). Excitatory and suppressive receptive field subunits in awake monkey primary visual cortex (V1). *Proc. Natl. Acad. Sci. USA,* Vol. 104, No. 48, pp. 19120-19125.

Doya, K.; Ishii, S.; Pouget, A. & Rao, R. P. N. (Eds.). (2007). *Bayesian Brain: Probabilistic Approaches to Neural Coding.* MIT Press, ISBN 0-262-04238-X, Cambridge, MA, USA.

Friston, K. (2010). The free-energy principle: a unified brain theory? *Nature Reviews Neuroscience,* Vol. 11, pp. 127-138.

Gao, D.; Han, S. & Vasconcelos, N. (2009). Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. *IEEE Transactions on Pattern Analysis and Machine Intelligence,* Vol. 31, No. 6, pp. 989-1005.

Geisler, W. S. (2008). Visual perception and the statistical properties of natural scenes. *Annu. Rev. Psychol.,* Vol. 59, pp. 167-192.

Gilchrist, A. et al. (1999). An anchoring theory of lightness perception. *Psychological Review,* Vol. 106, No. 4, pp. 795-834.

Gillam, B. (1995). The perception of spatial layout from static optical information. In: *Perception of Space and Motion,* W. Epstein & S. Rogers (Eds.), pp. 23-67, Academic Press, Inc., ISBN 978-0122405303, San Diego, CA, USA.

Hershenson, M. (1998). *Visual Space Perception: A Primer.* MIT Press, ISBN 978-0262581677, Cambridge, MA, USA.

Hubel, D. H. & Wiesel, T. N. (1977). Functional architecture of macaque monkey visual cortex. *Proc. R. Soc. Lond. B,* Vol. 198, No. 1130, pp. 1-59.

Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. *IEEE Trans. Neural Netw.,* Vol. 10, No. 3, pp. 626-634.

Itti, L. & Baldi, P. (2009). Bayesian surprise attracts human attention. *Vision Research,* Vol. 49, pp. 1295-1306.

Itti, L. & Koch, C. (2001). Computational modelling of visual attention. *Nature Reviews Neuroscience,* Vol. 2, No. 3, pp. 194-203.

Itti, L.; Koch, C. & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. *IEEE Transactions on Pattern Analysis and Machine Intelligence,* Vol. 20, No. 11, pp. 1254-1259.

Kingdom, F. A. A. (2011). Lightness, brightness and transparency: A quarter century of new ideas, captivating demonstrations and unrelenting controversy. *Vision Research,* Vol. 51, No. 7, pp. 652-673.

Knill, D. C. & Richards, W. (Eds.). (1996). *Perception as Bayesian Inference.* Cambridge University Press, ISBN 052146109X, New York, USA.

Lee, A. B.; Mumford, D. & Huang, J. (2001). Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. *International Journal of Computer Vision,* Vol. 41, No. 1-2, pp. 35-59.

Loomis, J. M.; Da Silva, J. A.; Philbeck, J. W. & Fukusima, S. S. (1996). Visual perception of location and distance. *Current Directions in Psychological Science,* Vol. 5, No. 3, pp. 72-77.

Ma, W. J.; Beck, J. M.; Latham, P. E. & Pouget, A. (2006). Bayesian inference with probabilistic population codes. *Nature Neuroscience,* Vol. 9, pp. 1432-1438.

Marr, D. (2010). *Vision: A Computational Investigation into the Human Representation and Processing of Visual Information* (reprinted from the 1982 edition). MIT Press, ISBN 978-0-262-51462-0, Cambridge, MA, USA.

Olmos, A. & Kingdom, F. A. A. (2004). A biologically inspired algorithm for the recovery of shading and reflectance images. *Perception,* Vol. 33, No. 12, pp. 1463-1473.

Ooi, T. L.; Wu, B. & He, Z. J. (2001). Distance determined by the angular declination below the horizon. *Nature,* Vol. 414, pp. 197-200.

Owens, D. A. & Leibowitz, H. W. (1976). Oculomotor adjustments in darkness and the specific distance tendency. *Attention, Perception, & Psychophysics,* Vol. 20, No. 1, pp. 2-9.

Philbeck, J. W. & Loomis, J. M. (1997). Comparison of two indicators of perceived egocentric distance under full-cue and reduced-cue conditions. *J. Exp. Psychol. Hum. Percept. Perform.,* Vol. 23, No. 1, pp. 72-85.

Purves, D. & Lotto, R. B. (2003). *Why We See What We Do: An Empirical Theory of Vision.* Sinauer Associates, Inc., ISBN 0-87893-752-8, Sunderland, MA, USA.

Purves, D.; Williams, S. M.; Nundy, S. & Lotto, R. B. (2004). Perceiving the intensity of light. *Psychological Review,* Vol. 111, No. 1, pp. 142-158.

**Chapter 10**

**Models of Information Processing in the Visual Cortex**

Vincent de Ladurantaye, Jean Rouat and Jacques Vanden-Abeele

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/50616

© 2012 Rouat et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

In the most general meaning of the word, a model is a way of representing something of interest. If we ask a biologist to model the visual system, he will probably talk about neurons, dendrites and synapses. On the other hand, if we ask a mathematician to model the visual system, he will probably talk about variables, probabilities and differential equations.

There are an almost infinite number of ways to model any given system, and this is particularly true if we are interested in something as complex as the human brain, including the visual cortex. It is impossible to review every type of model present in the literature, and this chapter does not claim to be an exhaustive review of all the models of vision published so far. Instead, we group the different types of models into a small number of categories, to give the reader a good overview of what can be achieved in terms of modeling biological vision. Each section of the chapter concentrates on one of these categories, begins with an outline of the general properties and goals of that category of model, and then explores different implementation methods (i.e., specific models).

### **1.1. Why models?**

There are many reasons why one might want to elaborate and use a model. Different kinds of models achieve different goals, but they all serve a common purpose: a model is a tool for learning about, and better understanding, our world. Models are used as a simplification of reality. Many models represent only a specific aspect of a complex system and make arbitrary assumptions about the rest of the system, which may be unrelated, judged negligible, or simply unknown. A good example is that current neurobiological models of the brain focus mainly on modeling neurons, neglecting interneurons and glial cells; glial cells are probably ignored because they are not yet fully understood. Some models might purposefully ignore some parts of a system to isolate a specific structure and/or function that we want to understand.

Because of the extreme complexity of the cortex, simplification is unavoidable and even productive. Modeling is an efficient approach for simplifying reality as a way to gain a better understanding of the system under study.

