284 Neuroimaging – Cognitive and Clinical Neuroscience

**2. Internal criticisms to RL**

During the 1940s, more and more data were piling up demonstrating the insufficiency of behaviorism to account for human and animal behavior. For example, Tolman and colleagues showed that animals can and do learn even without obtaining reinforcement (Tolman, 1948). They performed a series of experiments on maze learning in rats. Animals that were left free to familiarize themselves with the maze before the reinforcement session were afterwards able to find the food much more efficiently than completely naive animals. To explain these findings, Tolman introduced the concept of the "cognitive map", i.e. an internal representation of the maze that the rats used to find reinforcers more efficiently. Because of this and other demonstrations that animals hold some kind of internal representation of the environment (memory), Tolman formed part of what became known as "the cognitive revolution". During the same period, but in the field of psychobiology, Donald Hebb wrote *The Organization of Behavior* (1949), a seminal work in which a neurobiological theory of learning was proposed for the first time. Hebb suggested that the synaptic connection between two neurons increases in efficacy after repeated simultaneous activity of the two neurons. This law, known as the "Hebbian rule" and describing what was called "Hebbian learning", provided the first neural hypothesis on the basis of memory, thus opening the "black box" that behaviorists considered not scientifically investigable. The depth of Hebb's intuition can be better appreciated if we consider that the Hebbian rule was confirmed experimentally almost twenty years after its formulation, with the discovery of synaptic long-term potentiation (LTP) in the rabbit hippocampus (Lømo, 1966).

Another strong criticism came from psycholinguistics. In a famous review, Noam Chomsky (1959) argued that the RL paradigm was not suitable to explain the generative feature of natural language (i.e. the possibility to express a quasi-infinite variety of verbal expressions). In the same work, Chomsky also provided a survey of research in animal behavior (e.g., imprinting) that seemed to be in striking contrast with key behaviorist tenets. Finally, and most importantly from the theoretical point of view, Chomsky showed that Skinner himself was obliged to introduce hypotheses about internal variables (e.g., internal self-reinforcements) in order to explain human verbal behavior.

**3. External developments**

Advances in information theory and control theory in engineering also played an important role in the demise of RL. This happened during the 1940s and 1950s with the publication of several seminal works like those of Shannon (1948), Turing (1936) and Wiener (1948). Their importance consisted in showing that it was possible to formulate rigorous mathematical theories and models to study information processing. In control theory (Wiener, 1948), for instance, the term "control" referred to the auto-correction of internal parameters of a system based on a feedback signal indicating the error between the desired (or expected) value of an internal parameter and its real value, typically provided by the environment. This general theory of control, called cybernetics (from the ancient Greek for "the art of piloting"), did not refer to a particular system; instead, it provided mathematical models to study control phenomena occurring *inside* any system, whether animal, artificial, or even social. A similar story holds for information theory (Shannon, 1948), which provided the concept of "information", a measure that did not refer to any directly measurable physical variable, but instead to the *internal* "surprise" of any system receiving an external signal.

These new disciplines showed that it was possible, and indeed a productive and powerful approach, to investigate the internal functioning of systems (including biological organisms) by mathematically modelling hidden machinery that could not be observed directly. In this way, the philosophical-methodological assumption of behaviorism, according to which the scientific approach should be limited to strictly empirical investigation, was shown to be unnecessary for scientific progress.

#### **4. Precursors to the return of RL**

Because of these developments, behaviorism, and with it RL, was discredited for several decades. Instead, an alternative paradigm became dominant, according to which the human mind could be construed as a computer that manipulates abstract symbols (e.g., Neisser, 1967; Atkinson and Shiffrin, 1968). In recent years, however, the RL framework has become influential again. At least two developments in the second half of the 20th century prepared the ground for renewed interest in RL. The first originated in human learning theory; the second came from a new discipline called connectionist psychology, which proposed itself as an alternative to the then canonical symbol-manipulation paradigm for the study of cognition.

#### **4.1 Human learning theory and the Rescorla-Wagner model**

Important phenomena observed in the behavioral lab could not be accounted for with the standard behaviorist conceptualization (Rescorla and Wagner, 1972). For example, blocking (Kamin, 1969) refers to the fact that an organism only learns about the contingency between two events to the extent that one of the events is unexpected. To account for blocking, Rescorla and Wagner added a crucial ingredient to an associative learning framework, namely prediction error. Prediction error refers to the difference between an external feedback signal indicating the correct response or stimulus on the one hand, and the response or stimulus predicted by the organism on the other. Here it is worth noting the influence (and indeed similarity) of the cybernetic concept of feedback on the formulation of the concept of prediction error. Rescorla and Wagner proposed a formal model which learned by updating associations between events (e.g., stimulus and response) using prediction error (Rescorla and Wagner, 1972). This model formed the basis for many human learning theories (e.g., Kruschke, 2008; Pearce and Hall, 1980; Van Hamme and Wasserman, 1994), and can be represented by the following equations:

$$
\delta_t = \lambda_t - V_t \tag{1}
$$

$$
V_{t+1} = V_t + \alpha \delta_t \tag{2}
$$

where $\delta_t$ is the prediction error, $V_t$ is the prediction of the organism, and $\lambda_t$ is the actual outcome from the environment. Equation 2 shows how the expectation is updated by the prediction error from time point *t* to *t* + 1; $\alpha$ is a learning rate parameter modulating the prediction error.
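Equations (1) and (2) can be illustrated with a minimal sketch. It assumes a single cue whose outcome is 1 on every trial; the function name `rescorla_wagner`, the learning rate of 0.5, and the initial prediction of zero are illustrative assumptions, not values taken from Rescorla and Wagner (1972).

```python
def rescorla_wagner(outcomes, alpha=0.5, v0=0.0):
    """Return the prediction V before the first trial and after every trial.

    outcomes: sequence of actual outcomes (lambda_t), one per trial.
    alpha:    learning rate (assumed value, for illustration only).
    v0:       initial prediction (assumed to be zero).
    """
    v = v0
    history = [v]
    for lam in outcomes:
        delta = lam - v          # Equation (1): prediction error
        v = v + alpha * delta    # Equation (2): update the prediction
        history.append(v)
    return history


# Three trials in which the outcome (lambda) is always 1.
predictions = rescorla_wagner([1.0, 1.0, 1.0])
print(predictions)
```

With these settings the prediction moves halfway toward the outcome on each trial, so the prediction error shrinks as the outcome becomes expected.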

#### **4.2 The connectionist approach**

A second development preparing the cultural ground for the revival of RL was connectionist psychology. Here, the study of psychological phenomena was grounded in the construction of artificial neural networks, i.e. models simulating both the nervous system and cognitive processes, providing what was called a sub-symbolic explanation of cognition. This new field was inspired by the fast-developing neurosciences; in particular, the scientists developing this new branch not only rejected the dogma that theorizing should remain at the behavioral level, but also attempted to bridge the explanatory gap between the biological level of neurons and synapses on the one hand, and the psychological level of language and other forms of high-level cognition on the other.

Reinforcement Learning, High-Level Cognition, and the Human Brain 287

An important step was taken by McClelland, Rumelhart and colleagues (Rumelhart and McClelland, 1986). Models similar to theirs had been developed by other researchers before (Grossberg, 1973) but Rumelhart and McClelland developed a series of applications that made these connectionist models almost instantly influential. At the core of these models is again the Rescorla-Wagner idea that learning consists of updating associations based on prediction errors. However, the authors proposed a generalized learning rule (backpropagation), which allowed learning also for so-called "hidden units", that is, neurons that do not receive external feedback. In backpropagation, such neurons use as a prediction error a linear combination of prediction errors of other neurons that do receive external feedback. This development made the learning rule many orders more powerful than that of Rescorla and Wagner. With the more powerful learning rule, the connectionists were able to investigate linguistic phenomena such as past tense formation (Rumelhart and McClelland, 1987), naming aloud (Seidenberg and McClelland, 1989), and sentence comprehension (St. John and McClelland, 1990).
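The statement that a hidden unit's error is a linear combination of the errors of units receiving external feedback can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `hidden_unit_errors` and the example weights are hypothetical, and the scaling by the derivative of the activation function that full backpropagation also applies is deliberately omitted.

```python
def hidden_unit_errors(output_errors, weights):
    """Propagate output prediction errors back to hidden units.

    output_errors: prediction errors of the units that receive feedback.
    weights[j][k]: connection weight from hidden unit j to output unit k.
    Returns one error per hidden unit: a weighted sum of the output errors.
    """
    return [
        sum(w * err for w, err in zip(row, output_errors))
        for row in weights
    ]


# Two output units and two hidden units (hypothetical numbers).
output_errors = [0.5, -1.0]
weights = [[1.0, 0.0],   # hidden unit 0 projects to output unit 0 only
           [2.0, 1.0]]   # hidden unit 1 projects to both output units
hidden = hidden_unit_errors(output_errors, weights)
print(hidden)
```

Hidden unit 1 receives opposing errors through its two connections, which cancel in this example, so its weights would not change on this trial.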

#### **4.3 The new RL approach**

With these important historical precedents, RL became influential again during the early 1990s, partly because of its important contributions to Machine Learning, a branch of Artificial Intelligence. One of the main protagonists of this revival was Richard Sutton, who developed another generalization of the Rescorla-Wagner rule, called temporal-difference (TD) learning (Sutton, 1988). The original Rescorla-Wagner rule had a *spatial* limitation in the sense that not all neurons received feedback, and this problem was solved by backpropagation. Similarly, the Rescorla-Wagner rule also has a *temporal* limitation in the sense that feedback is not always available to the model, but only when there is explicit supervisory feedback. The TD learning algorithm solved this latter problem, because it allowed learning not only by comparing a prediction with external feedback (which may or may not be available, depending on whether an appropriate teacher is available), but additionally by comparing a prediction with an earlier prediction (which is always available). In this case the learning signal is the TD error (here denoted as $\delta^{\rm TD}$), in which both the comparison between previous prediction and external feedback and the comparison between previous prediction and current prediction play a role. The TD error signal can be written as follows:

$$
\delta_{t+1}^{\rm TD} = \lambda_{t+1} + \gamma V_{t+1} - V_t \tag{3}
$$

where $\lambda$ is the external feedback already defined in Equation (1) and $\gamma$ is a discount factor. The symbol *V* was used before to denote the organism's prediction; in RL applications, it refers specifically to reward prediction. This rule is more powerful than the Rescorla-Wagner rule: for example, Tesauro (1989) demonstrated that a neural network equipped with TD learning can learn to play backgammon at a world-master level.
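A minimal sketch of Equation (3) may help to show how a prediction can be learned from a later prediction when no external feedback is available. It assumes a hypothetical two-state chain (a cue followed by an outcome), a tabular prediction per state, a learning rate of 0.5, and a prediction of zero after the last state; none of these choices come from Sutton (1988).

```python
def td_error(feedback, v_next, v_prev, gamma=0.9):
    """Equation (3): feedback + gamma * V_{t+1} - V_t."""
    return feedback + gamma * v_next - v_prev


def td_episode(values, feedback, alpha=0.5, gamma=1.0):
    """Run one pass through a chain of states and return updated predictions.

    values[i]:   current prediction in state i.
    feedback[i]: external feedback received on leaving state i.
    The prediction after the last state is assumed to be zero.
    """
    values = list(values)
    for i in range(len(values)):
        v_next = values[i + 1] if i + 1 < len(values) else 0.0
        delta = td_error(feedback[i], v_next, values[i], gamma)
        values[i] += alpha * delta
    return values


# State 0 is the cue, state 1 is the outcome; feedback arrives only at the end.
v = [0.0, 0.0]
for _ in range(2):
    v = td_episode(v, feedback=[0.0, 1.0])
print(v)
```

In the first pass only the outcome state learns, because the cue state receives no feedback and the later prediction is still zero. In the second pass the cue state starts to learn from the prediction of the outcome state, even though it never receives external feedback itself.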


Fig. 1. Shifting of dopaminergic activity from reward to predictor of reward (CS). Reprinted with permission from Schultz et al. (1997).


A few years later, the RL paradigm received the decisive boost that brought it back to the attention of the broad scientific community. This derived from its official entrance into the domain of neurophysiology. In particular, with single-unit recording Wolfram Schultz and colleagues discovered dopaminergic neurons in the brainstem ventral tegmental area (VTA) and substantia nigra (SN) of macaque monkeys that exhibited a prediction error signature. In a classical conditioning experiment, Schultz et al. (1993) presented a conditioned stimulus (CS, e.g. a light), followed by an unconditioned stimulus (US, e.g. a drop of juice) some seconds later. Initially, dopaminergic neurons respond to the US only. After some trials, the dopaminergic neurons respond to the CS, but no longer to the US (Figure 1). This backward shift in time is exactly what was predicted by TD learning (Montague et al., 1996). Hence, this strongly suggested that the mammalian nervous system implements an RL (in particular, TD) algorithm to learn associations between stimuli.

#### **5. RL in high-level cognition: Conceptual and empirical advances**

Ever since the seminal findings of Schultz et al. (1993), the marriage between neuroscience and RL has never stopped providing benefits for the study of learning and the nervous system. We here discuss a few highlights from the recent literature.

One conceptual development of RL consisted in the discovery that besides reward value, other value dimensions can be estimated and used to discount reward value (e.g., effort, Kennerley et al., 2006, or delay, Rudebeck et al., 2006). More generally, not only value but also upcoming states of the world can be estimated (Sutton and Barto, 1998). This allows the organism to take more far-sighted actions than with immediate value estimates only. Further, RL models have been proposed with the same computational power as the benchmark backpropagation algorithm (O'Reilly & Frank, 2004; Roelfsema and Van Ooyen, 2005), providing a biologically plausible alternative to backpropagation.

At the empirical level, clever experimental paradigms in combination with modern imaging technology made it possible to demonstrate the validity of RL models for human cognition. Using fMRI, Seymour et al. (2004) identified a TD signal in the human brain, similar to what was found by Schultz and colleagues in the monkey brain. Seymour et al. used a cued pain learning paradigm, in which a first CS (CS1) predicted (statistically) a second CS (CS2), which then (deterministically) predicted the upcoming pain level. In the striatum (ventral putamen), they observed a pain prediction error signal which responded to CS1 onset, and to CS2 if it differed from CS1 (i.e., was unpredicted based on CS1). Similar paradigms were used with appetitive learning (O'Doherty et al., 2003). The TD learning framework has also been applied extensively to EEG data, in particular the error-related negativity (ERN), for example in the work of Holroyd and Coles (2002). These authors successfully compared the performance of a TD learning-based computational model with the dynamics of the ERN from human volunteers. The model aimed to clarify the roles of anterior cingulate cortex (ACC) and the ventral striatal structures in an instrumental conditioning paradigm. The authors proposed that the ventral striatum implements TD learning in order to estimate the value of external stimuli in terms of expected reward, while the ACC functions as a filter of several possible motor responses. In their proposal, the ACC would select the motor plans that are expected to be the most effective to achieve future rewards, based on the reward predictions computed by the ventral striatum. In this model, the ERN would be the result of ACC activity following the suppression of dopaminergic input from the ventral striatum. In a series of EEG experiments, Holroyd and Coles showed that their model was indeed able to predict several effects linked to the ERN, for example the fact that this EEG component appears only when there is a violation of the reward prediction.

Another example comes from the study of Parkinson's disease, a neurological disorder whose pathological basis consists of the degeneration of the dopaminergic neurons in the substantia nigra pars compacta, source of the main brainstem input to basal ganglia. Parkinson's patients are impaired in learning from positive outcomes (reward), while performance is preserved for learning based on negative outcomes (punishment) (Frank et al., 2004). A neural model was proposed representing the interactions between basal ganglia, cortex and substantia nigra (Frank, 2005). In this model, the basal ganglia consist of two neural populations: "Go" neurons fire when an action planned in cortex is allowed to be implemented, whereas "No Go" neurons suppress the action planned in the cortex. Both Go and No Go populations learn by dopaminergic (i.e., reinforcement-related) bursts and dips coming from the substantia nigra. One of the advantages of the model is that it explains several symptoms of Parkinson's disease. For instance, with reduced dopaminergic input (simulating the degeneration of the substantia nigra), the basal ganglia are impaired at learning in Go neural populations, and hence impaired specifically in learning by rewards, just like human Parkinson's patients. In addition, the model successfully predicts that this distinction between Go versus No Go learning holds true in high-level cognition as well (Frank et al., 2004).

Finally, it is worth describing briefly the work of Gläscher et al. (2010), which showed, by a combined computational and fMRI study, that the human brain also implements RL-like algorithms for creating abstract models of the environment. This study resembled the historical experiment of Tolman (1948). Volunteers were at first exposed to a simplified artificial environment, in which each single state was represented by an abstract figure (a fractal) (Figure 2). The subjects were asked to "navigate" inside this environment by performing binary choices (left or right). Each choice was followed by a transition to one of two possible states, each with some probability. In the first part of the experiment, subjects freely navigated in this environment, resembling Tolman's latent learning phase. In the second part, subjects received a monetary reward in some of the final states. In this way they had to exploit the latent learning acquired during the first part of the experiment to maximize reward, again as in Tolman's paradigm. Through a model-based analysis of the fMRI signal from both experimental phases, the authors localized the brain regions involved in both the latent learning (leading to a cognitive map or model of the environment) and the subsequent model-driven RL. While the RL-related areas were those typically found in the literature (ventral striatum and dopaminergic system), the areas involved in the formation of cognitive maps were the dorsolateral prefrontal cortex and intraparietal sulci.

The merit of this work consisted not only in the localization of two separate circuits for RL and environment-model (cognitive map) learning, but also in the demonstration that the two processes can be based on very similar computational mechanisms. One is the already described "prediction error" (Equation (1)), the other the "state prediction error". The latter is formally similar to the prediction error (comparison between predictions and real outcomes), but it deals with environmental state transitions. The mathematical form of the state prediction error (SPE) is the following (note the similarity with Equation (1)):

$$\delta_t^{\rm SPE} = 1 - T_t\left(s, a, s^\prime\right) \tag{4}$$
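Equation (4) can be illustrated with a small sketch in which the transition estimates are stored in a table. The table layout, the state and action names, the learning rate `eta`, and in particular the update rule applied after computing the error are assumptions made for illustration; the text above only specifies the form of the state prediction error itself.

```python
def state_prediction_error(T, s, a, s_next):
    """Equation (4): 1 - T(s, a, s')."""
    return 1.0 - T[(s, a)][s_next]


def update_transitions(T, s, a, s_next, eta=0.5):
    """Assumed update: move T(s, a, .) toward the observed transition.

    The observed successor gains eta * SPE; every other successor is
    scaled by (1 - eta), so the estimates for (s, a) still sum to one.
    Returns the state prediction error computed before the update.
    """
    spe = state_prediction_error(T, s, a, s_next)
    row = T[(s, a)]
    for candidate in row:
        if candidate == s_next:
            row[candidate] += eta * spe
        else:
            row[candidate] *= (1.0 - eta)
    return spe


# Hypothetical environment: choosing "left" in s0 leads to s1 or s2.
T = {("s0", "left"): {"s1": 0.5, "s2": 0.5}}
spe = update_transitions(T, "s0", "left", "s1")
print(spe, T[("s0", "left")])
```

The error is large when the observed transition was considered unlikely and approaches zero once the transition is fully expected, which mirrors the role of the prediction error in Equation (1).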


poorly understood. To tackle this issue, researchers have tried recasting executive functioning in neural models. We will describe these models and demonstrate how the union of RL and connectionist models provides steps toward understanding the neural basis

Working within the connectionist framework, Cohen et al. (1990) proposed a model of the Stroop task, a widely used index of cognitive control. In this task, subjects are shown a color word in a given ink color, with the color and word either congruent (e.g., the word RED written in red), or incongruent (e.g., the word RED written in green). The subject's task is to name the ink color. Because word reading is automatic in literate adults, cognitive control is required to override the automatic tendency to read the word. Although subjects can do this, a congruency cost is typically observed, with incongruent trials slower than congruent ones. The Stroop task is widely used in clinical contexts to assess executive functioning, and differentiates between healthy subjects and various patient groups suffering from impairments in cognitive control (e.g., ADHD, Willcutt et al., 2005; Parkinson's disease, Bonnin et al., 2010). In the Cohen et al. model, a distinction is made between an input layer for the relevant dimension and an input layer for the irrelevant dimension, each projecting to a response layer. Crucially, Cohen et al. added task demand units which bias responding toward the relevant dimension (input layer). This brings top-down modulation in an

Botvinick et al. (2001) further developed the model of Cohen et al. They argued that the earlier model did not specify when cognitive control is required. In particular, cognitive control is required only on incongruent trials (e.g., RED written in green), not on congruent ones. For this purpose, they introduced the notion of response conflict, measuring the extent to which responses are simultaneously active. They proposed the conflict monitoring model, according to which response conflict is calculated in anterior cingulate cortex (ACC). Conceptually, this was an advance over the previous model, because not only top-down modulation but also the trial-by-trial cognitive control could be captured in an associative learning framework. In addition, it has been highly influential and allowed accounting for many data. For example, using fMRI Botvinick et al. (1999) demonstrated that human ACC was more active on incongruent trials following a congruent trial than on incongruent trials following an incongruent trial. This finding contradicted the popular notion that ACC activity reflects executive control itself (because the subject should be more "controlled" after an incongruent trial), but was in line with the conflict monitoring model because there should be more conflict after a congruent trial. Note that this model is also a control model in the cybernetic sense mentioned before: It detects when something goes wrong, and when

Verguts and Notebaert (2008, 2009) further developed this line of work. They started from the fact that the conflict monitoring model specifies when control should be exerted, but not where (see also Blais et al., 2007). To confront this issue, the authors proposed a neural model in which the implementation of cognitive control was based on an error signal modulating the Hebbian learning between active model neurons. This error signal was, like in Gläscher et al.'s work, borrowed from the RL domain. The new measure, which could be called "conflict prediction error", was computed by comparing the actual amount of conflict, evoked by a stimulus, with the expected mean amount of conflict. This model successfully predicted that cognitive control should not extend across different task input dimensions

of cognitive control.

associative model framework.

so, it leads to adaptation in the system.

**6.1 Associative models of cognitive control** 

where the value 1 is the probability of the observed current state *s'* (which, once observed, is certain), and $T_t(s, a, s')$ is the expected probability of the transition from the previous state *s* to the current state *s'* given the chosen action *a*. The expectation *T* is updated by means of the state prediction error:

$$T_{t+1}\left(s, a, s'\right) = T_t\left(s, a, s'\right) + \alpha\,\delta_t^{\rm SPE} \tag{5}$$
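To make Equations (4) and (5) concrete, the following is a minimal sketch in Python rather than the implementation used by Gläscher et al. The function name `spe_update`, the nested-dictionary representation of *T*, the toy state and action labels, and the learning-rate value are all hypothetical. The rescaling of the unobserved successor states is an added assumption, not stated in the text, that keeps the transition probabilities for a given state and action summing to 1.

```python
def spe_update(T, s, a, s_next, alpha=0.1):
    """Apply Equations (4) and (5) to one observed transition (s, a, s_next).

    T is a nested dict: T[s][a][s2] = estimated probability of reaching s2.
    Returns the state prediction error; T is updated in place.
    """
    row = T[s][a]
    delta_spe = 1.0 - row[s_next]      # Equation (4)
    row[s_next] += alpha * delta_spe   # Equation (5)
    # Assumption (not stated in the text): shrink the unobserved successors
    # so that the probabilities for (s, a) still sum to 1.
    for s2 in row:
        if s2 != s_next:
            row[s2] *= (1.0 - alpha)
    return delta_spe


# Hypothetical toy problem: 3 states, 1 action, uniform initial estimate.
states = ["A", "B", "C"]
T = {s: {"go": {s2: 1.0 / len(states) for s2 in states}} for s in states}

# Repeatedly observe the same transition A --go--> B.
for trial in range(3):
    delta = spe_update(T, "A", "go", "B", alpha=0.5)
    print(trial, round(delta, 3), round(T["A"]["go"]["B"], 3))
```

When the same transition is observed repeatedly, the state prediction error shrinks on every trial while the estimated transition probability approaches 1, just as a reward prediction error shrinks once a reward is fully predicted.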

In conclusion, this work showed that the prediction error and the state prediction error are similar computations carried out in different brain circuits. This suggested a neurophysiological and computational basis for the discoveries Tolman had made sixty years earlier.

Fig. 2. Formal structure of state space in Gläscher et al.'s experiment. Reprinted with permission from Gläscher et al. (2010).

#### **6. A case study: RL, cognitive control, and anterior cingulate cortex**

One of the remaining mysteries of the human mind is executive functioning or cognitive control – the rapid modulation of behavior when called for by unexpected circumstances. In the symbol-manipulation paradigm, executive functioning was proposed to originate from a central executive endowed with two or more "slave systems" (typically the visuospatial sketchpad and the phonological loop; Baddeley and Hitch, 1974). Detailed models of the slave systems have been developed (Burgess and Hitch, 1999), and in general great progress has been made in understanding them. However, the role of the central executive has remained poorly understood. To tackle this issue, researchers have tried recasting executive functioning in neural models. We will describe these models and demonstrate how the union of RL and connectionist models provides steps toward understanding the neural basis of cognitive control.

##### **6.1 Associative models of cognitive control**


Working within the connectionist framework, Cohen et al. (1990) proposed a model of the Stroop task, a widely used index of cognitive control. In this task, subjects are shown a color word in a given ink color, with the color and word either congruent (e.g., the word RED written in red), or incongruent (e.g., the word RED written in green). The subject's task is to name the ink color. Because word reading is automatic in literate adults, cognitive control is required to override the automatic tendency to read the word. Although subjects can do this, a congruency cost is typically observed, with responses on incongruent trials slower than on congruent ones. The Stroop task is widely used in clinical contexts to assess executive functioning, and differentiates between healthy subjects and various patient groups suffering from impairments in cognitive control (e.g., ADHD, Willcutt et al., 2005; Parkinson's disease, Bonnin et al., 2010). In the Cohen et al. model, a distinction is made between an input layer for the relevant dimension and an input layer for the irrelevant dimension, each projecting to a response layer. Crucially, Cohen et al. added task demand units which bias responding toward the relevant dimension (input layer). This brings top-down modulation into an associative model framework.
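The role of the task demand units can be illustrated with a heavily simplified sketch. It is not the Cohen et al. (1990) network itself: the function names, the single layer of pathway units, and every numerical value (weights, bias, gains) are hypothetical choices made only to show the principle. The word pathway is given the stronger weight to stand in for the automaticity of reading, and task demand input lifts only the units of the task-relevant pathway.

```python
import math

COLORS = ("red", "green")


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def response_evidence(ink, word, task, w_color=2.0, w_word=3.0,
                      bias=-6.0, input_gain=4.0, task_gain=4.0):
    """Return {response: evidence} for one Stroop stimulus.

    task is "color", "word", or None (no task demand input).
    All parameter values are illustrative assumptions.
    """
    evidence = {}
    for c in COLORS:
        # Pathway units rest at a negative bias; task demand input raises
        # only the units of the task-relevant pathway.
        h_color = logistic(bias + input_gain * (ink == c)
                           + task_gain * (task == "color"))
        h_word = logistic(bias + input_gain * (word == c)
                          + task_gain * (task == "word"))
        evidence[c] = w_color * h_color + w_word * h_word
    return evidence


def margin(ink, word, task):
    """Evidence for the ink-color response minus evidence for the other one."""
    ev = response_evidence(ink, word, task)
    other = "green" if ink == "red" else "red"
    return ev[ink] - ev[other]


print(margin("red", "red", "color"))    # congruent, color naming
print(margin("red", "green", "color"))  # incongruent, color naming
print(margin("red", "green", None))     # incongruent, no task demand input
```

With these illustrative values, the margin in favor of the ink-color response is smaller on the incongruent than on the congruent stimulus, a rough stand-in for the congruency cost. Without task demand input the margin turns negative, because the stronger word pathway dominates.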

Botvinick et al. (2001) further developed the model of Cohen et al. They argued that the earlier model did not specify when cognitive control is required. In particular, cognitive control is required only on incongruent trials (e.g., RED written in green), not on congruent ones. For this purpose, they introduced the notion of response conflict, which measures the extent to which responses are simultaneously active. They proposed the conflict monitoring model, according to which response conflict is calculated in anterior cingulate cortex (ACC). Conceptually, this was an advance over the previous model, because not only top-down modulation but also the trial-by-trial adjustment of cognitive control could be captured in an associative learning framework. In addition, the model has been highly influential and has accounted for a wide range of data. For example, using fMRI, Botvinick et al. (1999) demonstrated that human ACC was more active on incongruent trials following a congruent trial than on incongruent trials following an incongruent trial. This finding contradicted the popular notion that ACC activity reflects executive control itself (because the subject should be more "controlled" after an incongruent trial), but was in line with the conflict monitoring model because there should be more conflict after a congruent trial. Note that this model is also a control model in the cybernetic sense mentioned before: it detects when something goes wrong and, if so, it leads to adaptation in the system.
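The idea of response conflict as simultaneous activation can be sketched in a few lines. The text does not give the exact formula used by Botvinick et al. (2001), so the pairwise-product measure below is an assumption adopted for illustration, and the function name, the `inhibition` parameter, and the activation values are hypothetical.

```python
def response_conflict(activations, inhibition=1.0):
    """Sum of pairwise products of response activations, scaled by inhibition.

    The product is large only when two competing responses are active
    at the same time (an assumed, simplified conflict measure).
    """
    units = list(activations.values())
    total = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            total += inhibition * units[i] * units[j]
    return total


# Hypothetical response-layer activations on two kinds of Stroop trial.
congruent = {"red": 0.9, "green": 0.1}     # one response clearly dominates
incongruent = {"red": 0.6, "green": 0.5}   # both responses partly active

print(response_conflict(congruent))
print(response_conflict(incongruent))
```

Under this measure conflict is zero when only one response is active and grows as competing responses become co-active, which is the signal the conflict monitoring model attributes to ACC.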

Verguts and Notebaert (2008, 2009) further developed this line of work. They started from the observation that the conflict monitoring model specifies when control should be exerted, but not where (see also Blais et al., 2007). To address this issue, the authors proposed a neural model in which the implementation of cognitive control was based on an error signal modulating the Hebbian learning between active model neurons. This error signal was, as in Gläscher et al.'s work, borrowed from the RL domain. The new measure, which could be called "conflict prediction error", was computed by comparing the actual amount of conflict evoked by a stimulus with the expected mean amount of conflict. This model successfully predicted that cognitive control should not extend across different task input dimensions (Notebaert and Verguts, 2008) or even across task effectors (Braem et al., in press). Consistent with the model, it was recently demonstrated that ACC responds to item-specific congruencies, not block-level congruencies (Blais and Bunge, 2011).

them. The general learning rule described above implements such a scheme. Because of the Hebbian component (input and output cells active together), individual synapses (which connect input and output neurons) are selected; and because of the value signal, the most appropriate synapses are chosen.

Just like in Darwinism applied to natural evolution, one key ingredient of ND is variation (called degeneracy by Edelman, 1978), or exploration when the unit of variation is not the individual synapse but rather the response (Aston-Jones & Cohen, 2005). From this variation, a selection can be made, based on an appropriate value signal. Computationally, Dehaene et al. (1987) demonstrated that temporal sequence learning can be achieved by such a variation-and-selection process. In neuroimaging, Daw et al. (2006) demonstrated that frontopolar cortex was engaged when subjects were in an exploration (rather than exploitation) phase of learning. With a few exceptions, however, variation and selection remain poorly studied. Given that variation and selection is a key component of RL, we suggest that its further study will teach us much more about high-level cognition and its implementation in the human brain.

**8. Acknowledgements**

MS and TV were supported by BOF/GOA Grant BOF08/GOA/011.

**9. References**

Arbuthnott, G. W., Ingham, C. A., & Wickens, J. R. (2000). Dopamine and synaptic plasticity in the neostriatum. *Journal of Anatomy, 196,* 587-596.

Ashby, F. G., Ennis, J. M., & Spiering, B. J. (2007). A neurobiological theory of automaticity in perceptual categorization. *Psychological Review, 114,* 632-656.

Aston-Jones, G., & Cohen, J. D. (2005). An integrative theory of locus coeruleus-norepinephrine function: Adaptive gain and optimal performance. *Annual Review of Neuroscience,* 28, 403-450.

Atkinson, R.C., and Shiffrin, R.M. (1968). "Human memory: A proposed system and its control processes," in *The psychology of learning and motivation*. (New York: Academic Press), 89–195.

Baddeley, A.D., and Hitch, G. (1974). "Working memory," in *The psychology of learning and motivation: Advances in research and theory,* ed. G.H. Bower. (New York: Academic Press).

Behrens, T.E., Woolrich, M.W., Walton, M.E., and Rushworth, M.F. (2007). Learning the value of information in an uncertain world. *Nat Neurosci* 10, 1214-1221.

Blais, C., and Bunge, S. (2011). Behavioral and neural evidence for item-specific performance monitoring. *J Cogn Neurosci* 22, 2758-2767.

Blais, C., Robidoux, S., Risko, E.F., and Besner, D. (2007). Item-specific adaptation and the conflict-monitoring hypothesis: a computational model. *Psychol Rev* 114, 1076-1086.

Bonnin, C.A., Houeto, J.L., Gil, R., and Bouquet, C.A. (2010). Adjustments of conflict monitoring in Parkinson's disease. *Neuropsychology* 24, 542-546.

Botvinick, M., Braver, T.S., Barch, D.M., Carter, C.S., and Cohen, J.D. (2001). Conflict monitoring and cognitive control. *Psychol Rev* 108, 624-652.

Botvinick, M., Nystrom, L.E., Fissell, K., Carter, C.S., and Cohen, J.D. (1999). Conflict monitoring versus selection-for-action in anterior cingulate cortex. *Nature* 402, 179-181.
