Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain1,2,3. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning4,5,6. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
Reinforcement learning models of the basal ganglia map the phasic dopamine signal to reward prediction errors (RPEs). Conventional models assert that, when a stimulus reliably predicts a reward with fixed delay, dopamine activity during the delay period and at reward time should converge to baseline through learning. However, recent studies have found that dopamine exhibits a gradual ramp before reward in certain conditions even after extensive learning, such as when animals are trained to run to obtain the reward, thus challenging the conventional RPE models. In this work, we begin with the limitation of temporal uncertainty (animals cannot perfectly estimate time to reward), and show that sensory feedback, which reduces this uncertainty, will cause an unbiased learner to produce RPE ramps. On the other hand, in the absence of feedback, RPEs will be flat after learning. These results reconcile the seemingly conflicting data on dopamine behaviors under the RPE hypothesis.
Rapid phasic activity of midbrain dopamine neurons are thought to signal reward prediction errors (RPEs), resembling temporal difference errors used in machine learning. Recent studies describing slowly increasing dopamine signals have instead proposed that they represent state values and arise independently from somatic spiking activity. Here, we developed novel experimental paradigms using virtual reality that disambiguate RPEs from values. We examined the dopamine circuit activity at various stages including somatic spiking, axonal calcium signals, and striatal dopamine concentrations. Our results demonstrate that ramping dopamine signals are consistent with RPEs rather than value, and this ramping is observed at all the stages examined. We further show that ramping dopamine signals can be driven by a dynamic stimulus that indicates a gradual approach to a reward. We provide a unified computational understanding of rapid phasic and slowly ramping dopamine signals: dopamine neurons perform a derivative-like computation over values on a moment-by-moment basis.
Midbrain dopamine signals are widely thought to report reward prediction errors that drive learning in the basal ganglia. However, dopamine has also been implicated in various probabilistic computations, such as encoding uncertainty and controlling exploration. Here, we show how these different facets of dopamine signalling can be brought together under a common reinforcement learning framework. The key idea is that multiple sources of uncertainty impinge on reinforcement learning computations: uncertainty about the state of the environment, the parameters of the value function and the optimal action policy. Each of these sources plays a distinct role in the prefrontal cortex–basal ganglia circuit for reinforcement learning and is ultimately reflected in dopamine activity. The view that dopamine plays a central role in the encoding and updating of beliefs brings the classical prediction error theory into alignment with more recent theories of Bayesian reinforcement learning.
The ability to predict future outcomes increases the fitness of the animal. Decades of research have shown that dopamine neurons broadcast reward prediction error (RPE) signals—the discrepancy between actual and predicted reward—to drive learning to predict future outcomes. Recent studies have begun to show, however, that dopamine neurons are more diverse than previously thought. In this review, we will summarize a series of our studies that have shown unique properties of dopamine neurons projecting to the posterior “tail” of the striatum (TS) in terms of anatomy, activity, and function. Specifically, TS-projecting dopamine neurons are activated by a subset of negative events including threats from a novel object, send prediction errors for external threats, and reinforce avoidance behaviors. These results indicate that there are at least two axes of dopamine-mediated reinforcement learning in the brain—one learning from canonical RPEs and another learning from threat prediction errors. We argue that the existence of multiple learning systems is an adaptive strategy that makes possible each system optimized for its own needs. The compartmental organization in the mammalian striatum resembles that of a dopamine-recipient area in insects (mushroom body), pointing to a principle of dopamine function conserved across phyla.
Learning to predict future outcomes is critical for driving appropriate behaviors. Reinforcement learning (RL) models have successfully accounted for such learning, relying on reward prediction errors (RPEs) signaled by midbrain dopamine neurons. It has been proposed that when sensory data provide only ambiguous information about which state an animal is in, it can predict reward based on a set of probabilities assigned to hypothetical states (called the belief state). Here we examine how dopamine RPEs and subsequent learning are regulated under state uncertainty. Mice are first trained in a task with two potential states defined by different reward amounts. During testing, intermediate-sized rewards are given in rare trials. Dopamine activity is a non-monotonic function of reward size, consistent with RL models operating on belief states. Furthermore, the magnitude of dopamine responses quantitatively predicts changes in behavior. These results establish the critical role of state inference in RL.
Midbrain dopamine neurons are well known for their role in reward-based reinforcement learning. We found that the activity of dopamine axons in the posterior tail of the striatum (TS) scaled with the novelty and intensity of external stimuli, but did not encode reward value. We demonstrated that the ablation of TS-projecting dopamine neurons specifically inhibited avoidance of novel or high-intensity stimuli without affecting animals’ initial avoidance responses, suggesting a role in reinforcement rather than simply in avoidance itself. Furthermore, we found that animals avoided optogenetic activation of dopamine axons in TS during a choice task and that this stimulation could partially reinstate avoidance of a familiar object. These results suggest that TS-projecting dopamine neurons reinforce avoidance of threatening stimuli. More generally, our results indicate that there are at least two axes of reinforcement learning using dopamine in the striatum: one based on value and one based on external threat.
Animals make predictions based on currently available information. In natural settings, sensory cues may not reveal complete information, requiring the animal to infer the “hidden state” of the environment. The brain structures important in hidden state inference remain unknown. A previous study showed that midbraindopamine neurons exhibit distinct response patterns depending on whether reward is delivered in 100% (task 1) or 90% of trials (task 2) in a classical conditioning task. Here we found that inactivation of the medial prefrontal cortex (mPFC) affected dopaminergic signaling in task 2, in which the hidden state must be inferred (“will reward come or not?”), but not in task 1, where the state was known with certainty. Computational modeling suggests that the effects of inactivation are best explained by a circuit in which the mPFC conveys inference over hidden states to the dopamine system.
Parenting is essential for the survival and wellbeing of mammalian offspring. However, we lack a circuit-level understanding of how distinct components of this behaviour are coordinated. Here we investigate how galanin-expressing neurons in the medial preoptic area (MPOAGal) of the hypothalamus coordinate motor, motivational, hormonal and social aspects of parenting in mice. These neurons integrate inputs from a large number of brain areas and the activation of these inputs depends on the animal's sex and reproductive state. Subsets of MPOAGal neurons form discrete pools that are defined by their projection sites. While the MPOAGalpopulation is active during all episodes of parental behaviour, individual pools are tuned to characteristic aspects of parenting. Optogenetic manipulation of MPOAGal projections mirrors this specificity, affecting discrete parenting components. This functional organization, reminiscent of the control of motor sequences by pools of spinal cord neurons, provides a new model for how discrete elements of a social behaviour are generated at the circuit level.
Although modified rabies viruses have emerged as a powerful tool for tracing the inputs to genetically defined populations of neurons, the toxicity of the virus has limited its utility. A recent study employed a self-inactivating rabies (SiR) virus that enables recording or manipulation of targeted neurons for months.
Dopamine neurons facilitate learning by calculating reward prediction error, or the difference between expected and actual reward. Despite two decades of research, it remains unclear how dopamine neurons make this calculation. Here we review studies that tackle this problem from a diverse set of approaches, from anatomy to electrophysiology to computational modeling and behavior. Several patterns emerge from this synthesis: that dopamine neurons themselves calculate reward prediction error, rather than inherit it passively from upstream regions; that they combine multiple separate and redundant inputs, which are themselves interconnected in a dense recurrent network; and that despite the complexity of inputs, the output from dopamine neurons is remarkably homogeneous and robust. The more we study this simple arithmetic computation, the knottier it appears to be, suggesting a daunting (but stimulating) path ahead for neuroscience more generally.
Midbrain dopamine neurons signal reward prediction error (RPE), or actual minus expected reward. The temporal difference (TD) learning model has been a cornerstone in understanding how dopamine RPEs could drive associative learning. Classically, TD learning imparts value to features that serially track elapsed time relative to observable stimuli. In the real world, however, sensory stimuli provide ambiguous information about the hidden state of the environment, leading to the proposal that TD learning might instead compute a value signal based on an inferred distribution of hidden states (a 'belief state'). Here we asked whether dopaminergic signaling supports a TD learning framework that operates over hidden states. We found that dopamine signaling showed a notable difference between two tasks that differed only with respect to whether reward was delivered in a deterministic manner. Our results favor an associative learning rule that combines cached values with hidden-state inference.
Our motor outputs are constantly re-calibrated to adapt to systematic perturbations. This motor adaptation is thought to depend on the ability to form a memory of a systematic perturbation, often called an internal model. However, the mechanisms underlying the formation, storage, and expression of such models remain unknown. Here, we developed a mouse model to study forelimb adaptation to force field perturbations. We found that temporally precise photoinhibition of somatosensory cortex (S1) applied concurrently with the force field abolished the ability to update subsequent motor commands needed to reduce motor errors. This S1 photoinhibition did not impair basic motor patterns, post-perturbation completion of the action, or their performance in a reward-based learning task. Moreover, S1 photoinhibition after partial adaptation blocked further adaptation, but did not affect the expression of already-adapted motor commands. Thus, S1 is critically involved in updating the memory about the perturbation that is essential for forelimb motor adaptation.
Dopamine neurons are thought to encode novelty in addition to reward prediction error (the discrepancy between actual and predicted values). In this study, we compared dopamine activity across the striatum using fiber fluorometry in mice. During classical conditioning, we observed opposite dynamics in dopamine axon signals in the ventral striatum (‘VS dopamine’) and the posterior tail of the striatum (‘TS dopamine’). TS dopamine showed strong excitation to novel cues, whereas VS dopamine showed no responses to novel cues until they had been paired with a reward. TS dopamine cue responses decreased over time, depending on what the cue predicted. Additionally, TS dopamine showed excitation to several types of stimuli including rewarding, aversive, and neutral stimuli whereas VS dopamine showed excitation only to reward or reward-predicting cues. Together, these results demonstrate that dopamine novelty signals are localized in TS along with general salience signals, while VS dopamine reliably encodes reward prediction error.
Dopamine neurons encode the difference between actual and predicted reward, or reward prediction error (RPE). Although many models have been proposed to account for this computation, it has been difficult to test these models experimentally. Here we established an awake electrophysiological recording system, combined with rabies virusand optogenetic cell-type identification, to characterize the firing patterns of monosynaptic inputs to dopamine neurons while mice performed classical conditioningtasks. We found that each variable required to compute RPE, including actual and predicted reward, was distributed in input neurons in multiple brain areas. Further, many input neurons across brain areas signaled combinations of these variables. These results demonstrate that even simple arithmetic computations such as RPE are not localized in specific brain areas but, rather, distributed across multiple nodes in a brain-wide network. Our systematic method to examine both activity and connectivity revealed unexpected redundancy for a simple computation in the brain.
Dopamine is thought to regulate learning from appetitive and aversive events. Here we examined how optogenetically-identified dopamine neurons in the lateral ventral tegmental area of mice respond to aversive events in different conditions. In low reward contexts, most dopamine neurons were exclusively inhibited by aversive events, and expectation reduced dopamine neurons’ responses to reward and punishment. When a single odor predicted both reward and punishment, dopamine neurons’ responses to that odor reflected the integrated value of both outcomes. Thus, in low reward contexts, dopamine neurons signal value prediction errors (VPEs) integrating information about both reward and aversion in a common currency. In contrast, in high reward contexts, dopamine neurons acquired a short-latency excitation to aversive events that masked their VPE signaling. Our results demonstrate the importance of considering the contexts to examine the representation in dopamine neurons and uncover different modes of dopamine signaling, each of which may be adaptive for different environments.
Neurons in higher cortical areas, such as the prefrontal cortex, are often tuned to a variety of sensory and motor variables, and are therefore said to display mixed selectivity. This complexity of single neuron responses can obscure what information these areas represent and how it is represented. Here we demonstrate the advantages of a new dimensionality reduction technique, demixed principal component analysis (dPCA), that decomposes population activity into a few components. In addition to systematically capturing the majority of the variance of the data, dPCA also exposes the dependence of the neural representation on task parameters such as stimuli, decisions, or rewards. To illustrate our method we reanalyze population data from four datasets comprising different species, different cortical areas and different experimental tasks. In each case, dPCA provides a concise way of visualizing the data that summarizes the task-dependent features of the population response in a single figure.