In the brain, dopamine is thought to drive reward-based learning by signaling temporal difference reward prediction errors (TD errors), a ‘teaching signal’ used to train computers. Recent studies using optogenetic manipulations have provided multiple pieces of evidence supporting that phasic dopamine signals function as TD errors. Furthermore, novel experimental results have indicated that when the current state of the environment is uncertain, dopamine neurons compute TD errors using ‘belief states’ or a probability distribution over potential states. It remains unclear how belief states are computed but emerging evidence suggests involvement of the prefrontal cortex and the hippocampus. These results refine our understanding of the role of dopamine in learning and the algorithms by which dopamine functions in the brain.
Rapid phasic activity of midbrain dopamine neurons is thought to signal reward prediction errors (RPEs), resembling temporal difference errors used in machine learning. However, recent studies describing slowly increasing dopamine signals have instead proposed that they represent state values and arise independent from somatic spiking activity. Here we developed experimental paradigms using virtual reality that disambiguate RPEs from values. We examined dopamine circuit activity at various stages, including somatic spiking, calcium signals at somata and axons, and striatal dopamine concentrations. Our results demonstrate that ramping dopamine signals are consistent with RPEs rather than value, and this ramping is observed at all stages examined. Ramping dopamine signals can be driven by a dynamic stimulus that indicates a gradual approach to a reward. We provide a unified computational understanding of rapid phasic and slowly ramping dopamine signals: dopamine neurons perform a derivative-like computation over values on a moment-by-moment basis.
Reinforcement-learning theories provide a normative perspective on learning and decision-making. In the 1990s, neurophysiology experiments revealed an exceptional correspondence between the activity of midbrain dopamine neurons and the reward prediction error (RPE) signal used to train computers in a reinforcement-learning algorithm called temporal difference (TD) learning. Studies of midbrain dopamine neurons play a pivotal role at the interface of empirical and theoretical studies. A theoretical framework for reinforcement learning has facilitated the interpretation of neurophysiology data and has guided the design of future studies. Here we discuss recent developments in the interplay between experimental findings and theories of dopamine signaling. In particular, recent studies emphasize the importance of state uncertainty in the neurobiological implementation of reinforcement learning.
It has been proposed that the activity of dopamine neurons approximates temporal difference (TD) prediction error, a teaching signal developed in reinforcement learning, a field of machine learning. However, whether this similarity holds true during learning remains elusive. In particular, some TD learning models predict that the error signal gradually shifts backward in time from reward delivery to a reward-predictive cue, but previous experiments failed to observe such a gradual shift in dopamine activity. Here we demonstrate conditions in which such a shift can be detected experimentally. These shared dynamics of TD error and dopamine activity narrow the gap between machine learning theory and biological brains, tightening a long-sought link.
Different regions of the striatum regulate different types of behavior. However, how dopamine signals differ across striatal regions and how dopamine regulates different behaviors remain unclear. Here, we compared dopamine axon activity in the ventral, dorsomedial, and dorsolateral striatum, while mice performed a perceptual and value-based decision task. Surprisingly, dopamine axon activity was similar across all three areas. At a glance, the activity multiplexed different variables such as stimulus-associated values, confidence and reward feedback at different phases of the task. Our modeling demonstrates, however, that these modulations can be inclusively explained by moment-by-moment changes in the expected reward, i.e. the temporal difference error. A major difference between areas was the overall activity level of reward responses: reward responses in dorsolateral striatum were positively shifted, lacking inhibitory responses to negative prediction errors. The differences in dopamine signals put specific constraints on the properties of behaviors controlled by dopamine in these regions.
Learning about rewards and punishments is critical for survival. Classical studies have demonstrated an impressive correspondence between the firing of dopamine neurons in the mammalian midbrain and the reward prediction errors of reinforcement learning algorithms, which express the difference between actual reward and predicted mean reward. However, it may be advantageous to learn not only the mean but also the complete distribution of potential rewards. Recent advances in machine learning have revealed a biologically plausible set of algorithms for reconstructing this reward distribution from experience. Here, we review the mathematical foundations of these algorithms as well as initial evidence for their neurobiological implementation. We conclude by highlighting outstanding questions regarding the circuit computation and behavioral readout of these distributional codes.
Learning from successes and failures often improves the quality of subsequent decisions. Past outcomes, however, should not influence purely perceptual decisions after task acquisition is complete since these are designed so that only sensory evidence determines the correct choice. Yet, numerous studies report that outcomes can bias perceptual decisions, causing spurious changes in choice behavior without improving accuracy. Here we show that the effects of reward on perceptual decisions are principled: past rewards bias future choices specifically when previous choice was difficult and hence decision confidence was low. We identified this phenomenon in six datasets from four laboratories, across mice, rats, and humans, and sensory modalities from olfaction and audition to vision. We show that this choice-updating strategy can be explained by reinforcement learning models incorporating statistical decision confidence into their teaching signals. Thus, reinforcement learning mechanisms are continually engaged to produce systematic adjustments of choices even in well-learned perceptual decisions in order to optimize behavior in an uncertain world.
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain1,2,3. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning4,5,6. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
Reinforcement learning models of the basal ganglia map the phasic dopamine signal to reward prediction errors (RPEs). Conventional models assert that, when a stimulus reliably predicts a reward with fixed delay, dopamine activity during the delay period and at reward time should converge to baseline through learning. However, recent studies have found that dopamine exhibits a gradual ramp before reward in certain conditions even after extensive learning, such as when animals are trained to run to obtain the reward, thus challenging the conventional RPE models. In this work, we begin with the limitation of temporal uncertainty (animals cannot perfectly estimate time to reward), and show that sensory feedback, which reduces this uncertainty, will cause an unbiased learner to produce RPE ramps. On the other hand, in the absence of feedback, RPEs will be flat after learning. These results reconcile the seemingly conflicting data on dopamine behaviors under the RPE hypothesis.
Midbrain dopamine signals are widely thought to report reward prediction errors that drive learning in the basal ganglia. However, dopamine has also been implicated in various probabilistic computations, such as encoding uncertainty and controlling exploration. Here, we show how these different facets of dopamine signalling can be brought together under a common reinforcement learning framework. The key idea is that multiple sources of uncertainty impinge on reinforcement learning computations: uncertainty about the state of the environment, the parameters of the value function and the optimal action policy. Each of these sources plays a distinct role in the prefrontal cortex–basal ganglia circuit for reinforcement learning and is ultimately reflected in dopamine activity. The view that dopamine plays a central role in the encoding and updating of beliefs brings the classical prediction error theory into alignment with more recent theories of Bayesian reinforcement learning.
The ability to predict future outcomes increases the fitness of the animal. Decades of research have shown that dopamine neurons broadcast reward prediction error (RPE) signals—the discrepancy between actual and predicted reward—to drive learning to predict future outcomes. Recent studies have begun to show, however, that dopamine neurons are more diverse than previously thought. In this review, we will summarize a series of our studies that have shown unique properties of dopamine neurons projecting to the posterior “tail” of the striatum (TS) in terms of anatomy, activity, and function. Specifically, TS-projecting dopamine neurons are activated by a subset of negative events including threats from a novel object, send prediction errors for external threats, and reinforce avoidance behaviors. These results indicate that there are at least two axes of dopamine-mediated reinforcement learning in the brain—one learning from canonical RPEs and another learning from threat prediction errors. We argue that the existence of multiple learning systems is an adaptive strategy that makes possible each system optimized for its own needs. The compartmental organization in the mammalian striatum resembles that of a dopamine-recipient area in insects (mushroom body), pointing to a principle of dopamine function conserved across phyla.
Learning to predict future outcomes is critical for driving appropriate behaviors. Reinforcement learning (RL) models have successfully accounted for such learning, relying on reward prediction errors (RPEs) signaled by midbrain dopamine neurons. It has been proposed that when sensory data provide only ambiguous information about which state an animal is in, it can predict reward based on a set of probabilities assigned to hypothetical states (called the belief state). Here we examine how dopamine RPEs and subsequent learning are regulated under state uncertainty. Mice are first trained in a task with two potential states defined by different reward amounts. During testing, intermediate-sized rewards are given in rare trials. Dopamine activity is a non-monotonic function of reward size, consistent with RL models operating on belief states. Furthermore, the magnitude of dopamine responses quantitatively predicts changes in behavior. These results establish the critical role of state inference in RL.
Midbrain dopamine neurons are well known for their role in reward-based reinforcement learning. We found that the activity of dopamine axons in the posterior tail of the striatum (TS) scaled with the novelty and intensity of external stimuli, but did not encode reward value. We demonstrated that the ablation of TS-projecting dopamine neurons specifically inhibited avoidance of novel or high-intensity stimuli without affecting animals’ initial avoidance responses, suggesting a role in reinforcement rather than simply in avoidance itself. Furthermore, we found that animals avoided optogenetic activation of dopamine axons in TS during a choice task and that this stimulation could partially reinstate avoidance of a familiar object. These results suggest that TS-projecting dopamine neurons reinforce avoidance of threatening stimuli. More generally, our results indicate that there are at least two axes of reinforcement learning using dopamine in the striatum: one based on value and one based on external threat.
Animals make predictions based on currently available information. In natural settings, sensory cues may not reveal complete information, requiring the animal to infer the “hidden state” of the environment. The brain structures important in hidden state inference remain unknown. A previous study showed that midbraindopamine neurons exhibit distinct response patterns depending on whether reward is delivered in 100% (task 1) or 90% of trials (task 2) in a classical conditioning task. Here we found that inactivation of the medial prefrontal cortex (mPFC) affected dopaminergic signaling in task 2, in which the hidden state must be inferred (“will reward come or not?”), but not in task 1, where the state was known with certainty. Computational modeling suggests that the effects of inactivation are best explained by a circuit in which the mPFC conveys inference over hidden states to the dopamine system.
Parenting is essential for the survival and wellbeing of mammalian offspring. However, we lack a circuit-level understanding of how distinct components of this behaviour are coordinated. Here we investigate how galanin-expressing neurons in the medial preoptic area (MPOAGal) of the hypothalamus coordinate motor, motivational, hormonal and social aspects of parenting in mice. These neurons integrate inputs from a large number of brain areas and the activation of these inputs depends on the animal's sex and reproductive state. Subsets of MPOAGal neurons form discrete pools that are defined by their projection sites. While the MPOAGalpopulation is active during all episodes of parental behaviour, individual pools are tuned to characteristic aspects of parenting. Optogenetic manipulation of MPOAGal projections mirrors this specificity, affecting discrete parenting components. This functional organization, reminiscent of the control of motor sequences by pools of spinal cord neurons, provides a new model for how discrete elements of a social behaviour are generated at the circuit level.
Although modified rabies viruses have emerged as a powerful tool for tracing the inputs to genetically defined populations of neurons, the toxicity of the virus has limited its utility. A recent study employed a self-inactivating rabies (SiR) virus that enables recording or manipulation of targeted neurons for months.
Dopamine neurons facilitate learning by calculating reward prediction error, or the difference between expected and actual reward. Despite two decades of research, it remains unclear how dopamine neurons make this calculation. Here we review studies that tackle this problem from a diverse set of approaches, from anatomy to electrophysiology to computational modeling and behavior. Several patterns emerge from this synthesis: that dopamine neurons themselves calculate reward prediction error, rather than inherit it passively from upstream regions; that they combine multiple separate and redundant inputs, which are themselves interconnected in a dense recurrent network; and that despite the complexity of inputs, the output from dopamine neurons is remarkably homogeneous and robust. The more we study this simple arithmetic computation, the knottier it appears to be, suggesting a daunting (but stimulating) path ahead for neuroscience more generally.