Rare Rewards Amplify Dopamine Learning Responses
Kathryn M. Rothenhoefer, Tao Hong, Aydin Alikaya, William
R. Stauffer*.
Affiliations:
Center for Neuroscience, Center for the Neural Basis of
Cognition, Systems Neuroscience
Institute, The Brain Institute, University of Pittsburgh,
Pittsburgh, PA, USA.
*Correspondence to: [email protected]
Abstract:
Dopamine neurons drive learning by coding reward prediction
errors (RPEs), which are formalized
as subtractions of predicted values from reward values.
Subtractions accommodate point estimate
predictions of value, such as the average value. However, point
estimate predictions fail to capture
many features of choice and learning behaviors. For instance,
reaction times and learning rates
consistently reflect higher moments of probability distributions.
Here, we demonstrate that
dopamine RPE responses code probability distributions. We
presented monkeys with rewards that
were drawn from the tails of normal and uniform reward size
distributions to generate rare and
common RPEs, respectively. Behavioral choices and pupil
diameter measurements indicated that
monkeys learned faster and registered greater arousal from rare
RPEs, compared to common RPEs
of identical magnitudes. Dopamine neuron recordings indicated
that rare rewards amplified RPE
responses. These results demonstrate that dopamine responses
reflect probability distributions and
suggest a neural mechanism for the amplified learning and
enhanced arousal associated with rare
events.
Main Text:
Making accurate predictions is evolutionarily adaptive.
Accurate predictions enable
individuals to be in the right place at the right time, choose the
best options, and efficiently scale
the vigor of responses. Dopamine neurons are crucial for
building accurate reward predictions.
Phasic dopamine responses code for reward prediction errors:
the differences between the values
of received and predicted rewards (1-8). These signals cause
predictions to be modified through
associative and extinction learning (9, 10). Likewise, phasic
dopamine neuron stimulation during
reward delivery increases both the dopamine responses to reward
predicting cues and the choices
for those same cues (11). Although it is well understood how
predicted reward values affect
dopamine responses, these predictions are simply point
estimates – normally the average value –
of probability distributions. It is unknown how the form of
probability distributions affects
dopamine learning signals.
Probability distributions affect behavioral measures of learning
and decision making.
Learning the expected value takes longer when rewards are
sampled from broader distributions,
compared to when they are drawn from narrower distributions
(12). Likewise, even with well
learned values, decision makers take a longer time to choose
between options when the value
difference between rewards is smaller, compared to when the
difference is larger (13, 14). These
behavioral measures of learning and decision making
demonstrate that the probability distributions
over reward values, and not simply the point estimates of
values, influence learning and decision
making. Prior electrophysiological recordings have
demonstrated that dopamine signals adapt to
the range between two outcomes (15), but it is not known
whether these same dopamine signals
reflect probability distributions over reward values.
Here, we asked whether the shapes of probability distributions
were reflected in dopamine
reward prediction error responses. We created two discrete
reward size distributions that reflected,
roughly, the shapes of normal and uniform distributions. We
trained monkeys to predict rewards
drawn from these distributions. Crucially, the normal
distribution resulted in rare prediction errors
following rewards drawn from the tails of that distribution,
whereas the same rewards, with
identical prediction errors, were drawn with greater frequency
from the uniform distribution. We
found that monkeys learned to choose the better option within
fewer trials when rewards were
drawn from normal distributions, compared to when rewards
were drawn from uniform
distributions. Moreover, we found that pupil diameter was
correlated with rare prediction errors,
but uncorrelated with common prediction errors of the same
magnitude, suggesting greater
vigilance to rare outcomes. Using single neuron recording, we
show that dopamine responses
reflect the shape of the predicted reward distribution. Specifically,
rare prediction errors evoked
significantly larger responses than common prediction errors
with identical magnitudes. Together,
these results demonstrate a mechanism, complementary to but distinct from TD-like reward prediction error responses, for learning based on probability distributions.
Results
Reward size distributions
We used non-informative images (fractal pictures) to predict
rewards drawn from
differently shaped distributions. Distribution shapes were
defined according to relative reward
frequency. One fractal image predicted an equal probability of
receiving a small, medium, or large
volume of juice reward. We define this as the uniform reward
size distribution (Fig. 1A left). A
second fractal image predicted that the small and the large
reward would be given 2 out of 15
times, and the medium sized reward would be given the
remaining 11 out of 15 times (normal
reward size distribution, Fig. 1A right). Importantly, both
reward size distributions were
symmetrical and were comprised of the same three reward
magnitudes. Therefore, the uniform and
normal distributions had identical Pascalian expected values.
However, rewards drawn from the
tails of the normal distribution were rare, compared to the
frequency of identical rewards drawn
from the tails of the uniform distribution. Anticipatory licking
reflected the expected value of both
the distributions, as well as the expected value of safe cues
(Figure 1B, p = 0.019, Linear
Regression). Thus, the animals learned that the cues predicted
rewards.
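To make the two reward size distributions concrete, the sketch below (a minimal illustration, not the task code) encodes the relative frequencies from Fig. 1A and verifies that the expected values match. The volumes 0.2, 0.4, and 0.6 ml are taken from the 0.4 ml expected-value condition used later for the recording task and are placeholders for the other conditions.

```python
import numpy as np

# Reward sizes (ml) for the 0.4 ml expected-value condition (placeholder values
# for other conditions; the centers tested were 0.3, 0.4, and 0.5 ml).
sizes = np.array([0.2, 0.4, 0.6])

# Relative frequencies from Fig. 1A: uniform = 1/3 each; "normal" = 2/15, 11/15, 2/15.
p_uniform = np.array([5, 5, 5]) / 15
p_normal = np.array([2, 11, 2]) / 15

# Both distributions are symmetric around the same center, so their
# (Pascalian) expected values are identical.
ev_uniform = np.dot(sizes, p_uniform)   # 0.4
ev_normal = np.dot(sizes, p_normal)     # 0.4

# Pseudorandomly drawing a block of rewards: the tail volumes (0.2 and 0.6 ml)
# are rare under the normal distribution and common under the uniform one.
rng = np.random.default_rng(0)
block_normal = rng.choice(sizes, size=20, p=p_normal)
block_uniform = rng.choice(sizes, size=20, p=p_uniform)
```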
Dopamine neurons code prediction errors in the subjective
values of rewards (16),
according to the following equation:

Prediction error = reward subjective value − predicted subjective value.   (Eq. 1)
To investigate whether dopamine responses were sensitive to
probability distributions, we sought
to compare dopamine responses when the conventional
prediction errors, defined according to Eq. 1, were identical. Therefore, we created reward distributions
with the same expected values and
then measured the distribution of the subjective values. We
measured the subjective values of six
reward distributions – three normal reward distributions with
center locations of 0.3, 0.4, and 0.5
ml and three uniform distributions with center locations of 0.3,
0.4, and 0.5 ml. Monkeys made
choices between cues that predicted a distribution and safe
values (Fig. 1C, Methods). We plotted
the probability of choosing the safe option as a function of the
safe option volume and generated
psychometric functions (Figure 1D). We used the certainty
equivalents (CEs) – defined as the safe
volume at the point of subjective equivalence between the two
options (Fig. 1D, vertical dashed
line) – as a measure of the distribution's subjective value. Analysis of variance (ANOVA) on the
Analysis of variance (ANOVA) on the
CEs for the normal and uniform distribution cues indicated a
significant effect of EV on CE (F =
71.81, p < 0.001, ANOVA). Crucially, we found no significant
effect of the distribution type on
the CEs (F = 3.57, p = 0.07, ANOVA). To increase our power to
detect small differences in
subjective value, we combined the CE data from all three EVs,
yet still found no significant
difference between the subjective values of the normal and
uniform distributions (p = 0.20, two-
sample t-test). Therefore, within the limited range of reward
sizes we used, the data indicate that
the normal and uniform reward size distributions had similar
subjective values. These results
indicated that the prediction errors generated from the
distributions could be readily compared and
ensured that disparities between prediction error responses were
not driven by differences in the
predicted subjective values.
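As an illustration of how a certainty equivalent can be read off a psychometric function, here is a minimal sketch assuming a logistic choice function and hypothetical choice data; the actual fitting procedure and data are in the Supplementary Methods.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Probability of choosing the safe option as a function of its volume x.
    x0 is the point of subjective equivalence (the certainty equivalent, CE)."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical choice data: 9 safe-option volumes (ml) tested against a
# distribution-predicting cue, and the fraction of trials on which the
# safe option was chosen.
safe_volumes = np.linspace(0.1, 0.7, 9)
p_choose_safe = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.85, 0.95, 1.00])

(ce, slope), _ = curve_fit(logistic, safe_volumes, p_choose_safe, p0=[0.4, 10.0])
print(f"certainty equivalent ~ {ce:.3f} ml")  # CE = safe volume at P(choose safe) = 0.5
```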
Figure 1. Reward size distributions. A. Fractal images
predicted that one of three juice reward sizes would be drawn
from a uniform (left) or normal (right) reward size distribution.
The uniform distribution predicted that each size would be
drawn with equal relative frequency (R.F.) (1/3 for small,
medium, and large sized rewards). The normal distribution
predicted that the small and large reward volumes would be
drawn 2/15 times, and the medium reward volume would be
drawn the remaining 11/15 times. B. Anticipatory licking
indicated that the monkeys learned the predicted reward
values. Black dots indicate the normalized licking duration
data for predicted reward volume, and the grey dotted line
indicates the linear fit to the data. Error bars indicate SEM
across sessions. C. The choice task used to measure subjective
value. Animals made saccade-
directed choices between a distribution predicting cue and a
safe alternative option. The safe
alternative option was a value bar with a minimum and
maximum of 0 and 0.8 ml at the bottom
and top, respectively. The intersection between the horizontal
bar and the scale indicated the
volume of juice that would be received if monkeys selected the
safe cue. D. Probability of choosing
the safe cue as a function of the value of the safe option, when
the distribution predicting cue had
an expected value (EV) of 0.4ml. Dots show average choice
probability for 9 safe value options
for monkey B (left), and Monkey S (right). Solid lines are a
logistic fit to the data. Red indicates
data from normal distribution blocks, grey indicates data from
uniform distribution blocks. The
dashed horizontal lines indicate subjective equivalence, and the
CE for each distribution type is
indicated with the dashed vertical lines.
Distribution shape affects learning
We next investigated how probability distributions affected
learning (Fig. 2A). Each
learning block consisted of 15-25 trials and used two, never-
before-seen cues (Supplementary
Methods). Both cues predicted either normal or uniform
distributions and had two different
expected values (EVs). The animals had to learn the relative
EVs to maximize reward. We
analyzed behavioral choice data from two animals across 16
sessions, each of which included a
mixture of normal and uniform blocks (Supplementary
Methods). As expected, the probability of
choosing the higher value option on the first trial of each block
was not significantly different from
chance (Fig. 2B, left). Comparison between the first and
fifteenth trial of each block revealed that
the animals learned to choose the higher value option on 87%
and 85% of trials, in the normal and
uniform blocks, respectively (Fig. 2B, p < 0.001, paired t-test
for both distributions). There was
no significant difference in the average performance between
the fifteenth trials of normal and
uniform blocks (Fig. 2B, right, p = 0.7, t-test). Together, these
results demonstrate that the animals
learned which option had a higher EV.
We hypothesized that the animals would learn faster from
reward sizes sampled from the
normal distribution compared to the uniform distribution. We
used a reinforcement learning (RL)
model to quantify the learned values (Supplementary Methods).
Our model, fit to the behavioral
choices, performed well at predicting the true values of the two
choice options (Fig. 2C). To
differentiate active learning from stable, asymptotic-like
behavior, we fit a logarithmic function to
the estimated prediction errors (Fig. 2D). When the change in
the log-fitted values went below a
robust threshold, we considered the values to be learned
(Supplementary Methods). Using this
approach, we determined the block-wise number of trials needed
to learn. We found that, on
average, animals needed 4 additional trials to learn in the
uniform blocks, compared to the normal
blocks (Fig. 2E, p < 0.001, Mann-Whitney U test). These data
demonstrate that the monkeys
learned faster from normal reward distributions, compared to
uniform reward distributions. Thus,
rare prediction errors have greater behavioral relevance than
common prediction errors of the same
magnitude.
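The RL model used for this analysis is specified in the Supplementary Methods. As a minimal sketch of the general approach, the code below implements a simple Rescorla-Wagner learner with softmax choice and a log-fit criterion for declaring a block "learned"; the learning rate, temperature, and threshold are illustrative placeholders, not the fitted parameters.

```python
import numpy as np

def simulate_block(rewards_a, rewards_b, alpha=0.3, beta=10.0, rng=None):
    """Rescorla-Wagner learner with softmax choice for one learning block.
    rewards_a[t] / rewards_b[t]: reward volume delivered if option A / B is
    chosen on trial t. alpha and beta are illustrative, not fitted, values."""
    rng = rng or np.random.default_rng()
    q = np.zeros(2)                                       # learned values of the two cues
    abs_rpes = []
    for t in range(len(rewards_a)):
        p_a = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))  # softmax over two options
        choice = 0 if rng.random() < p_a else 1
        reward = rewards_a[t] if choice == 0 else rewards_b[t]
        rpe = reward - q[choice]                           # prediction error (Eq. 1, objective values)
        q[choice] += alpha * rpe                           # value update
        abs_rpes.append(abs(rpe))
    return np.array(abs_rpes)

def trials_to_learn(abs_rpes, threshold=0.01):
    """Fit |RPE| ~ a*log(trial) + b and call the block 'learned' once the
    trial-to-trial change of the fitted curve falls below a threshold."""
    t = np.arange(1, len(abs_rpes) + 1)
    a, b = np.polyfit(np.log(t), abs_rpes, 1)
    fitted = a * np.log(t) + b
    below = np.where(np.abs(np.diff(fitted)) < threshold)[0]
    return below[0] + 1 if len(below) else len(abs_rpes)
```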
Figure 2. Faster learning from normal reward
distributions. A. Animals made choices between two never-
before-seen fractal images with different expected values
(EVs), but the same distribution type (normal or uniform). B.
Box and whisker plots show the probability of choosing the
higher-valued option on trials 1 and 15, separated by normal
(red) and uniform (grey) blocks. Triangles represent the
averages. n.s. = not significantly different. C. Actual (blue) and
estimated (orange) value differences for two choice options.
The primary y-axis shows the EV differences between the two
choice options, and the x-axis shows trial number. The block-
wise changes in the true value difference are reflected by the
model. The black tick marks correspond to correct and
incorrect choices, defined by the relative expected values, and
indicated by the secondary y-axis.
D. Fitted prediction errors as a function of trial within normal
(red) and uniform (grey) blocks.
Dashed line indicates the threshold for designating that block
values have been learned. Dots
indicate the transition from the learning to the stable phase. E.
Box and whisker plot showing the
number of trials in the learning phase for normal and uniform
distributions. The plot follows the
same format as panel B.
To investigate autonomic responses to prediction errors, we
analyzed the pupil responses
during the choice task (Fig. 3A). We used deconvolution to
separate the effects of distinct trial
events on pupil responses (Supplementary Methods). The
deconvolution analysis indicated that
the correlation between pupil responses and RPE was higher
during the learning phase of normal
blocks, compared to the learning phase of uniform blocks (Fig.
3B, p = 0.014, Wilcoxon rank-sum
test). There were no significant differences in the correlation
during the stable phases of normal
and uniform blocks (Fig. 3B, p = 0.78, Wilcoxon rank-sum test).
Thus, pupil responses indicated
greater arousal to rare prediction errors during learning.
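The deconvolution procedure is described in the Supplementary Methods. The sketch below illustrates one standard way to separate overlapping event-related pupil responses, a finite-impulse-response (FIR) deconvolution solved by least squares; the event names and kernel length are assumptions for illustration, not the analysis code.

```python
import numpy as np

def deconvolve_pupil(pupil, event_onsets, n_lags):
    """FIR-style deconvolution: estimate a separate pupil response kernel for
    each trial event (e.g. cue, choice, reward) by least squares, so that
    temporally overlapping responses can be disentangled.
    pupil: 1-D normalized trace; event_onsets: dict of event name -> sample
    indices; n_lags: kernel length in samples."""
    n = len(pupil)
    names = list(event_onsets)
    X = np.zeros((n, n_lags * len(names)))
    for j, name in enumerate(names):
        for onset in event_onsets[name]:
            for lag in range(n_lags):
                if onset + lag < n:
                    X[onset + lag, j * n_lags + lag] = 1.0
    beta, *_ = np.linalg.lstsq(X, pupil, rcond=None)
    return {name: beta[j * n_lags:(j + 1) * n_lags] for j, name in enumerate(names)}
```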
Figure 3. Pupil diameter was correlated with reward
prediction error during learning of normal distributions. A.
Averaged normalized pupil diameter response (light grey) to
different epochs during behavior. B. Box and whisker plots
showing the correlation between pupil diameter and RPE, in
the learning and stable phases of normal (red) and uniform
(grey) blocks. The plot follows the
same format as Figure 2B.
Dopamine responses to rare rewards
After showing that probability distributions were reflected in
learning behavior and
autonomic responses, we investigated the neural correlates of
probability distributions. To do so,
we recorded extracellular dopamine neuron action potentials in
a passive viewing task wherein
monkeys viewed distribution-predicting cues and received
rewards (Supplementary Methods).
Dopamine neurons were similarly activated by both distribution
predicting cues, as would be
expected from responses to cues that had similar values (Fig.
4A). At the time of reward delivery,
dopamine neurons are activated or suppressed by rewards that
are better or worse than predicted,
respectively. Therefore, we expected dopamine neurons to be
activated by delivery of 0.6 ml and
suppressed by delivery of 0.2 ml. Surprisingly, dopamine
activations to delivery of 0.6 ml were
larger in normal distribution trials, compared to dopamine
activations following delivery of the
same reward volume in uniform distribution trials (Fig. 4B, solid
lines). Similarly, dopamine
responses were more strongly suppressed by delivery of 0.2 ml
reward during normal distribution
trials, compared to response suppressions to the same reward
during uniform distribution trials
(Fig. 4B, dashed lines). As the reward predicting cues had the
same subjective values, this response
amplification occurred despite the fact that the conventional
reward prediction errors were
identical (Eq. 1). Moreover, because the amplification was bi-
directional – both activations and
suppressions were amplified in single neurons – this effect
could not be attributed to differences
in predicted subjective values. Thus, the effects we observed
were robust to errors in the
measurement of subjective value.
We sought to quantify the response amplification (Fig. 4B)
across neurons. The dynamic
ranges of dopamine responses are larger in the positive domain,
compared to the negative domain.
Therefore, we transformed the responses across the positive and
negative domains onto a linear
scale (Supplementary Methods). We then used linear regression
to measure the response slopes of
each neuron to positive and negative RPEs generated during
normal and uniform distribution trials.
We plotted the response slopes for each neuron in both
conditions (Fig. 4C). A majority of neurons
had a steeper slope for rare RPEs generated during normal
distribution trials, compared to common
RPEs generated during uniform distribution trials (Fig. 4C
inset, p = 0.001, n = 60 neurons, one-
sample t-test). To ensure that the amplification effect was bi-
directional and not driven solely by
positive responses, we separately analyzed the positive and
negative prediction error responses for
the 39 neurons above the unity line in Figure 4C. Across this
subpopulation, both activations and
suppressions were significantly amplified (Fig. 4D, negative
and positive prediction error
responses p < 0.01 and 0.001, respectively, Wilcoxon rank-sum
test). There were no significant
differences in the positive or negative responses in the
subpopulation below the unity line (p >
0.05, n = 21, Wilcoxon rank-sum test). Population peri-
stimulus time histograms (PSTHs) for the
amplified subpopulation show differences in both the positive
and negative response domains (Fig. 4E). Thus, phasic dopamine responses are amplified by rare
RPEs. These data demonstrate that
RPE responses are sensitive to predicted probability
distributions.
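A minimal sketch of the per-neuron slope analysis follows, assuming the responses have already been transformed onto a linear scale across the positive and negative domains as described above; `rpe`, `response`, and `condition` are hypothetical per-trial arrays, not the recorded data.

```python
import numpy as np
from scipy import stats

def condition_slopes(rpe, response, condition):
    """Per-neuron regression of the reward response on the signed, linearized
    RPE, fit separately for normal- and uniform-distribution trials."""
    slopes = {}
    for label in ("normal", "uniform"):
        mask = condition == label
        slope, *_ = stats.linregress(rpe[mask], response[mask])
        slopes[label] = slope
    return slopes

# Across neurons, a one-sample test on the slope differences asks whether rare
# RPEs (normal blocks) evoke steeper response functions than common RPEs:
#   diffs = np.array([s["normal"] - s["uniform"] for s in per_neuron_slopes])
#   t, p = stats.ttest_1samp(diffs, 0.0)
```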
Figure 4. Amplified dopamine reward prediction error responses to rare rewards. A. Population
PSTHs of conditioned
stimuli (CS) responses to
the normal and uniform
distribution predicting
cues. There was no
significant difference
between the population
responses (p > 0.6 for
both the early and late response components, indicated on the x-
axis by the light and dark grey
bars, respectively). B. Single neuron reward responses from
both monkeys. Top: PSTHs show
impulse rate as a function of time, aligned to the time of reward
(black vertical bar). Light grey
bars along the x-axis indicate the response window used for
analysis. Bottom: Raster plots,
separated by normal and uniform distributions and by reward
sizes 0.2 and 0.6 ml. Every line
represents the time of an action potential, and every row
represents a trial. C. Reward response
slopes for every neuron in normal and uniform distribution
trials. Each dot represents one neuron
(n = 60 neurons), and the two circled neurons are the example
neurons in B. Inset shows the data
as a function of the difference from the unity line (diagonal). D.
Change in normalized impulse
rate from baseline in normal and uniform distribution trials, for
negative RPE responses (left), and positive RPE responses (right). Error bars indicate the standard error of the mean (SEM) across 39
neurons above the unity line. E. Population PSTH for neurons
that showed amplified responses to
the same rewards in the normal and uniform distribution trials.
In panels A, B, D, and E, red and
grey colors indicate normal and uniform distributions,
respectively. In A, B, and E, solid PSTHs
represent positive prediction error responses, and dashed PSTHs
represent negative prediction
error responses.
The passive viewing task used to collect the neural data (Fig. 4)
offers the greatest level of
control, because during each trial only one reward-predictive
cue is shown to the animals and no
choices are required. To validate that the neuronal sensitivity to
probability distributions was also
present during complex behaviors, we recorded an additional 8
and 12 neurons from monkey B
and S, respectively, while the animals learned the predicted
values and made choices (Fig. 1C).
We used the trial-wise RPE’s derived from the model to
categorize each response as a positive or
negative prediction error response (Supplementary Methods).
We found that positive and negative
RPEs elicited amplified neural responses in normal distribution
trials, compared to uniform
distributions trials (Fig. 5A). The amplification effect was
significant in the population (Fig. 5B,
negative RPE: p = 0.019; positive RPE: p = 0.001, n = 20
neurons, Wilcoxon rank-sum test). These
findings confirm that dopamine responses are sensitive to
probability distributions during complex
behaviors and demonstrate that amplified dopamine responses
can be used to guide active learning
and decision making.
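As a sketch of how model-derived RPEs can be used to categorize responses and compare them across distribution types (the exact procedure is in the Supplementary Methods), assuming hypothetical per-trial arrays of model RPEs, baseline-subtracted rates, and block labels:

```python
import numpy as np
from scipy import stats

def compare_rpe_responses(model_rpe, delta_rate, condition):
    """Split reward responses by the sign of the model-derived RPE and compare
    baseline-subtracted rates between normal and uniform trials (rank-sum)."""
    results = {}
    for sign, sign_mask in (("positive", model_rpe > 0), ("negative", model_rpe < 0)):
        normal = delta_rate[sign_mask & (condition == "normal")]
        uniform = delta_rate[sign_mask & (condition == "uniform")]
        results[sign] = stats.ranksums(normal, uniform)
    return results
```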
Figure 5. Amplification in dopamine neurons persists in learning conditions. A. Single neuron PSTHs aligned to
reward (black vertical line) show amplified responses to
positive and negative RPEs in normal distribution trials,
compared to uniform distribution trials. Solid PSTHs indicate
positive prediction error responses,
and dashed PSTHs are negative prediction error responses. The
light grey bar along the x-axis
indicates the response window used for analysis. B. Change in
normalized impulse rate from
baseline in normal and uniform conditions, for negative RPE
responses (left), and positive RPE responses (right). In both panels, red and grey colors indicate
normal and uniform distributions, respectively.
Discussion
Here we show that dopamine-dependent learning behavior and
dopamine reward prediction
error responses reflect probability distributions. Rare
prediction errors – compared to commonly
occurring prediction errors of the same magnitude – evoke
faster learning, increased autonomic
arousal, and amplified neural learning signals. Together, these
data reveal a novel computational
paradigm for phasic dopamine responses that is distinct from,
but complementary to, conventional
reward prediction errors. Rare events are often highly
significant (17). Our data show decision
makers exhibit greater vigilance towards rare rewards and learn
more from rare reward prediction
errors. Amplified dopamine prediction error responses provide a
mechanistic account for these
behavioral effects.
More than 20 years ago, single unit recordings demonstrated the
importance of
unpredictability for dopamine neuron responses (2). Since then,
the successful application of TD
learning theory to dopamine signals has largely subsumed the
role of unpredictability, and recast
it within the framework of value-based prediction errors (18,
19). In most experimental paradigms,
reward unpredictability is captured by frequentist probability of
rewards and, therefore,
unpredictability is factored directly into expected value (5, 7,
20-22). The resulting Pascalian
expected values are learned by TD models (23) and reflected in
dopamine responses (5, 20, 22).
Here, we dissociated unpredictability from expected value by
pseudorandomly drawing rewards
from symmetric probability distributions with equal values.
Monkeys learned faster from the more
unpredictable rewards drawn from the tails of normal
distributions. Likewise, dopamine responses
were amplified by greater unpredictability, even when the
conventional prediction error described
in Eq. 1 was identical. These data reinforce the importance of
unpredictability for dopamine
responses and learning.
The amplified dopamine responses to rare rewards suggest that
reinforcement learning
approaches that acquire only the average value of past outcomes
are insufficient to describe the
phasic activity of dopamine neurons. In fact, dopamine
responses distinguish between predicted
reward distributions, even when the average expected values
and average subjective values of
those distributions are identical. Consequently, monkeys and
their dopamine neurons appear to
learn probability distributions, and this information is used to
boost their performance and gain
more reward. Therefore, these data suggest that an updated RL
algorithm that learns higher
moments of probability distributions, such as Kalman TD (24),
will provide the proper conceptual
framework to explain information processing in dopamine
responses.
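As one concrete illustration of an uncertainty-tracking update of the kind referenced above, a scalar Kalman-filter value learner scales its effective learning rate by the estimated uncertainty rather than using a fixed constant. This is a toy sketch with placeholder noise parameters, not the Kalman TD model of (24).

```python
import numpy as np

def kalman_value_update(mean, var, reward, obs_noise=0.01, process_noise=0.001):
    """One trial of a scalar Kalman-filter value learner. The learner tracks a
    mean value and its uncertainty; the effective learning rate (the Kalman
    gain) is set by that uncertainty rather than by a fixed constant.
    Noise parameters are illustrative placeholders."""
    var = var + process_noise          # uncertainty grows between observations
    gain = var / (var + obs_noise)     # Kalman gain = uncertainty-dependent learning rate
    rpe = reward - mean                # conventional prediction error (Eq. 1)
    mean = mean + gain * rpe           # update scaled by the gain, not a fixed alpha
    var = (1.0 - gain) * var           # posterior uncertainty shrinks after the observation
    return mean, var
```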
Distribution learning is critical to Bayesian inference and,
indeed, the results that we show
here are consistent with a signal that could guide Bayesian
inference to optimize choices and
maximize rewards. However, further experimentation is
required to understand whether dopamine
signals actually support Bayesian inference. Distribution
learning is also emerging as an important
approach in machine learning and artificial intelligence (AI)
(25). Biological learning signals have
inspired deep reinforcement learning algorithms with
performance that exceeds expert human
performance on Atari games, chess, and Go (26, 27). The
distribution-sensitive neural signals that
we observed here could offer further guidance to computational
learning; emphasizing the impact
of rare events drawn from the tails of reward distributions could
restrict the necessary search space.
Dopamine reward prediction error signals participate in Hebbian
learning (28, 29), and
these signals are likely responsible for updating action values
stored in the striatum (30). The
amplified dopamine responses, coupled with the faster learning
dynamics observed here, suggest
that the magnitude of dopamine release affected cellular
learning mechanisms in the striatum.
Moreover, amplified dopamine responses have the ability to
modulate dopamine concentrations in
the prefrontal cortex (PFC). The level of PFC dopamine is
tightly linked to neuronal signaling and
working memory performance (31). Therefore, amplification of
dopamine could explain the
exaggerated salience of real – and possibly imagined – rare
events, and suggests a neural
mechanism to explain aberrant learning observed in mental
health disorders, such as
schizophrenia.
Conclusion:
Dopamine neurons code reward prediction errors (RPEs), which
are classically defined as
the differences between received and predicted rewards.
However, our data show that predicted
probability distributions, rather than just the predicted average
values, affect dopamine responses.
Specifically, rare rewards drawn from normal distributions
amplified dopamine responses,
compared to the same rewards drawn more frequently from
uniform distributions. Crucially, the
classically defined RPEs had identical magnitudes, following
rare and common rewards. From a
behavioral perspective, rare prediction errors often signal
underlying changes in the environment
and therefore demand greater vigilance. From a theoretical
perspective, this result demonstrates
that biologically inspired reinforcement learning algorithms
should account for full probability
distributions, rather than just point estimates.
References and Notes
1. T. Ljungberg, P. Apicella, W. Schultz, Responses of monkey
dopamine neurons during
learning of behavioral reactions. J Neurophysiol 67, 145-163
(1992).
2. J. Mirenowicz, W. Schultz, Importance of unpredictability for
reward responses in
primate dopamine neurons. J Neurophysiol 72, 1024-1027
(1994).
3. J. Mirenowicz, W. Schultz, Preferential activation of
midbrain dopamine neurons by
appetitive rather than aversive stimuli. Nature 379, 449-451
(1996).
4. P. Waelti, A. Dickinson, W. Schultz, Dopamine responses
comply with basic
assumptions of formal learning theory. Nature 412, 43-48
(2001).
5. C. D. Fiorillo, P. N. Tobler, W. Schultz, Discrete Coding of
Reward Probability and
Uncertainty by Dopamine Neurons. Science 299, 1898-1902
(2003).
6. H. M. Bayer, B. Lau, P. W. Glimcher, Statistics of midbrain
dopamine neuron spike
trains in the awake primate. J Neurophysiol 98, 1428-1439
(2007).
7. H. Nakahara, H. Itoh, R. Kawagoe, Y. Takikawa, O.
Hikosaka, Dopamine neurons can
represent context-dependent prediction error. Neuron 41, 269-
280 (2004).
8. E. S. Bromberg-Martin, M. Matsumoto, S. Hong, O.
Hikosaka, A pallidus-habenula-
dopamine pathway signals inferred stimulus values. J
Neurophysiol 104, 1068-1076
(2010).
9. E. E. Steinberg et al., A causal link between prediction
errors, dopamine neurons and
learning. Nat Neurosci 16, 966-973 (2013).
10. C. Y. Chang et al., Brief optogenetic inhibition of dopamine
neurons mimics endogenous
negative reward prediction errors. Nat Neurosci 19, 111-116
(2016).
11. W. R. Stauffer et al., Dopamine Neuron-Specific
Optogenetic Stimulation in Rhesus
Macaques. Cell 166, 1564-1571 e1566 (2016).
12. K. M. J. Diederen, W. Schultz, Scaling prediction errors to
reward variability benefits
error-driven learning in humans. Journal of Neurophysiology
114, 1628-1640 (2015).
13. I. Krajbich, C. Armel, A. Rangel, Visual fixations and the
computation and comparison
of value in simple choice. Nat Neurosci 13, 1292-1298 (2010).
14. M. Milosavljevic, J. Malmaud, A. Huth, C. Koch, A.
Rangel, The Drift Diffusion Model
can account for the accuracy and reaction time of value-based
choices under high and low
time pressure. Judgm Decis Mak 5, 437-449 (2010).
15. P. N. Tobler, C. D. Fiorillo, W. Schultz, Adaptive coding of
reward value by dopamine
neurons. Science 307, 1642-1645 (2005).
16. W. R. Stauffer, A. Lak, W. Schultz, Dopamine reward
prediction error responses reflect
marginal utility. Curr Biol 24, 2491-2500 (2014).
17. N. N. Taleb, The Black Swan: The Impact of the Highly
Improbable. (Random House
Publishing Group, 2007).
18. W. Schultz, P. Dayan, P. R. Montague, A neural substrate of
prediction and reward.
Science 275, 1593-1599 (1997).
19. P. R. Montague, P. Dayan, T. J. Sejnowski, A framework for
mesencephalic dopamine
systems based on predictive Hebbian learning. J Neurosci 16,
1936-1947 (1996).
20. A. Lak, W. R. Stauffer, W. Schultz, Dopamine neurons learn
relative chosen value from
probabilistic rewards. Elife 5, (2016).
21. M. Matsumoto, O. Hikosaka, Two types of dopamine neuron
distinctly convey positive
and negative motivational signals. Nature 459, 837-841 (2009).
22. G. Morris, A. Nevet, D. Arkadir, E. Vaadia, H. Bergman,
Midbrain dopamine neurons
encode decisions for future action. Nat Neurosci 9, 1057-1063
(2006).
23. R. Sutton, A. Barto, Reinforcement Learning: An
Introduction. (MIT Press, 1998).
24. S. J. Gershman, A Unifying Probabilistic View of
Associative Learning. PLoS Comput
Biol 11, e1004567 (2015).
25. M. G. Bellemare, W. Dabney, R. Munos, paper presented at
the Proceedings of the 34th
International Conference on Machine Learning - Volume 70,
Sydney, NSW, Australia,
2017.
26. V. Mnih et al., Human-level control through deep
reinforcement learning. Nature 518,
529-533 (2015).
27. D. Silver et al., A general reinforcement learning algorithm
that masters chess, shogi, and
Go through self-play. Science 362, 1140-1144 (2018).
28. Z. Brzosko, S. Zannone, W. Schultz, C. Clopath, O. Paulsen,
Sequential neuromodulation
of Hebbian plasticity offers mechanism for effective reward-
based navigation. Elife 6,
(2017).
29. W. Shen, M. Flajolet, P. Greengard, D. J. Surmeier,
Dichotomous dopaminergic control
of striatal synaptic plasticity. Science 321, 848-851 (2008).
30. K. Samejima, Y. Ueda, K. Doya, M. Kimura, Representation
of action-specific reward
values in the striatum. Science 310, 1337-1340 (2005).
31. S. Vijayraghavan, M. Wang, S. G. Birnbaum, G. V.
Williams, A. F. Arnsten, Inverted-U
dopamine D1 receptor actions on prefrontal neurons engaged in
working memory. Nat
Neurosci 10, 376-384 (2007).
Acknowledgments: The authors would like to thank Andreea
Bostan for thoughtful comments
and discussion of the manuscript, and Jacquelyn Breter for her
work in animal care and
enrichment for this project. Funding: This research was
supported by University of Pittsburgh
Brain Institute (WRS), and NIH DP2MH113095 (WRS). Author
contributions: KMR and
WRS conceptualized and designed the experiments. KMR, AA,
and WRS collected the data.
KMR, TH, and WRS analyzed the data. All authors discussed
the results of the experiment.
KMR, TH, and WRS wrote the manuscript. All authors revised
the manuscript. Competing
interests: The authors declare no competing interests. Data and
materials availability:
Available upon request.
Supplementary Materials:
Materials and Methods
Figures S1-S2
References

Similar to 1 Rare Rewards Amplify Dopamine Learning Responses

Neuroeconomic Class
Neuroeconomic ClassNeuroeconomic Class
Neuroeconomic Class
tkvaran
 
Review Z Test Ci 1
Review Z Test Ci 1Review Z Test Ci 1
Review Z Test Ci 1
shoffma5
 
Statistics orientation
Statistics orientationStatistics orientation
Statistics orientation
darrincoe
 

Similar to 1 Rare Rewards Amplify Dopamine Learning Responses (20)

Analyzing quantitative data
Analyzing quantitative dataAnalyzing quantitative data
Analyzing quantitative data
 
Analysis of variance
Analysis of varianceAnalysis of variance
Analysis of variance
 
Inferential statistics
Inferential  statisticsInferential  statistics
Inferential statistics
 
Neuroeconomic Class
Neuroeconomic ClassNeuroeconomic Class
Neuroeconomic Class
 
Movie Ticket Post
Movie Ticket PostMovie Ticket Post
Movie Ticket Post
 
Multiple sample test - Anova, Chi-square, Test of association, Goodness of Fit
Multiple sample test - Anova, Chi-square, Test of association, Goodness of Fit Multiple sample test - Anova, Chi-square, Test of association, Goodness of Fit
Multiple sample test - Anova, Chi-square, Test of association, Goodness of Fit
 
hypothesis testing.pptx
hypothesis testing.pptxhypothesis testing.pptx
hypothesis testing.pptx
 
Statistical Significance Tests.pptx
Statistical Significance Tests.pptxStatistical Significance Tests.pptx
Statistical Significance Tests.pptx
 
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
 
Normal and standard normal distribution
Normal and standard normal distributionNormal and standard normal distribution
Normal and standard normal distribution
 
non para.doc
non para.docnon para.doc
non para.doc
 
4_5875144622430228750.docx
4_5875144622430228750.docx4_5875144622430228750.docx
4_5875144622430228750.docx
 
Hypothesis Testing. Inferential Statistics pt. 2
Hypothesis Testing. Inferential Statistics pt. 2Hypothesis Testing. Inferential Statistics pt. 2
Hypothesis Testing. Inferential Statistics pt. 2
 
Parametric tests
Parametric  testsParametric  tests
Parametric tests
 
Analysis of variance
Analysis of varianceAnalysis of variance
Analysis of variance
 
Lec 4 random sampling
Lec 4 random samplingLec 4 random sampling
Lec 4 random sampling
 
Biostatistics ii4june
Biostatistics ii4juneBiostatistics ii4june
Biostatistics ii4june
 
Review Z Test Ci 1
Review Z Test Ci 1Review Z Test Ci 1
Review Z Test Ci 1
 
Statistics orientation
Statistics orientationStatistics orientation
Statistics orientation
 
Basics of Sample Size Estimation
Basics of Sample Size EstimationBasics of Sample Size Estimation
Basics of Sample Size Estimation
 

More from AbbyWhyte974

1. Use Postman” to test API at  httpspostman-echo.coma. Use
1. Use Postman” to test API at  httpspostman-echo.coma. Use1. Use Postman” to test API at  httpspostman-echo.coma. Use
1. Use Postman” to test API at  httpspostman-echo.coma. Use
AbbyWhyte974
 
1. Use the rubric to complete the assignment and pay attention t
1. Use the rubric to complete the assignment and pay attention t1. Use the rubric to complete the assignment and pay attention t
1. Use the rubric to complete the assignment and pay attention t
AbbyWhyte974
 
1. True or false. Unlike a merchandising business, a manufacturing
1. True or false. Unlike a merchandising business, a manufacturing1. True or false. Unlike a merchandising business, a manufacturing
1. True or false. Unlike a merchandising business, a manufacturing
AbbyWhyte974
 
1. Top hedge fund manager Sally Buffit believes that a stock with
1. Top hedge fund manager Sally Buffit believes that a stock with 1. Top hedge fund manager Sally Buffit believes that a stock with
1. Top hedge fund manager Sally Buffit believes that a stock with
AbbyWhyte974
 
1. This question is on the application of the Binomial option
1. This question is on the application of the Binomial option1. This question is on the application of the Binomial option
1. This question is on the application of the Binomial option
AbbyWhyte974
 
1. Tiktaalik httpswww.palaeocast.comtiktaalikW
1. Tiktaalik        httpswww.palaeocast.comtiktaalikW1. Tiktaalik        httpswww.palaeocast.comtiktaalikW
1. Tiktaalik httpswww.palaeocast.comtiktaalikW
AbbyWhyte974
 
1. This week, we learned about the balanced scorecard and dashboar
1. This week, we learned about the balanced scorecard and dashboar1. This week, we learned about the balanced scorecard and dashboar
1. This week, we learned about the balanced scorecard and dashboar
AbbyWhyte974
 
1. The company I chose was Amazon2.3.4.1) Keep i
1. The company I chose was Amazon2.3.4.1) Keep i1. The company I chose was Amazon2.3.4.1) Keep i
1. The company I chose was Amazon2.3.4.1) Keep i
AbbyWhyte974
 
1. Think about a persuasive speech that you would like to present
1. Think about a persuasive speech that you would like to present 1. Think about a persuasive speech that you would like to present
1. Think about a persuasive speech that you would like to present
AbbyWhyte974
 
1. The two properties about a set of measurements of a dependent v
1. The two properties about a set of measurements of a dependent v1. The two properties about a set of measurements of a dependent v
1. The two properties about a set of measurements of a dependent v
AbbyWhyte974
 
1. The Danube River flows through 10 countries. Name them in the s
1. The Danube River flows through 10 countries. Name them in the s1. The Danube River flows through 10 countries. Name them in the s
1. The Danube River flows through 10 countries. Name them in the s
AbbyWhyte974
 
1. The 3 genes that you will compare at listed below. Take a look.
1. The 3 genes that you will compare at listed below. Take a look.1. The 3 genes that you will compare at listed below. Take a look.
1. The 3 genes that you will compare at listed below. Take a look.
AbbyWhyte974
 
1. Student and trainer detailsStudent details Full nameStu
1. Student and trainer detailsStudent details  Full nameStu1. Student and trainer detailsStudent details  Full nameStu
1. Student and trainer detailsStudent details Full nameStu
AbbyWhyte974
 
1. Student uses MS Excel to calculate income tax expense or refund
1. Student uses MS Excel to calculate income tax expense or refund1. Student uses MS Excel to calculate income tax expense or refund
1. Student uses MS Excel to calculate income tax expense or refund
AbbyWhyte974
 
1. Socrates - In your view, what was it about Socrates’ teachings
1. Socrates - In your view, what was it about Socrates’ teachings 1. Socrates - In your view, what was it about Socrates’ teachings
1. Socrates - In your view, what was it about Socrates’ teachings
AbbyWhyte974
 
1. Select a patient” (friend or family member) on whom to perform
1. Select a patient” (friend or family member) on whom to perform1. Select a patient” (friend or family member) on whom to perform
1. Select a patient” (friend or family member) on whom to perform
AbbyWhyte974
 
1. Respond to your classmates’ question and post. Submission to y
1. Respond to your classmates’ question and post.  Submission to y1. Respond to your classmates’ question and post.  Submission to y
1. Respond to your classmates’ question and post. Submission to y
AbbyWhyte974
 
1. Review the HCAPHS survey document, by clicking on the hyperlink
1. Review the HCAPHS survey document, by clicking on the hyperlink1. Review the HCAPHS survey document, by clicking on the hyperlink
1. Review the HCAPHS survey document, by clicking on the hyperlink
AbbyWhyte974
 
1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea
1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea
1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea
AbbyWhyte974
 

More from AbbyWhyte974 (20)

1. Use Postman” to test API at  httpspostman-echo.coma. Use
1. Use Postman” to test API at  httpspostman-echo.coma. Use1. Use Postman” to test API at  httpspostman-echo.coma. Use
1. Use Postman” to test API at  httpspostman-echo.coma. Use
 
1. Use the rubric to complete the assignment and pay attention t
1. Use the rubric to complete the assignment and pay attention t1. Use the rubric to complete the assignment and pay attention t
1. Use the rubric to complete the assignment and pay attention t
 
1. True or false. Unlike a merchandising business, a manufacturing
1. True or false. Unlike a merchandising business, a manufacturing1. True or false. Unlike a merchandising business, a manufacturing
1. True or false. Unlike a merchandising business, a manufacturing
 
1. Top hedge fund manager Sally Buffit believes that a stock with
1. Top hedge fund manager Sally Buffit believes that a stock with 1. Top hedge fund manager Sally Buffit believes that a stock with
1. Top hedge fund manager Sally Buffit believes that a stock with
 
1. This question is on the application of the Binomial option
1. This question is on the application of the Binomial option1. This question is on the application of the Binomial option
1. This question is on the application of the Binomial option
 
1. Tiktaalik httpswww.palaeocast.comtiktaalikW
1. Tiktaalik        httpswww.palaeocast.comtiktaalikW1. Tiktaalik        httpswww.palaeocast.comtiktaalikW
1. Tiktaalik httpswww.palaeocast.comtiktaalikW
 
1. This week, we learned about the balanced scorecard and dashboar
1. This week, we learned about the balanced scorecard and dashboar1. This week, we learned about the balanced scorecard and dashboar
1. This week, we learned about the balanced scorecard and dashboar
 
1. The company I chose was Amazon2.3.4.1) Keep i
1. The company I chose was Amazon2.3.4.1) Keep i1. The company I chose was Amazon2.3.4.1) Keep i
1. The company I chose was Amazon2.3.4.1) Keep i
 
1. Think about a persuasive speech that you would like to present
1. Think about a persuasive speech that you would like to present 1. Think about a persuasive speech that you would like to present
1. Think about a persuasive speech that you would like to present
 
1. The two properties about a set of measurements of a dependent v
1. The two properties about a set of measurements of a dependent v1. The two properties about a set of measurements of a dependent v
1. The two properties about a set of measurements of a dependent v
 
1. The Danube River flows through 10 countries. Name them in the s
1. The Danube River flows through 10 countries. Name them in the s1. The Danube River flows through 10 countries. Name them in the s
1. The Danube River flows through 10 countries. Name them in the s
 
1. The 3 genes that you will compare at listed below. Take a look.
1. The 3 genes that you will compare at listed below. Take a look.1. The 3 genes that you will compare at listed below. Take a look.
1. The 3 genes that you will compare at listed below. Take a look.
 
1. Student and trainer detailsStudent details Full nameStu
1. Student and trainer detailsStudent details  Full nameStu1. Student and trainer detailsStudent details  Full nameStu
1. Student and trainer detailsStudent details Full nameStu
 
1. Student uses MS Excel to calculate income tax expense or refund
1. Student uses MS Excel to calculate income tax expense or refund1. Student uses MS Excel to calculate income tax expense or refund
1. Student uses MS Excel to calculate income tax expense or refund
 
1. Socrates - In your view, what was it about Socrates’ teachings
1. Socrates - In your view, what was it about Socrates’ teachings 1. Socrates - In your view, what was it about Socrates’ teachings
1. Socrates - In your view, what was it about Socrates’ teachings
 
1. Select a patient” (friend or family member) on whom to perform
1. Select a patient” (friend or family member) on whom to perform1. Select a patient” (friend or family member) on whom to perform
1. Select a patient” (friend or family member) on whom to perform
 
1. Respond to your classmates’ question and post. Submission to y
1. Respond to your classmates’ question and post.  Submission to y1. Respond to your classmates’ question and post.  Submission to y
1. Respond to your classmates’ question and post. Submission to y
 
1. Review the HCAPHS survey document, by clicking on the hyperlink
1. Review the HCAPHS survey document, by clicking on the hyperlink1. Review the HCAPHS survey document, by clicking on the hyperlink
1. Review the HCAPHS survey document, by clicking on the hyperlink
 
1. Saint Leo Portal loginUser ID[email protected]
1. Saint Leo Portal loginUser ID[email protected]          1. Saint Leo Portal loginUser ID[email protected]
1. Saint Leo Portal loginUser ID[email protected]
 
1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea
1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea
1. Reference is ch. 5 in the e-text, or ch. 2 in paper text...plea
 

Recently uploaded

SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
Peter Brusilovsky
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
中 央社
 

Recently uploaded (20)

An Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppAn Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge App
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportBasic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
 
Major project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesMajor project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategies
 
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
 
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading RoomSternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
 
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
the range between two outcomes (15), but it is not known whether these same dopamine signals reflect probability distributions over reward values.
Here, we asked whether the shapes of probability distributions were reflected in dopamine reward prediction error responses. We created two discrete reward size distributions that reflected, roughly, the shapes of normal and uniform distributions, and we trained monkeys to predict rewards drawn from these distributions. Crucially, the normal distribution resulted in rare prediction errors following rewards drawn from its tails, whereas the same rewards, with identical prediction errors, were drawn with greater frequency from the uniform distribution. We found that monkeys learned to choose the better option within fewer trials when rewards were drawn from normal distributions than when rewards were drawn from uniform distributions. Moreover, we found that pupil diameter was correlated with rare prediction errors but uncorrelated with common prediction errors of the same magnitude, suggesting greater vigilance to rare outcomes. Using single-neuron recordings, we show that dopamine responses reflect the shape of the predicted reward distribution. Specifically, rare prediction errors evoked significantly larger responses than common prediction errors with identical magnitudes. Together, these results demonstrate a mechanism for learning based on probability distributions that is complementary to, but distinct from, TD-like reward prediction error responses.

Results

Reward size distributions

We used non-informative images (fractal pictures) to predict rewards drawn from differently shaped distributions. Distribution shapes were defined according to relative reward frequency. One fractal image predicted an equal probability of receiving a small, medium, or large volume of juice reward; we define this as the uniform reward size distribution (Fig. 1A, left). A second fractal image predicted that the small and large rewards would each be given 2 out of 15 times, and that the medium-sized reward would be given the remaining 11 out of 15 times (normal reward size distribution, Fig. 1A, right). Importantly, both reward size distributions were symmetrical and comprised the same three reward magnitudes. Therefore, the uniform and normal distributions had identical Pascalian expected values. However, rewards drawn from the tails of the normal distribution were rare compared to identical rewards drawn from the tails of the uniform distribution. Anticipatory licking reflected the expected values of both distributions, as well as the expected values of safe cues (Fig. 1B, p = 0.019, linear regression). Thus, the animals learned that the cues predicted rewards.
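For illustration, the following sketch (not the authors' task code; reward sizes of 0.2, 0.4, and 0.6 ml are assumed here from the volumes reported below) shows that the two distributions share an expected value while assigning very different probabilities to the tail rewards:

```python
import numpy as np

# Reward sizes (ml) and relative frequencies for the two cue types.
# Sizes are illustrative; frequencies follow Fig. 1A.
sizes = np.array([0.2, 0.4, 0.6])
uniform_p = np.array([1/3, 1/3, 1/3])      # uniform reward size distribution
normal_p = np.array([2/15, 11/15, 2/15])   # "normal-like" reward size distribution

# Both cues predict the same Pascalian expected value (0.4 ml) ...
ev_uniform = np.dot(sizes, uniform_p)
ev_normal = np.dot(sizes, normal_p)

# ... so a tail reward (e.g., 0.6 ml) generates the same conventional RPE (Eq. 1)
rpe_uniform = 0.6 - ev_uniform
rpe_normal = 0.6 - ev_normal

# ... but that tail reward is rare under the normal cue and common under the uniform cue.
print(ev_uniform, ev_normal)        # 0.4 0.4
print(rpe_uniform, rpe_normal)      # 0.2 0.2
print(uniform_p[2], normal_p[2])    # ~0.33 vs ~0.13
```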
Dopamine neurons code prediction errors in the subjective values of rewards (16), according to the following equation:

Prediction error = received subjective value − predicted subjective value.   Eq. 1

To investigate whether dopamine responses were sensitive to probability distributions, we sought to compare dopamine responses when the conventional prediction errors, defined according to Eq. 1, were identical. Therefore, we created reward distributions with the same expected values and then measured the distributions' subjective values. We measured the subjective values of six reward distributions: three normal reward distributions with center locations of 0.3, 0.4, and 0.5 ml, and three uniform distributions with center locations of 0.3, 0.4, and 0.5 ml. Monkeys made choices between cues that predicted a distribution and safe values (Fig. 1C, Methods). We plotted the probability of choosing the safe option as a function of the safe option volume and generated psychometric functions (Fig. 1D). We used the certainty equivalents (CEs) – defined as the safe volume at the point of subjective equivalence between the two options (Fig. 1D, vertical dashed line) – as a measure of each distribution's subjective value. Analysis of variance (ANOVA) on the CEs for the normal and uniform distribution cues indicated a significant effect of EV on CE (F = 71.81, p < 0.001, ANOVA). Crucially, we found no significant effect of distribution type on the CEs (F = 3.57, p = 0.07, ANOVA). To increase our power to detect small differences in subjective value, we combined the CE data from all three EVs, yet still found no significant difference between the subjective values of the normal and uniform distributions (p = 0.20, two-sample t-test). Therefore, within the limited range of reward sizes we used, the data indicate that the normal and uniform reward size distributions had similar subjective values. These results indicated that the prediction errors generated from the distributions could be readily compared and ensured that disparities between prediction error responses were not driven by differences in the predicted subjective values.
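As a rough sketch of how a CE can be read off a psychometric function (the choice probabilities below are hypothetical; the authors' exact fitting procedure is described in the Supplementary Methods):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """P(choose safe) as a function of safe volume x; x0 is the CE, k the slope."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical data: safe volumes offered (ml) and observed P(choose safe)
safe_ml = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70])
p_safe = np.array([0.05, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95, 1.00])

(ce, slope), _ = curve_fit(logistic, safe_ml, p_safe, p0=[0.4, 10.0])
print(f"certainty equivalent ~ {ce:.3f} ml")  # safe volume at subjective equivalence (P = 0.5)
```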
Figure 1. Reward size distributions. A. Fractal images predicted that one of three juice reward sizes would be drawn from a uniform (left) or normal (right) reward size distribution. The uniform distribution predicted that each size would be drawn with equal relative frequency (R.F.; 1/3 for small, medium, and large rewards). The normal distribution predicted that the small and large reward volumes would each be drawn 2/15 times, and that the medium reward volume would be drawn the remaining 11/15 times. B. Anticipatory licking indicated that the monkeys learned the predicted reward values. Black dots indicate the normalized licking duration for each predicted reward volume, and the grey dotted line indicates the linear fit to the data. Error bars indicate SEM across sessions. C. The choice task used to measure subjective value. Animals made saccade-directed choices between a distribution-predicting cue and a safe alternative option. The safe alternative option was a value bar with a minimum of 0 ml at the bottom and a maximum of 0.8 ml at the top. The intersection between the horizontal bar and the scale indicated the volume of juice that would be received if the monkeys selected the safe cue. D. Probability of choosing the safe cue as a function of the value of the safe option, when the distribution-predicting cue had an expected value (EV) of 0.4 ml. Dots show average choice probability for 9 safe value options for monkey B (left) and monkey S (right). Solid lines are logistic fits to the data. Red indicates data from normal distribution blocks; grey indicates data from uniform distribution blocks. The dashed horizontal lines indicate subjective equivalence, and the CE for each distribution type is indicated with the dashed vertical lines.

Distribution shape affects learning

We next investigated how probability distributions affected learning (Fig. 2A). Each learning block consisted of 15-25 trials and used two never-before-seen cues (Supplementary Methods). Both cues predicted either normal or uniform distributions and had two different expected values (EVs). The animals had to learn the relative EVs to maximize reward. We analyzed behavioral choice data from two animals across 16 sessions, each of which included a mixture of normal and uniform blocks (Supplementary Methods). As expected, the probability of choosing the higher-value option on the first trial of each block was not significantly different from chance (Fig. 2B, left). Comparison between the first and fifteenth trials of each block revealed that the animals learned to choose the higher-value option on 87% and 85% of trials in the normal and uniform blocks, respectively (Fig. 2B, p < 0.001, paired t-test for both distributions). There was no significant difference in average performance between the fifteenth trials of normal and uniform blocks (Fig. 2B, right, p = 0.7, t-test). Together, these results demonstrate that the animals learned which option had the higher EV.

We hypothesized that the animals would learn faster from reward sizes sampled from the normal distribution than from the uniform distribution. We used a reinforcement learning (RL) model to quantify the learned values (Supplementary Methods). Our model, fit to the behavioral choices, performed well at predicting the true values of the two choice options (Fig. 2C). To differentiate active learning from stable, asymptotic-like behavior, we fit a logarithmic function to the estimated prediction errors (Fig. 2D). When the change in the log-fitted values fell below a robust threshold, we considered the values to be learned (Supplementary Methods). Using this approach, we determined the block-wise number of trials needed to learn. We found that, on average, animals needed 4 additional trials to learn in the uniform blocks compared to the normal blocks (Fig. 2E, p < 0.001, Mann-Whitney U test). These data demonstrate that the monkeys learned faster from normal reward distributions than from uniform reward distributions. Thus, rare prediction errors have greater behavioral relevance than common prediction errors of the same magnitude.
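A minimal sketch of this style of analysis, assuming a Rescorla-Wagner value update and a simple threshold criterion applied to a logarithmic fit of the trial-wise prediction errors (the learning rate, threshold, and function names below are placeholders, not the authors' fitted model):

```python
import numpy as np

def fit_block(choices, rewards, alpha=0.3):
    """Track values of the two options across one block and return trial-wise |RPE|.

    choices: array of 0/1 (option chosen on each trial)
    rewards: array of delivered reward volumes (ml)
    """
    v = np.zeros(2)                 # learned values of the two cues
    abs_rpe = []
    for c, r in zip(choices, rewards):
        rpe = r - v[c]              # conventional prediction error (Eq. 1)
        v[c] += alpha * rpe         # Rescorla-Wagner style update
        abs_rpe.append(abs(rpe))
    return np.array(abs_rpe)

def trials_to_learn(abs_rpe, threshold=0.05):
    """Declare values 'learned' once the trial-to-trial change of a logarithmic fit
    to |RPE| falls below a threshold (placeholder criterion)."""
    t = np.arange(1, len(abs_rpe) + 1)
    a, b = np.polyfit(np.log(t), abs_rpe, 1)   # |RPE| ~ a*log(t) + b
    fitted = a * np.log(t) + b
    change = np.abs(np.diff(fitted))
    below = np.where(change < threshold)[0]
    return int(below[0]) + 1 if below.size else len(abs_rpe)
```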
Figure 2. Faster learning from normal reward distributions. A. Animals made choices between two never-before-seen fractal images with different expected values (EVs) but the same distribution type (normal or uniform). B. Box and whisker plots show the probability of choosing the higher-valued option on trials 1 and 15, separated by normal (red) and uniform (grey) blocks. Triangles represent the averages. n.s. = not significantly different. C. Actual (blue) and estimated (orange) value differences for the two choice options. The primary y-axis shows the EV differences between the two choice options, and the x-axis shows trial number. The block-wise changes in the true value difference are reflected by the model. The black tick marks correspond to correct and incorrect choices, defined by the relative expected values and indicated by the secondary y-axis. D. Fitted prediction errors as a function of trial within normal (red) and uniform (grey) blocks. The dashed line indicates the threshold for designating that block values have been learned. Dots indicate the transition from the learning to the stable phase. E. Box and whisker plot showing the number of trials in the learning phase for normal and uniform distributions. The plot follows the same format as panel B.

To investigate autonomic responses to prediction errors, we analyzed pupil responses during the choice task (Fig. 3A). We used deconvolution to separate the effects of distinct trial events on pupil responses (Supplementary Methods). The deconvolution analysis indicated that the correlation between pupil responses and RPE was higher during the learning phase of normal blocks than during the learning phase of uniform blocks (Fig. 3B, p = 0.014, Wilcoxon rank-sum test). There were no significant differences in the correlation during the stable phases of normal and uniform blocks (Fig. 3B, p = 0.78, Wilcoxon rank-sum test). Thus, pupil responses indicated greater arousal to rare prediction errors during learning.
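Deconvolution of this kind is commonly implemented as a least-squares regression of the pupil trace onto time-lagged event regressors; the sketch below assumes that generic approach (the authors' exact implementation is in the Supplementary Methods):

```python
import numpy as np

def deconvolve(pupil, event_times, n_lags, sr):
    """Estimate an event-locked pupil kernel by least squares.

    pupil:       z-scored pupil diameter, sampled at sr Hz
    event_times: event onsets in seconds (e.g., cue or reward times)
    n_lags:      number of post-event samples in the estimated kernel
    sr:          sampling rate (Hz)
    """
    n = len(pupil)
    onsets = np.zeros(n)
    onsets[(np.asarray(event_times) * sr).astype(int)] = 1.0

    # Design matrix: one "stick" regressor per post-event lag
    X = np.zeros((n, n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = onsets[:n - lag]

    kernel, *_ = np.linalg.lstsq(X, pupil, rcond=None)
    return kernel
```

Trial-wise response amplitudes estimated in this way can then be correlated with model-derived RPEs, separately for the learning and stable phases.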
Figure 3. Pupil diameter was correlated with reward prediction error during learning of normal distributions. A. Averaged normalized pupil diameter response (light grey) to different epochs during behavior. B. Box and whisker plots showing the correlation between pupil diameter and RPE in the learning and stable phases of normal (red) and uniform (grey) blocks. The plot follows the same format as Figure 2B.

Dopamine responses to rare rewards

After showing that probability distributions were reflected in learning behavior and autonomic responses, we investigated the neural correlates of probability distributions. To do so, we recorded extracellular dopamine neuron action potentials in a passive viewing task wherein monkeys viewed distribution-predicting cues and received rewards (Supplementary Methods). Dopamine neurons were similarly activated by both distribution-predicting cues, as would be expected from responses to cues with similar values (Fig. 4A). At the time of reward delivery, dopamine neurons are activated or suppressed by rewards that are better or worse than predicted, respectively. Therefore, we expected dopamine neurons to be activated by delivery of 0.6 ml and suppressed by delivery of 0.2 ml. Surprisingly, dopamine activations to delivery of 0.6 ml were larger in normal distribution trials than following delivery of the same volume in uniform distribution trials (Fig. 4B, solid lines). Similarly, dopamine responses were more strongly suppressed by delivery of 0.2 ml during normal distribution trials than by the same reward during uniform distribution trials (Fig. 4B, dashed lines). Because the reward-predicting cues had the same subjective values, this response amplification occurred despite the fact that the conventional reward prediction errors were identical (Eq. 1). Moreover, because the amplification was bi-directional – both activations and suppressions were amplified in single neurons – this effect could not be attributed to differences in predicted subjective values. Thus, the effects we observed were robust to errors in the measurement of subjective value.

We sought to quantify the response amplification (Fig. 4B) across neurons. The dynamic ranges of dopamine responses are larger in the positive domain than in the negative domain. Therefore, we transformed the responses across the positive and negative domains onto a linear scale (Supplementary Methods). We then used linear regression to measure the response slopes of each neuron to positive and negative RPEs generated during normal and uniform distribution trials, and plotted the response slopes for each neuron in both conditions (Fig. 4C). A majority of neurons had a steeper slope for rare RPEs generated during normal distribution trials than for common RPEs generated during uniform distribution trials (Fig. 4C inset, p = 0.001, n = 60 neurons, one-sample t-test). To ensure that the amplification effect was bi-directional and not driven solely by positive responses, we separately analyzed the positive and negative prediction error responses for the 39 neurons above the unity line in Figure 4C. Across this subpopulation, both activations and suppressions were significantly amplified (Fig. 4D, negative and positive prediction error responses, p < 0.01 and p < 0.001, respectively, Wilcoxon rank-sum test). There were no significant differences in the positive or negative responses in the subpopulation below the unity line (p > 0.05, n = 21, Wilcoxon rank-sum test). Population peri-stimulus time histograms (PSTHs) for the amplified subpopulation show differences in both the positive and negative response domains (Fig. 4E). Thus, phasic dopamine responses are amplified by rare RPEs. These data demonstrate that RPE responses are sensitive to predicted probability distributions.
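A sketch of the per-neuron slope comparison, assuming responses have already been mapped onto the common linear scale described above (the inputs are hypothetical arrays of normalized responses and model RPEs, not the recorded data):

```python
import numpy as np
from scipy import stats

def response_slope(rpe, response):
    """Linear-regression slope of normalized reward response against RPE."""
    slope, intercept, r, p, se = stats.linregress(rpe, response)
    return slope

def amplification_index(rpe_norm, resp_norm, rpe_unif, resp_unif):
    """Per-neuron slope difference between distribution conditions.

    Positive values indicate steeper RPE coding for rare (normal-distribution) rewards.
    """
    return response_slope(rpe_norm, resp_norm) - response_slope(rpe_unif, resp_unif)
```

Testing these per-neuron slope differences against zero is in the spirit of the one-sample t-test on differences from the unity line reported for the Fig. 4C inset.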
Figure 4. Amplified dopamine reward prediction error responses to rare rewards. A. Population PSTHs of conditioned stimulus (CS) responses to the normal and uniform distribution-predicting cues. There was no significant difference between the population responses (p > 0.6 for both the early and late response components, indicated on the x-axis by the light and dark grey bars, respectively). B. Single neuron reward responses from both monkeys. Top: PSTHs show impulse rate as a function of time, aligned to the time of reward (black vertical bar). Light grey bars along the x-axis indicate the response window used for analysis. Bottom: Raster plots, separated by normal and uniform distributions and by reward sizes 0.2 and 0.6 ml. Every line represents the time of an action potential, and every row represents a trial. C. Reward response slopes for every neuron in normal and uniform distribution trials. Each dot represents one neuron (n = 60 neurons), and the two circled neurons are the example neurons in B. The inset shows the data as a function of the difference from the unity line (diagonal). D. Change in normalized impulse rate from baseline in normal and uniform distribution trials, for negative RPE responses (left) and positive RPE responses (right), shown as mean (SEM) across the 39 neurons above the unity line. E. Population PSTH for neurons that showed amplified responses to the same rewards in the normal and uniform distribution trials. In panels A, B, D, and E, red and grey colors indicate normal and uniform distributions, respectively. In A, B, and E, solid PSTHs represent positive prediction error responses, and dashed PSTHs represent negative prediction error responses.

The passive viewing task used to collect the neural data (Fig. 4) offers the greatest level of control, because during each trial only one reward-predictive cue is shown to the animals and no choices are required. To validate that the neuronal sensitivity to probability distributions was also present during complex behaviors, we recorded an additional 8 and 12 neurons from monkeys B and S, respectively, while the animals learned the predicted values and made choices (Fig. 1C). We used the trial-wise RPEs derived from the model to categorize each response as a positive or negative prediction error response (Supplementary Methods). We found that positive and negative RPEs elicited amplified neural responses in normal distribution trials compared to uniform distribution trials (Fig. 5A). The amplification effect was significant in the population (Fig. 5B, negative RPE: p = 0.019; positive RPE: p = 0.001, n = 20 neurons, Wilcoxon rank-sum test). These findings confirm that dopamine responses are sensitive to probability distributions during complex behaviors and demonstrate that amplified dopamine responses can guide active learning and decision making.

Figure 5. Amplification in dopamine neurons persists in learning conditions. A. Single neuron PSTHs aligned to reward (black vertical line) show amplified responses to positive and negative RPEs in normal distribution trials, compared to uniform distribution trials. Solid PSTHs indicate positive prediction error responses, and dashed PSTHs indicate negative prediction error responses. The light grey bar along the x-axis indicates the response window used for analysis. B. Change in normalized impulse rate from baseline in normal and uniform conditions, for negative RPE responses (left) and positive RPE responses (right). In both panels, red and grey colors indicate normal and uniform distributions, respectively.

Discussion
Here we show that dopamine-dependent learning behavior and dopamine reward prediction error responses reflect probability distributions. Rare prediction errors – compared to commonly occurring prediction errors of the same magnitude – evoke faster learning, increased autonomic arousal, and amplified neural learning signals. Together, these data reveal a novel computational paradigm for phasic dopamine responses that is distinct from, but complementary to, conventional reward prediction errors.

Rare events are often highly significant (17). Our data show that decision makers exhibit greater vigilance towards rare rewards and learn more from rare reward prediction errors. Amplified dopamine prediction error responses provide a mechanistic account for these behavioral effects.

More than 20 years ago, single unit recordings demonstrated the importance of unpredictability for dopamine neuron responses (2). Since then, the successful application of TD learning theory to dopamine signals has largely subsumed the role of unpredictability and recast it within the framework of value-based prediction errors (18, 19). In most experimental paradigms, reward unpredictability is captured by the frequentist probability of rewards and, therefore, unpredictability is factored directly into expected value (5, 7, 20-22). The resulting Pascalian expected values are learned by TD models (23) and reflected in dopamine responses (5, 20, 22). Here, we dissociated unpredictability from expected value by pseudorandomly drawing rewards from symmetric probability distributions with equal values. Monkeys learned faster from the more unpredictable rewards drawn from the tails of normal distributions. Likewise, dopamine responses were amplified by greater unpredictability, even when the conventional prediction error described in Eq. 1 was identical. These data reinforce the importance of unpredictability for dopamine responses and learning.

The amplified dopamine responses to rare rewards suggest that reinforcement learning approaches that acquire only the average value of past outcomes are insufficient to describe the phasic activity of dopamine neurons. In fact, dopamine responses distinguish between predicted reward distributions, even when the average expected values and average subjective values of those distributions are identical. Consequently, monkeys and their dopamine neurons appear to learn probability distributions, and this information is used to boost performance and gain more reward. Therefore, these data suggest that an updated RL algorithm that learns higher moments of probability distributions, such as Kalman TD (24), will provide the proper conceptual framework to explain information processing in dopamine responses.
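As one concrete example of such an algorithm, the scalar sketch below implements uncertainty-weighted value updating in the spirit of Kalman TD (24); it is an illustration rather than the model used in this study, and the noise variances are placeholders:

```python
def kalman_value_update(m, P, reward, Q=0.01, R=0.04):
    """One trial of uncertainty-weighted value learning (scalar Kalman filter).

    m, P   : prior mean and variance of the predicted value
    reward : delivered reward (ml)
    Q, R   : process and observation noise variances (placeholders)
    """
    P = P + Q                 # predictive uncertainty grows between trials
    K = P / (P + R)           # Kalman gain: an uncertainty-scaled learning rate
    rpe = reward - m          # conventional prediction error (Eq. 1)
    m = m + K * rpe           # larger updates when the prediction is uncertain
    P = (1.0 - K) * P         # uncertainty shrinks after each observation
    return m, P
```

In such schemes the effective learning rate tracks the learner's uncertainty rather than being fixed; algorithms that additionally learn the shape of the reward distribution could further weight updates by how improbable each delivered reward is.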
Distribution learning is critical to Bayesian inference and, indeed, the results that we show here are consistent with a signal that could guide Bayesian inference to optimize choices and maximize rewards. However, further experimentation is required to understand whether dopamine signals actually support Bayesian inference. Distribution learning is also emerging as an important approach in machine learning and artificial intelligence (AI) (25). Biological learning signals have inspired deep reinforcement learning algorithms whose performance exceeds expert human performance on Atari games, chess, and Go (26, 27). The distribution-sensitive neural signals that we observed here could offer further guidance to computational learning; emphasizing the impact of rare events drawn from the tails of reward distributions could restrict the necessary search space.

Dopamine reward prediction error signals participate in Hebbian learning (28, 29), and these signals are likely responsible for updating action values stored in the striatum (30). The amplified dopamine responses, coupled with the faster learning dynamics observed here, suggest that the magnitude of dopamine release affected cellular learning mechanisms in the striatum. Moreover, amplified dopamine responses have the ability to modulate dopamine concentrations in the prefrontal cortex (PFC). The level of PFC dopamine is tightly linked to neuronal signaling and working memory performance (31). Therefore, amplification of dopamine could explain the exaggerated salience of real – and possibly imagined – rare events, and suggests a neural mechanism for the aberrant learning observed in mental health disorders, such as schizophrenia.

Conclusion:

Dopamine neurons code reward prediction errors (RPEs), which are classically defined as the differences between received and predicted rewards. However, our data show that predicted probability distributions, rather than just the predicted average values, affect dopamine responses. Specifically, rare rewards drawn from normal distributions amplified dopamine responses compared to the same rewards drawn more frequently from uniform distributions. Crucially, the classically defined RPEs had identical magnitudes following rare and common rewards. From a behavioral perspective, rare prediction errors often signal underlying changes in the environment and therefore demand greater vigilance. From a theoretical perspective, this result demonstrates that biologically inspired reinforcement learning algorithms should account for full probability distributions, rather than just point estimates.
References and Notes

1. T. Ljungberg, P. Apicella, W. Schultz, Responses of monkey dopamine neurons during learning of behavioral reactions. J Neurophysiol 67, 145-163 (1992).
2. J. Mirenowicz, W. Schultz, Importance of unpredictability for reward responses in primate dopamine neurons. J Neurophysiol 72, 1024-1027 (1994).
3. J. Mirenowicz, W. Schultz, Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature 379, 449-451 (1996).
4. P. Waelti, A. Dickinson, W. Schultz, Dopamine responses comply with basic assumptions of formal learning theory. Nature 412, 43-48 (2001).
5. C. D. Fiorillo, P. N. Tobler, W. Schultz, Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898-1902 (2003).
6. H. M. Bayer, B. Lau, P. W. Glimcher, Statistics of midbrain dopamine neuron spike trains in the awake primate. J Neurophysiol 98, 1428-1439 (2007).
7. H. Nakahara, H. Itoh, R. Kawagoe, Y. Takikawa, O. Hikosaka, Dopamine neurons can represent context-dependent prediction error. Neuron 41, 269-280 (2004).
8. E. S. Bromberg-Martin, M. Matsumoto, S. Hong, O. Hikosaka, A pallidus-habenula-dopamine pathway signals inferred stimulus values. J Neurophysiol 104, 1068-1076 (2010).
9. E. E. Steinberg et al., A causal link between prediction errors, dopamine neurons and learning. Nat Neurosci 16, 966-973 (2013).
10. C. Y. Chang et al., Brief optogenetic inhibition of dopamine neurons mimics endogenous negative reward prediction errors. Nat Neurosci 19, 111-116 (2016).
11. W. R. Stauffer et al., Dopamine neuron-specific optogenetic stimulation in rhesus macaques. Cell 166, 1564-1571.e6 (2016).
12. K. M. J. Diederen, W. Schultz, Scaling prediction errors to reward variability benefits error-driven learning in humans. J Neurophysiol 114, 1628-1640 (2015).
13. I. Krajbich, C. Armel, A. Rangel, Visual fixations and the computation and comparison of value in simple choice. Nat Neurosci 13, 1292-1298 (2010).
14. M. Milosavljevic, J. Malmaud, A. Huth, C. Koch, A. Rangel, The Drift Diffusion Model can account for the accuracy and reaction time of value-based choices under high and low time pressure. Judgm Decis Mak 5, 437-449 (2010).
15. P. N. Tobler, C. D. Fiorillo, W. Schultz, Adaptive coding of reward value by dopamine neurons. Science 307, 1642-1645 (2005).
16. W. R. Stauffer, A. Lak, W. Schultz, Dopamine reward prediction error responses reflect marginal utility. Curr Biol 24, 2491-2500 (2014).
17. N. N. Taleb, The Black Swan: The Impact of the Highly Improbable (Random House Publishing Group, 2007).
18. W. Schultz, P. Dayan, P. R. Montague, A neural substrate of prediction and reward. Science 275, 1593-1599 (1997).
19. P. R. Montague, P. Dayan, T. J. Sejnowski, A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 16, 1936-1947 (1996).
20. A. Lak, W. R. Stauffer, W. Schultz, Dopamine neurons learn relative chosen value from probabilistic rewards. Elife 5, (2016).
21. M. Matsumoto, O. Hikosaka, Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature 459, 837-841 (2009).
22. G. Morris, A. Nevet, D. Arkadir, E. Vaadia, H. Bergman, Midbrain dopamine neurons encode decisions for future action. Nat Neurosci 9, 1057-1063 (2006).
23. R. Sutton, A. Barto, Reinforcement Learning: An Introduction (MIT Press, 1998).
24. S. J. Gershman, A unifying probabilistic view of associative learning. PLoS Comput Biol 11, e1004567 (2015).
25. M. G. Bellemare, W. Dabney, R. Munos, paper presented at the Proceedings of the 34th International Conference on Machine Learning, Volume 70, Sydney, NSW, Australia, 2017.
26. V. Mnih et al., Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015).
27. D. Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140-1144 (2018).
28. Z. Brzosko, S. Zannone, W. Schultz, C. Clopath, O. Paulsen, Sequential neuromodulation of Hebbian plasticity offers mechanism for effective reward-based navigation. Elife 6, (2017).
29. W. Shen, M. Flajolet, P. Greengard, D. J. Surmeier, Dichotomous dopaminergic control of striatal synaptic plasticity. Science 321, 848-851 (2008).
30. K. Samejima, Y. Ueda, K. Doya, M. Kimura, Representation of action-specific reward values in the striatum. Science 310, 1337-1340 (2005).
31. S. Vijayraghavan, M. Wang, S. G. Birnbaum, G. V. Williams, A. F. Arnsten, Inverted-U dopamine D1 receptor actions on prefrontal neurons engaged in working memory. Nat Neurosci 10, 376-384 (2007).

Acknowledgments: The authors would like to thank Andreea Bostan for thoughtful comments and discussion of the manuscript, and Jacquelyn Breter for her work in animal care and enrichment for this project. Funding: This research was supported by the University of Pittsburgh Brain Institute (WRS) and NIH DP2MH113095 (WRS). Author contributions: KMR and WRS conceptualized and designed the experiments. KMR, AA, and WRS collected the data. KMR, TH, and WRS analyzed the data. All authors discussed the results of the experiment. KMR, TH, and WRS wrote the manuscript. All authors revised the manuscript. Competing interests: The authors declare no competing interests. Data and materials availability: Available upon request.

Supplementary Materials:
Materials and Methods
Figures S1-S2
References