Neuroeconomic Class


  1. Neuroeconomic class, 11/3/2009, Alessandro Grecucci
  2. What is "reward"?
     An object or event that generates approach and consummatory behavior, produces learning, or represents a positive outcome of an economic decision.
     What is its function?
     In the absence of dedicated receptors for reward, the brain evolved a REWARD SIGNAL to modulate the neural processes underlying behavior.
  3. How is it implemented in the brain?
     Pure reward neurons do exist!
     How do they work?
     Prediction error / the Rescorla-Wagner rule
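The Rescorla-Wagner rule mentioned in this slide can be sketched as a simple update loop (a minimal illustration; the learning rate, asymptote, and trial count are assumptions for the example, not values from the slides):

```python
def rescorla_wagner(trials, alpha=0.1, lam=1.0):
    """Rescorla-Wagner learning: associative strength V is updated on
    each trial by the prediction error (lam - V), scaled by the
    learning rate alpha. lam is the asymptote set by the US."""
    V = 0.0
    history = []
    for _ in range(trials):
        delta = lam - V          # prediction error shrinks as V approaches lam
        V += alpha * delta
        history.append(V)
    return history

vals = rescorla_wagner(50)
# V rises toward the asymptote with progressively smaller prediction errors
```

The diminishing prediction error over trials is exactly the profile attributed to dopamine responses to a repeatedly predicted reward.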
  4. Decision making is based on value.
     1) PAVLOVIAN VALUES
        US -> UR; US + NS ...; CS -> CR
     2) GOAL VALUES (instrumental learning)
        Goal-directed actions are based on two factors:
        1) the contingency between action and outcome
        2) the valuation of the outcome
     3) HABIT VALUES
        Values that do not satisfy the criteria for goal values.
  5. (figure slide, no text)
  6. 1) Pavlovian values (EXPECTANCY)
     2) Goal-directed actions (REWARD)
     3) Habit values (REINFORCEMENT)
  7. 1) MARKOV decision processes
     Ingredients:
     - states s
     - time steps t
     - transitions s(t) -> s(t+1)
     - transition probabilities T(s, a; s') = P
     - rewards r(t) = R(s(t))
     But "s" is unknown ...
     State-action value function:
     Q(s, a) = E[r(t) + r(t+1) + r(t+2) + ...]
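A minimal tabular sketch of learning the state-action value function Q(s, a) defined above. The two-state chain environment and all parameter values are invented for illustration; this is standard Q-learning, not a model from the slides:

```python
import random

def q_learning(n_episodes=2000, alpha=0.2, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a toy MDP.
    States: 0 (start), 1 (goal, terminal with reward 1).
    Actions: 0 = stay (reward 0), 1 = move toward the goal."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    for _ in range(n_episodes):
        s = 0
        for _ in range(10):                     # cap episode length
            if rng.random() < eps:              # epsilon-greedy exploration
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda x: Q[(s, x)])
            s_next = 1 if a == 1 else s
            r = 1.0 if s_next == 1 else 0.0
            if s_next == 1:                     # terminal: no future value
                target = r
            else:
                target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # prediction-error update
            if s_next == 1:
                break
            s = s_next
    return Q
```

The update term `target - Q[(s, a)]` is the prediction error that drives all learning, tying this slide back to the Rescorla-Wagner idea on the previous one.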
  8. 2) The ACTOR/CRITIC model
     Separates value learning into:
     1) learning about states (Pavlovian)
     2) learning about actions (instrumental): A = Q(s, a) - V(s)
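The actor/critic split can be sketched on a one-state choice problem (the bandit setup, reward means, and learning rates are invented for illustration): the critic learns the state value V, and the actor reinforces actions using the TD error, a sample estimate of the advantage A = Q(s, a) - V(s).

```python
import math, random

def actor_critic(n_episodes=3000, alpha_v=0.1, alpha_p=0.1, seed=1):
    """Actor/critic on a one-state bandit: two actions with mean rewards
    0.2 and 0.8. The critic tracks the state value V; the actor updates
    action preferences with the TD error delta = r - V."""
    rng = random.Random(seed)
    V = 0.0
    prefs = [0.0, 0.0]                       # actor's action preferences
    means = [0.2, 0.8]
    for _ in range(n_episodes):
        exps = [math.exp(p) for p in prefs]  # softmax policy
        probs = [e / sum(exps) for e in exps]
        a = 0 if rng.random() < probs[0] else 1
        r = means[a] + rng.gauss(0, 0.1)
        delta = r - V                        # critic's prediction error
        V += alpha_v * delta                 # critic: learn the state value
        prefs[a] += alpha_p * delta          # actor: reinforce advantageous action
    return V, prefs
```

Actions that pay more than the current state value get a positive delta and are strengthened; actions that pay less are weakened, so the actor converges on the better option while V converges on the reward it yields.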
  9. EXPECTED REWARD
     Pavlovian values: amygdala, OFC, ventral striatum
     LEARNING OF STATE VALUES
     Reinforcement learning is based on prediction-error signals, that is, on the dopamine signal.
  10. The ACTOR/CRITIC model predicts separate processes and dedicated structures:
      Actor -> Dorsal Striatum
      Critic -> Ventral Striatum
      (Montague; O'Doherty et al. 2004)
  11. Goal-directed actions (Valentin et al. 2007)
  12. Chapter 21 appears to make the case that prediction-error processing is energy efficient and may have been selected for over evolution. How plausible is this?
      The prediction error appears to be sensitive to time - is there any work that has examined whether it is sensitive to space?
      It appears from Chapter 24 that there are two classes of instrumental behaviour - those which remain goal-directed and those which become automatic ("habits"). These "habits" sound a bit like fixed-interval reinforcement schedules, while the "goal-directed" behaviours sound like fixed-ratio (or even variable-ratio) schedules. An example of a habit is discussed in which a rat continues to press a lever to obtain (the now) poisonous saccharin - but the rat is not consuming the saccharin. Has the overtraining "transferred" the rewarding value to the lever pressing (in that pressing the lever is in itself rewarding)?
  13. Adaptive resonance theory (ART) by Carpenter & Gaddam
      Gershman SJ, Pesaran B, Daw ND. Human reinforcement learning subdivides structured action spaces by learning effector-specific values. J Neurosci. 2009 Oct 28;29(43):13524-31.
      How can neural systems for reinforcement learning - such as prediction-error signals for action valuation associated with dopamine and the striatum - cope with this "curse of dimensionality"?
      We propose a reinforcement learning framework that allows learned action valuations to be decomposed into effector-specific components when appropriate to a task, and test it by studying to what extent human behavior and blood oxygen level-dependent (BOLD) activity can exploit such a decomposition in a multieffector choice task. Subjects made simultaneous decisions with their left and right hands and received separate reward feedback for each hand movement. We found that choice behavior was better described by a learning model that decomposed the values of bimanual movements into separate values for each effector.
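The decomposition described in the abstract can be sketched as follows (a toy illustration of the general idea, not the authors' actual model; the function and parameter names are invented):

```python
def decomposed_update(q_left, q_right, a_left, a_right, r_left, r_right, alpha=0.1):
    """Effector-specific value learning: each hand's action value is
    trained on that hand's own reward feedback, rather than learning
    one value per joint (left, right) action combination."""
    q_left[a_left] += alpha * (r_left - q_left[a_left])     # left-hand prediction error
    q_right[a_right] += alpha * (r_right - q_right[a_right])  # right-hand prediction error
    return q_left, q_right
```

With N options per hand, this requires 2N values instead of N^2 joint values, which is the dimensionality saving the paper is concerned with.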
  14. 1. There seems to be good evidence that dopamine neurons are involved in transmitting information about rewards, but less evidence about how information about aversive stimuli is transmitted. The book suggests that serotonin may be involved in the case of aversive stimuli. What is the evidence for the role of serotonin in transmitting information about aversive stimuli?
      2. Although these chapters provided a good overview of research on the role of dopamine, they did not spend much time extending the research done with very simple learning tasks to more complex or realistic decision scenarios. The Balleine et al. chapter briefly discusses how the proposed goal vs. habit model they focus on might connect up with popular dual-process models, but didn't really make any strong claims. As someone who had trouble connecting the dots between the research discussed in this chapter and more complex real-world decisions, I'd be curious to hear from other people in the group about how they think we can take the ideas from these chapters and apply them to more complex types of decisions.
  15. Lehner M, Taracha E, Turzyńska D, Sobolewska A, Hamed A, Kołomańska P, Skórzewska A, Maciejak P, Szyndler J, Bidziński A, Płaźnik A. The role of the dorsomedial part of the prefrontal cortex serotonergic innervation in rat responses to the aversively conditioned context: behavioral, biochemical and immunocytochemical studies. Behav Brain Res. 2008 Oct 10;192(2):203-15. Epub 2008 Apr 16.
      Differences in animal reactivity to conditioned aversive stimuli were studied using the conditioned fear test (a contextual fear-freezing response) in rats.
      Local administration of a serotonergic neurotoxin (5,7-dihydroxytryptamine) to the dorsomedial part of the prefrontal cortex caused a very strong, structure- and neurotransmitter-selective depletion of serotonin concentration.
      The serotonergic lesion significantly disinhibited rat behavior controlled by fear.
  16. Researchers often conclude that they have found reward-related activity in the brain (e.g. in the striatum, OFC or vmPFC). But what does this actually mean? What element of a reward elicits this response? Is it the approach response, the elicitation of positive affect, the informational value of a positive outcome, or a combination of the above? Could it be that different reward-related brain regions each focus on different elements of a reward?
      R: the presentation of the CS alone activates motivational systems.
      To what extent are RPEs affected by individual differences? For instance, I wonder whether IQ could correlate with the speed at which PE signals shift back in time (from the time of US presentation to the time of CS presentation).
      R: pathological populations (Parkinson's, addicted ...)
  17. Chapter 21: 'Reference dependence' of dopamine reactions. The fact that dopamine reacts to deviations from the expected outcome supports reference-frame-dependent valuation in the dopamine system (in addition to a role as a prediction-error signal in learning). I was also intrigued by the treatment of risk in the dopamine system (deviations from the mean are normalized by the standard deviation). Are there any good tests of this idea?
      Chapter 24: I was wondering how the different valuation systems in the striatum that were proposed here would correspond to the economic terms used in decision-making tasks. Here they separated Pavlovian value (ventral striatum), habitual value (dorsolateral striatum) and goal value (dorsomedial striatum). How would these map onto anticipated utility (Pavlovian?), decision utility ('wanting'; goal value?), and experienced utility ('liking')? Or are the tasks simply too different for this kind of comparison to make sense?
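The normalization mentioned for Chapter 21 can be written as a scaled prediction error (a schematic illustration only; the gamble structure and numbers are made up, not taken from the chapter):

```python
import math

def normalized_prediction_error(reward, p, magnitude):
    """For a binary gamble paying `magnitude` with probability p (else 0),
    divide the raw prediction error by the reward standard deviation,
    as suggested for adaptive risk coding in dopamine neurons."""
    mean = p * magnitude
    sd = magnitude * math.sqrt(p * (1.0 - p))   # std of a scaled Bernoulli
    return (reward - mean) / sd

# Normalization makes the error identical for small and large gambles:
small = normalized_prediction_error(1.0, 0.5, 1.0)
large = normalized_prediction_error(10.0, 0.5, 10.0)
```

Under this scheme a win on a 50/50 gamble produces the same normalized signal whether the stakes are 1 or 10, which is the scale-invariance a divisive-normalization account would predict.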
  18. People commonly look for regions of the brain that correlate with various components of a computational model. For example, the ventral striatum reliably correlates with a prediction-error (delta) signal, and the MPFC appears to correlate with overall value signals. However, we know that the models are not a perfect fit to behavior, so it's likely they are an even worse fit to brain data. In fact, the process of fitting a model to behavior and then looking for correlates of those fits in the brain seems limiting as a way to understand how the brain makes complex computations. Thus, one methodological advance might be to fit the models directly to the brain data and skip the behavior. Whole-brain searches with different components of a model might reveal regions computationally involved in producing behavior that are nonetheless not reliable behavioral correlates. Also, looking for the neural correlates of the residual computations (i.e. aspects of the model that do not predict behavior) might be informative in uncovering these processes.
  19. What are the differences between phasic and sustained activations in risky choices? They occur within the same population of dopamine neurons, but sustained activation occurs with reward uncertainty about motivationally relevant stimuli (Fiorillo et al., 2003). Both phasic and sustained dopamine activation increase with reward magnitude, but in different ways. Sustained activation increases with the discrepancy between potential rewards, that is, with increasing reward uncertainty, whereas phasic activations increase monotonically with increasing reward probability and value (e.g., Tobler et al., 2005). This difference seems relevant to risk-taking vs. risk-averse choice situations and suggests that the two signals play different roles in risk-taking behavior. Can they be considered as having different timing and different roles in risky choices?
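The dissociation reported by Fiorillo et al. (2003) can be illustrated numerically (a schematic of the statistics involved, not their data): expected reward grows monotonically with reward probability p, while reward variance p(1 - p) is zero at p = 0 and p = 1 and maximal at p = 0.5.

```python
def reward_statistics(p, magnitude=1.0):
    """Expected value and variance of a binary reward delivered with
    probability p. Phasic responses are reported to scale with the
    expectation, sustained activation with the uncertainty."""
    ev = p * magnitude
    var = p * (1.0 - p) * magnitude ** 2
    return ev, var

stats = [reward_statistics(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
# EV rises monotonically with p; variance peaks at p = 0.5
```

This makes the proposed division of labor concrete: a signal tracking EV and a signal tracking variance carry genuinely different information about a risky option.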
  20. Dopamine and social reward
      Schultz (2009) overviews compelling evidence that midbrain dopaminergic neurons encode a reward prediction-error signal. This is accomplished via enhanced or depressed firing rates when larger or smaller rewards than expected, respectively, are obtained. Pessiglione et al. (2006) provide supporting causal evidence: prediction-error signals were enhanced following L-DOPA administration (a metabolic precursor of dopamine) but depressed following haloperidol administration (a dopamine antagonist). Behavioral learning performance was also affected as expected, with L-DOPA participants exhibiting enhanced learning and haloperidol participants exhibiting suppressed learning.
      Have experimental dopamine manipulations been used in the context of social cognition, for example, using a repeated trust game? We update our expectations of what the other player will do based on experience, and this updating will rely on a prediction-error signal; if someone is more or less trustworthy (or generous) than expected, this needs to be noted and learned. Haloperidol is hypothesized to disrupt this ability. Presumably the first player would retain a relatively static strategy, not learning as quickly to distrust (and give less money to) players that fail to reciprocate and, likewise, not learning as quickly to trust (and give more money to) players that do reciprocate. As for the second player, a failure to generate a prediction-error signal would presumably impair appreciative and punitive responses, and thus the amount returned would just be whatever amount is usually returned when he or she receives precisely what was expected given the circumstances.
  21. Yet another value: response valuation
      Balleine, Daw, and O'Doherty (2009) outline three distinct values: goal values, Pavlovian values, and habit values. Goal values represent the rewarding features of particular outcomes (e.g., obtaining sweet juice). Pavlovian values represent the likelihood of obtaining an outcome independent of any action (e.g., the probability of getting juice given an auditory tone). Finally, habit values are an encoded propensity to perform a particular response given a previously reinforced stimulus (e.g., pressing the lever has resulted in juice the past 100 times, so now just press the lever, ignoring cues).
      Are there values tracking something other than outcomes (goal values), the probability an outcome will occur without action (Pavlovian values), or habit tendencies? A candidate is what might be called response valuation (Fontaine and Dodge, 2009). Specifically, response valuation refers to an evaluation of whether a considered behavior per se is congruent with, to put it loosely, the "sociomoral character" of the decision maker - the question is whether the behavior being considered matches the kind of social actor and moral agent the person takes herself to be. Importantly, this evaluation is distinct from the evaluation of the stimulus or of the outcomes with which the considered behavior is associated. For example, perhaps Brutus unjustly insults Angel, and Angel thinks that it is perfectly acceptable to slap Brutus in the face; he deserves it, and performing the action will achieve the wanted goal of stopping his comments. Nonetheless, Angel does not think she is the type of person who goes around slapping people, and therefore thinks this would be "out of character" for her; she thus decides against slapping Brutus, sacrificing her goal value that he be punished and shut up.
      Note that response valuation is likely to be unique to human behavior, although perhaps present to some extent in other species with some degree of advanced social cognition. For the rat studies discussed in the Balleine et al. (2009) chapter, as well as the simple conditioning tasks with humans, the postulation of response valuation seems unwarranted.
      (Perhaps response valuation can just be recast as a different sort of goal value, namely, the goal of maintaining one's self-image.)
  22. A note on Partially Observable Processes, Prediction Errors and Gradient Ascent methods (Balleine et al. 2009). A state is Markovian when it retains all relevant past information (i.e. 'independence of path'). Assuming there is a horizon T, which can be either finite or infinite, the decision maker needs to take into account only the state X(t) in order to transition to X(t+1). In fact, the reinforcement-learning agent is trying to find the policy that maximizes the expected reward, conditioning on the observed states. Besides dynamic programming, TD(λ) rules are typical procedures for computing such policies.
      As Balleine et al. show, converging evidence supports the hypothesis that some form of prediction error (i.e. the core mechanism in TD(λ)) occurs in the brain. In what follows, I discuss two generalizations of this line of research, which seem coherent with the authors' arguments and which might provide a way of better understanding the relation between reinforcement learning and neuronal activity:
      1. Partially Observable Markov Decision Processes
      2. Gradient Ascent Methods
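The TD(λ) procedure referred to above can be sketched in tabular form with eligibility traces (a minimal version; the toy three-state chain and all parameter values are assumptions for the example):

```python
def td_lambda(episodes, n_states, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda): the prediction error delta updates every
    state's value in proportion to its eligibility trace, which decays
    by gamma * lambda on each step."""
    V = [0.0] * n_states
    for episode in episodes:              # episode: list of (s, r, s_next or None)
        e = [0.0] * n_states              # eligibility traces reset per episode
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V[s_next]
            delta = r + gamma * v_next - V[s]   # TD prediction error
            e[s] += 1.0                         # accumulating trace
            for i in range(n_states):
                V[i] += alpha * delta * e[i]
                e[i] *= gamma * lam
    return V

# Three-state chain 0 -> 1 -> 2, reward 1 on leaving the terminal state 2
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
V = td_lambda([episode] * 500, n_states=3)
```

With λ = 0 only the current state is updated (pure TD(0)); with λ > 0 the same prediction error also credits earlier states, which is why the learned values propagate back along the chain faster.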
