Pain Avoidance Learning
“Model-based” and “Model-free” systems
Oliver Wang
Department of Cognitive Neuroscience
Project Concepts
● Creation of a cognitive map of the space
● Learning the values of actions as well as states
● Only learning the values of each action
Model-Based System vs. Model-Free System
Project Basis
● Neural Computations Underlying Arbitration between Model-based and Model-free Learning
by Sang Wan Lee, Shinsuke Shimojo, and John P. O’Doherty (2014)
● We hypothesized that a similar arbitration method might exist in aversion learning as it does in reward learning.
Literature
Minimal research has been done on pain aversion learning
● Hendersen and Graham (Avoidance of Heat by Rats, 1979)
● Prevost and O’Doherty (Pavlovian Aversive Learning, 2013)
● Gillan and Robbins (Enhanced Avoidance Habits, 2014)
Task Design: Two-layer Markov Decision Task
● A training session followed by 2 experimental sessions, each with 48 blocks of about 5 trials on average
● Two sequential choices (L/R) lead to a final state
● Subsequent states are determined by the choice and the transition probabilities of each branch at that time (sketched after this list)
● 4 block conditions:
o Flexible, high uncertainty
o Flexible, low uncertainty
o Specific, high uncertainty
o Specific, low uncertainty
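To make the task structure concrete, below is a minimal sketch of the two-layer decision task in Python. The state names, successor pairs, and shock labels are illustrative assumptions for this sketch, not the actual experimental stimuli; only the two-choice structure and the two transition-probability settings follow the description above.

```python
import random

# Transition probabilities for a branch depend on the block's uncertainty level
HIGH_UNCERTAINTY = (0.5, 0.5)   # either successor state is equally likely
LOW_UNCERTAINTY = (0.9, 0.1)    # one successor state is strongly favored

# Illustrative two-layer structure: start state -> intermediate -> final state.
# Each (state, choice) pair maps to two possible successor states.
TRANSITIONS = {
    ("start", "L"): ("mid_1", "mid_2"),
    ("start", "R"): ("mid_2", "mid_3"),
    ("mid_1", "L"): ("final_0_shocks", "final_1_shock"),
    ("mid_1", "R"): ("final_1_shock", "final_2_shocks"),
    ("mid_2", "L"): ("final_1_shock", "final_3_shocks"),
    ("mid_2", "R"): ("final_2_shocks", "final_4_shocks"),
    ("mid_3", "L"): ("final_2_shocks", "final_3_shocks"),
    ("mid_3", "R"): ("final_3_shocks", "final_4_shocks"),
}

def step(state, choice, probs):
    """Sample the next state given a choice and the block's transition probabilities."""
    successors = TRANSITIONS[(state, choice)]
    return random.choices(successors, weights=probs, k=1)[0]

# One trial in a high-uncertainty block: two sequential L/R choices to a final state.
mid = step("start", "R", HIGH_UNCERTAINTY)
final = step(mid, "R", HIGH_UNCERTAINTY)
print(mid, "->", final)
```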
Block Conditions
Flexible
Final state values are fixed and you receive the number of shocks indicated
Encourages a Model-Free strategy
Specific
Bin color must match the final state color to receive the number of shocks indicated.
Otherwise you receive 4 shocks, the maximum.
Encourages a Model-Based strategy
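As a small illustration of the outcome rule just described, here is a hedged sketch in Python; the function and argument names are hypothetical, but the logic follows the flexible/specific rules above.

```python
MAX_SHOCKS = 4  # maximum number of shocks, delivered on a color mismatch

def shocks_received(condition, bin_color, final_state_color, final_state_shocks):
    """Return the number of shocks delivered at the end of a trial."""
    if condition == "flexible":
        # Flexible (gray bin): every final state is accepted as-is.
        return final_state_shocks
    # Specific (colored bin): the final state must match the bin color.
    if final_state_color == bin_color:
        return final_state_shocks
    return MAX_SHOCKS

# Examples: a specific-condition mismatch vs. a flexible-condition trial.
print(shocks_received("specific", "blue", "red", 0))   # -> 4
print(shocks_received("flexible", "gray", "red", 0))   # -> 0
```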
Example of 1 trial: flexible condition, high uncertainty
Probability: Uncertainty
High uncertainty vs. Low uncertainty
1. High uncertainty refers to a (.5, .5) chance between the 2 resulting states.
2. Low uncertainty refers to a (.9, .1) chance, so one state (the left state in the diagram to the right) is much more likely.
*Uncertainty is maintained throughout each
block
Behavioral Results
Participants: 16 subjects
Behavioral Results (cont.)
Observation
● Significantly higher proportion of optimal choices observed in the flexible, low uncertainty condition
Conclusion
● Some difference in the arbitrator
must exist
Model-free Simulation
16 Simulated Subjects
Alpha = .03, Beta = 1
MF Learning is able to replicate only the results of the flexible condition
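For reference, a minimal sketch of the kind of model-free learner the simulation assumes, using the stated alpha = 0.03 and a softmax choice rule with beta = 1. This is a schematic delta-rule learner, not the exact simulation code; the state and action names are placeholders.

```python
import math
import random

ALPHA = 0.03  # learning rate used in the simulation
BETA = 1.0    # softmax inverse temperature

class ModelFreeLearner:
    """Model-free learner: one value per (state, action), ignorant of transitions."""

    def __init__(self, actions=("L", "R")):
        self.actions = actions
        self.q = {}  # (state, action) -> learned value (here: negative shock count)

    def choose(self, state):
        # Softmax over action values; a higher value means fewer expected shocks.
        values = [self.q.get((state, a), 0.0) for a in self.actions]
        weights = [math.exp(BETA * v) for v in values]
        return random.choices(self.actions, weights=weights, k=1)[0]

    def update(self, state, action, outcome_value):
        # Delta-rule update toward the observed outcome value.
        key = (state, action)
        old = self.q.get(key, 0.0)
        self.q[key] = old + ALPHA * (outcome_value - old)

# Example trial: shocks enter as negative values so fewer shocks are preferred.
learner = ModelFreeLearner()
a = learner.choose("start")
learner.update("start", a, outcome_value=-2)  # received 2 shocks
```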
Subject Choices: MB/MF
Left
● Refers to the flexible
condition
● MB system does not adequately predict choices
Right
● Refers to the specific
condition
● MB system adequately predicts choices
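The comparison on this slide rests on the log likelihood of participants' observed choices under each model's predicted choice probabilities. Below is a hypothetical sketch of that computation, assuming each model supplies action values that enter a softmax; it is not the actual fitting code.

```python
import math

def softmax_prob(q_chosen, q_other, beta=1.0):
    """Probability of the chosen action under a softmax over two action values."""
    return 1.0 / (1.0 + math.exp(-beta * (q_chosen - q_other)))

def log_likelihood(trials, beta=1.0):
    """Sum of log choice probabilities over trials.

    Each trial is (value of the chosen action, value of the unchosen action)
    under the model being evaluated (MB, MF, or an arbitrated mixture).
    """
    return sum(math.log(softmax_prob(qc, qo, beta)) for qc, qo in trials)

# Toy comparison on made-up values: the model whose values better separate
# chosen from unchosen actions earns the higher (less negative) log likelihood.
mb_trials = [(-1.0, -3.0), (-0.5, -2.5)]
mf_trials = [(-1.0, -1.2), (-0.5, -0.6)]
print(log_likelihood(mb_trials), log_likelihood(mf_trials))
```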
Parameters
Observation
● Parameter for “learning rate for the estimate of absolute reward prediction error” is much
greater in our pain aversion task
Interpretation
● Suggests a more dynamic arbitration system exists
*Pain Aversion Parameters (left bar) and Reward Based Parameters (right bar) (Lee)
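To illustrate what this parameter controls, here is a schematic sketch of how a learning rate over the absolute prediction error can drive the running estimate that the arbitrator uses to judge model-free reliability. The exponential-average form shown is a simplification, not the exact formulation in Lee et al.

```python
def update_abs_pe_estimate(estimate, prediction_error, eta):
    """Move the running estimate of |prediction error| toward the latest |PE|.

    eta is the 'learning rate for the estimate of absolute reward prediction
    error'. A larger eta makes the estimate (and hence the arbitrator's view
    of model-free reliability) track recent surprises more quickly.
    """
    return estimate + eta * (abs(prediction_error) - estimate)

# With a large eta (as fitted in the pain task), the estimate reacts sharply
# to a single surprising outcome; with a small eta it changes slowly.
est_fast = update_abs_pe_estimate(0.2, prediction_error=2.0, eta=0.8)
est_slow = update_abs_pe_estimate(0.2, prediction_error=2.0, eta=0.1)
print(est_fast, est_slow)  # 1.64 vs 0.38
```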
Conclusion
1. Both model-free and model-based systems exist in aversion learning.
2. Although the arbitration process between the two systems shares many similarities with reward-based learning, there exist subtle differences between the two.
Next Steps
1. fMRI 2. Modeling
Thank you very much.
Thank you for listening.


Editor's Notes

  • #2 Good afternoon. My name is Oliver Wang and I am an intern in the Department of Cognitive Neuroscience. I have been working with Dr. Yoshida from the department and Dr. Ben Seymour from CiNET on the pain avoidance learning task that I will soon be presenting on. But first, a little bit about myself. I am currently an undergraduate at Stanford University in the United States, studying biology and anthropology. In the upcoming term, I will be a Junior, or third year. ATR is the first official internship that I’ve really participated in, and I am excited to share what I have been working on for the past 2 months since I started here.
  • #3 So first, a little bit of background on my project’s core concepts. The basis and inspiration was an experiment done just earlier this year by Sang Wan Lee’s team at Caltech, which described a new model for reward-based learning that involved both model-based and model-free systems working together in one model. But before we get into their hybrid model, let’s first look at each system individually. First off, the model-based system describes a system where we create a cognitive map of the space, how one state leads to the next, and understand details about those transitions as well as the rewards at each step. In many ways it is like planning ahead, knowing what each step is worth and knowing the reward you expect before you get there. The model-free system, however, is much simpler: choices are made simply by selecting the actions with the highest expected rewards.
  • #4 In this simple example, you can see that in the model-based system, the mouse has learned its path to the cheese, but has also accumulated knowledge of the other paths, and knows that if it takes a certain colored path it will have a certain chance of reaching either a dead end, the cat, or the cheese. In the model-free system on the right, the mouse has also learned how to get to the cheese, but only in the sense that at the first intersection turning left has a better chance of reward than turning right, and that process repeats at each intersection. The mouse in this example would not even know that a cat was in the maze, just that the optimal path is a series of actions that ultimately looks like the one I’ve highlighted in red. So an obvious downside of the model-free system is that if there is a change in the setup or goal, the subject must relearn the optimal choices throughout the entire maze. And although the model-based system can respond to changes much better, it is difficult to maintain, and memory errors while learning can produce an incorrect map and often lead to very costly errors.
  • #5 Not only did Lee’s team suggest that both of those models exist, but also that there is an arbitration mechanism that allocates a degree of control to each. For example, a high prediction error in one model, say the red one, would indicate that that system is not very sure of itself and therefore not very reliable, and thus would shift more control over to the other system, the blue model. This constant weighing and re-weighing of reliabilities allows the more reliable system to dominate decision making and make the best, optimal choice. I hypothesized that a similar arbitration mechanism might exist for pain aversion, one based on both model-based and model-free systems rather than the traditional stand-alone model-free system, and we tested this by mirroring the approach and task used in Lee’s paper.
  • #6 But first, the reason we are looking into pain avoidance is that, unlike reward-based learning, pain has received very little attention from the research community. Because pain is unlike rewards in that it is difficult to devalue or make less unpleasant, we suspect that a different system must be in play here, which is why we believe a more complex model, like the one suggested by Lee’s team, may be needed. I’ve listed here some of the studies that have begun to uncover the model-based side of aversion learning but are somewhat incomplete. Hendersen and Graham’s study, over 30 years ago, suggested that rats could revalue heat when moved from extremely hot environments to cold ones - which first suggested there might be two systems. Then Prevost and O’Doherty found evidence for both systems using an aversive-tasting juice in a Pavlovian aversive learning task - but subjects didn’t have to make any decisions in that task. Most recently, Gillan and Robbins have done a number of experiments on shock avoidance (which my project also uses) that have begun to explore these two systems - but haven’t adequately distinguished between the two competing systems. And so we hoped that our task would show more clearly that such systems do exist.
  • #7 In order to acquire data comparable to that produced by Lee’s team, the task remained entirely unchanged except for the outcome states, which were changed from monetary rewards to a number of electric shocks. To deliver those shocks, before each experiment we calibrated a shock machine which would deliver a current through an electrode attached to the subject’s hand. As you will see soon, the task is quite complicated, so each participant had a training session to learn the task, which was then followed by the 2 experimental sessions, each with 48 blocks, each with an average of 5 trials. The task involved making 2 sequential choices that would get you to a final state worth a certain number of shocks. The final state you arrived at was determined by your choices and the block condition. I’ll talk more about what these blocks are in the next slide, but the blocks were randomly assigned and mixed, and there were no breaks between any trials or blocks.
  • #8 The 4 conditions for these blocks are split into 2 categories, flexible and specific, and then further by high and low transition probabilities, although I won’t talk much about the latter. The flexible goal condition is distinguished by a gray bin and will accept all final states as they are, so you receive the number of shocks indicated, from 0 to 4. Because the flexible condition maintains constant final state values, the model-free strategy is encouraged, since prediction errors will generally be lower - this is like the mouse that only knows its way to the cheese, or the best option. The specific condition, however, which is distinguished by a colored bin, will only accept final states whose color matches its bin color. If they match, the subject receives the number of shocks associated with that color. If they don’t match, then the subject receives 4 shocks, the maximum. For a similar reason as before, the specific condition favors the model-based strategy because the final goal values constantly change and more knowledge of the entire task is needed to make the colors match.
  • #9 Uncertainty then refers to the state-to-state transition probabilities. Only 2 distributions were used, and these were unknown to the subject. This fluctuation in uncertainty was meant to elicit varying prediction errors, again to see shifts between the two models. Although our analysis focuses more on the differences created by the flexible and specific conditions that I talked about on the previous slide, that is not to say that these differing uncertainties are not significant as well. Now, to the right here, you’ll see an example of one trial. The subject starts in the purple state and decides to go right first; a 50/50 transition probability then brings them to the pink state, where the subject is prompted to make another choice. In this case they chose to go right again, which was once again subject to the 50/50 transition probability. This was followed by the final state, which happens to be the red state, worth 0 shocks. And because the bin is gray (a flexible condition block), the subject receives 0 shocks. However, if the bin were, say, blue, then they would have received 4 shocks because the colors do not match. Overall, the details of the task are rather complicated, but what’s most important is that by having these different kinds of blocks, we are encouraging different model strategies.
  • #10 So we conducted the experiment over the course of about 3 weeks and were able to collect behavioral data for 16 subjects. Here is a simple graph of the results - hopefully you can read it - as compared to those produced by Lee’s reward learning task. The thin red lines are the values produced in Lee’s original task. I’ve scaled it to Lee’s original scale so we can compare the results more easily. The side-by-side bars refer to different uncertainties, while the grouped bars refer to the specific condition on the left and the flexible condition on the right. As you can see, the leftmost graph, of the mean reward per trial, and the center graph, of the hit rate, show results very similar to Lee’s, which may suggest that an arbitrator also exists for our pain task. But the third graph, of the proportion of optimal choices, begins to show significant differences, especially in the flexible, low uncertainty condition (the dark blue bar on the far right).
  • #11 Here is a close-up of that third graph of the proportion of optimal choices. Because our data produced a much higher proportion of optimal choices than Lee’s, it suggests that aversion learning’s arbitrator differs somehow from reward learning’s, although we can’t quite say how with just this data. But because of the similarities we found in the previous two graphs, we can say that some arbitrator must exist that differs somewhat from the one that exists for reward-based learning.
  • #12 Then, in order to test the second part of my hypothesis, that aversion learning does not in fact follow the traditionally assumed model-free-only system, I ran a simulation of the task that used only model-free learning, and the results are displayed here. They are the same graphs as on the previous slide, except that now the red lines refer to the values that came from our pain aversion experiment. We predicted that the simulation would replicate the effects of the flexible condition well, because as I said earlier, the flexible condition encourages use of the model-free system, but would not do so well with the specific condition. And this is indeed the case. The model-free simulation was able to produce results that closely mirrored the experimental results for the flexible condition (the right grouped bars in each graph), but not for the specific condition. This strongly suggests that other systems or factors are in play, like the one suggested by Lee, and that a model-free system alone is not adequate to explain aversion learning.
  • #13 To further confirm our hypothesis that both systems do exist, we ran a log-likelihood test comparing the degree to which model-based versus model-free reinforcement learning best accounts for participants’ choices. The left graph refers to the flexible condition, or situations where the arbitrator determined that the model-based system was not favored (below 50 percent), and as you can see by the negative result, the model-based system does not accurately predict participants’ choices in these flexible condition trials. However, on the right-hand side, as predicted, during specific condition trials the model-based system does predict participants’ choices quite well. In other words, the likelihood gets better when we combine the two models according to their respective reliabilities.
  • #14 As the final part of my analysis, with the help of Lee’s team, we compared the fit of our data with the parameters produced in his original study. Of the 6 parameters, the second from the left was the only one for which ours showed a significantly greater value. This parameter is defined as the learning rate for the estimate of absolute reward prediction error. What that means is that even small reward prediction errors (which are associated with model-free learning) will elicit greater changes in the estimate of the reward prediction error. Thus, because this estimate is highly responsive to changes, the model-free reliability is updated more rapidly. In essence, relative to reward learning, pain learning is more dynamic and switches more quickly between the two systems due to more frequent reliability updating. The interpretation may be that subjects are more responsive to changes in the task than they are in reward learning.
  • #15 In conclusion, our data support the idea that, first, both a model-free and a model-based system are needed to accurately model aversion learning, as is also the case in reward-based learning. Second, however, there seem to be subtle but distinct differences between how avoidance learning’s and reward learning’s arbitrators function. How exactly those differences affect aversion learning, we hope to find out soon as we continue with the project.
  • #16 Speaking of our next steps, now that we have been able to find behavioral differences between pain aversion learning and reward-based learning, we plan to look at how brain activity reflects these observed differences and to look more in depth into what causes them using fMRI. We also hope to refine our model and eventually create an accurate model for pain aversion, as Lee did with reward-based learning. Unfortunately I don’t have enough time here in Japan to see these steps through, but I’m glad to pass the reins on to the next intern. Regardless, I have learned a lot from this internship and I am glad to have spent my summer here at ATR.
  • #17 Thank you for listening.