i2b2 2008


  1. Learning High Precision Rules to Make Predictions of Morbidities in Discharge Summaries
     • Ted Pedersen
     • Department of Computer Science
     • University of Minnesota, Duluth
     • http://www.d.umn.edu/~tpederse

  2. i2b2 Obesity Challenge (2008)
     • Predict whether a patient suffers from obesity and from each of 15 other co-morbidities, based on:
       - Content of the discharge summary (textual task)
       - Expert judgment and content of the discharge summary (intuitive task)
     • Supervised learning
       - 730 manually annotated training examples
         - 730 patients * 16 diseases = max 11,680 annotations
       - 507 held-out test records

  3. Training Examples
     • Class labels: Y = present, N = absent, Q = questionable, U = unmentioned
     • Intuitive task
       - N: 7,362 examples (69.1%)
       - Y: 3,267 examples (30.7%)
       - Q: 26 examples (0.2%)
       - Total of 10,655 annotations
     • Textual task
       - U: 8,296 examples (71.3%)
       - Y: 3,208 examples (27.6%)
       - N: 87 examples (0.7%)
       - Q: 39 examples (0.3%)
       - Total of 11,630 annotations

  4. Early Working Title
     • "Is it Possible to Perform Accurate Supervised Learning of Highly Skewed Data with Minority Classes That Have Distributional Characteristics Significantly Less Prevalent Than Commonly Observed Values of Standard Deviations in Cross Validation Studies Done Across a Wide Range of Machine Learning Datasets?"

  5. No. (Abstract)

  6. Supervised Learning is Noisy
     • Noise is visible in cross-validation results
       - Divide the data into 10 blocks, train on 9, test on 1
       - Repeat until all 10 blocks have been tested; average the results and find the standard deviation
       - A standard deviation of 2-4% is common
       - That held for this data in preliminary cross-validation studies of the training data (see the sketch below)
     • If a class makes up a smaller percentage of the data than your cross-validation standard deviation, it is effectively invisible to the learner
       - Very small minority classes are lost in the noise

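The fold-by-fold procedure above can be reproduced with Weka's Java API (the toolkit named later in the deck). This is a minimal sketch, not the original setup: the ARFF file name, the use of JRip as the classifier at this point, and the class attribute being last are all assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FoldNoise {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file holding one morbidity's training annotations.
        Instances data = new DataSource("asthma-train.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class assumed to be the last attribute

        int folds = 10;
        Random rand = new Random(1);
        data.randomize(rand);

        double[] acc = new double[folds];
        for (int i = 0; i < folds; i++) {
            Instances train = data.trainCV(folds, i, rand);  // 9 blocks for training
            Instances test  = data.testCV(folds, i);         // 1 block held out
            JRip jrip = new JRip();
            jrip.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(jrip, test);
            acc[i] = eval.pctCorrect();
        }

        // Mean and standard deviation across folds: a class whose share of the
        // data (as a percentage) is smaller than this deviation is lost in the noise.
        double mean = 0, var = 0;
        for (double a : acc) mean += a / folds;
        for (double a : acc) var  += (a - mean) * (a - mean) / (folds - 1);
        System.out.printf("accuracy %.1f%% +/- %.1f%%%n", mean, Math.sqrt(var));
    }
}
```
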
  7. What Now?
     • Very unlikely to learn these very small minority classes
     • Very unlikely that attempting to learn them will produce significantly different models
     • Some possibility that the test data won't be representative of the training data
     • ...

  8. ...Remove Them...
     • Reduces to a 2-class problem and avoids the (slight) possibility of the machine learning algorithm being misled by the minority classes
     • Intuitive task
       - N: 7,362 examples (69.3%)
       - Y: 3,267 examples (30.7%)
       - Total of 10,629 annotations
     • Textual task
       - U: 8,296 examples (72.1%)
       - Y: 3,208 examples (27.9%)
       - Total of 11,504 annotations

  9. Focus on High Precision
     • Get the two dominant classes right
     • Concede that the minority classes are unlikely to be classified correctly
     • Be conservative: what if the training data is not representative?
     • Downside: puts a hard ceiling on the macro-averaged F-score (one we probably would have hit as a practical matter anyway), since each conceded class contributes an F-score of 0 to the macro average
       - Max macro F-score for the intuitive task (2 of 3 classes) = 67%
       - Max macro F-score for the textual task (2 of 4 classes) = 50%

  10. Cross Validation Studies
     • Carried out a number of CV studies using the training data
       - Divide the data into 10 blocks, train on 9 and test on 1, repeating until all blocks have been used for evaluation
       - Selected features and machine learning methods based on these results (plus some intuitions about which method will best deal with unseen test data)

  11. Features
     • Explored various n-gram features that occur more than once in the training data (unigrams, bigrams, trigrams and 4-grams); see the sketch below
       - Converted all text to lower case; discarded single-character and numeric tokens
       - Features may occur anywhere in the training record / discharge summary
       - Removed 200 common stop words
     • Unigrams were most accurate in the CV studies
       - Approximately 9,000 features (varying slightly with morbidity)

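A minimal plain-Java sketch of the unigram extraction the slide describes. The tokenization regex and the tiny stand-in stoplist are assumptions; the original work used the Ngram Statistics Package and a 200-word stoplist that is not reproduced here.

```java
import java.util.*;

public class UnigramFeatures {
    // Stand-in for the 200 common stop words removed in the original experiments.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "with"));

    public static Set<String> extract(List<String> summaries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String summary : summaries) {
            // Lower-case the text and split on anything that is not a letter or digit.
            for (String token : summary.toLowerCase().split("[^a-z0-9]+")) {
                if (token.length() < 2) continue;         // discard single characters
                if (token.matches("\\d+")) continue;      // discard numeric tokens
                if (STOPWORDS.contains(token)) continue;  // discard stop words
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Keep unigrams that occur more than once anywhere in the training data.
        Set<String> features = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > 1) features.add(e.getKey());
        return features;
    }
}
```
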
  12. Machine Learning Methods
     • Experimented with different algorithms (from Weka) during the CV studies:
       - Rule learners
         - RIPPER (JRip)
         - C4.5 (J48)
       - Model fitters
         - Naïve Bayes
         - Support Vector Machine (SMO)
         - Voted Perceptron
       - Meta learner
         - AdaBoost

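The algorithms listed above map directly onto Weka classes. A small sketch instantiating them, assuming default parameters since the slides do not give any settings:

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.VotedPerceptron;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;

public class Candidates {
    // One instance of each algorithm named on the slide; any of these can be
    // dropped into the cross-validation loop from the earlier sketch.
    static Classifier[] all() {
        return new Classifier[] {
                new JRip(),            // RIPPER rule learner
                new J48(),             // C4.5 decision tree
                new NaiveBayes(),
                new SMO(),             // support vector machine
                new VotedPerceptron(),
                new AdaBoostM1()       // meta learner (boosts a base classifier)
        };
    }
}
```
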
  13. Rule Learners Rule
     • The most accurate methods in the CV studies (by far) were JRip and J48, with minor improvements observed from AdaBoost
     • Both are greedy algorithms, susceptible to falling into local minima that don't lead to the overall best result
       - Offset that possibility with AdaBoost (see the sketch below)
         - Re-applies the learning algorithm iteratively based on the "hardness" of the training data
         - Did not expect it to improve results; it was just a defense against unpredictable test data

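A sketch of wrapping JRip in AdaBoost with Weka, as described above. The file name, the class attribute being last, and the default AdaBoostM1 settings are assumptions; the slides do not specify them.

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostedJRip {
    public static void main(String[] args) throws Exception {
        // Hypothetical per-morbidity training file; class attribute assumed to be last.
        Instances train = new DataSource("asthma-train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        // Wrap JRip in AdaBoostM1: the rule learner is re-applied to reweighted
        // copies of the training data that emphasize previously misclassified examples.
        AdaBoostM1 boosted = new AdaBoostM1();
        boosted.setClassifier(new JRip());
        boosted.buildClassifier(train);

        System.out.println(boosted);   // prints the boosted rule sets
    }
}
```
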
  14. Why Rule Learners Ruled?
     • JRip learns bottom-up: it finds a rule that covers one class of training examples, then another, and does not consider all training examples at once
     • J48 learns top-down: it finds a feature that cleanly divides the training examples into classes, then another
     • The unigram features may have been too noisy for the model fitters, or there may have been too few training examples

  15. Entries to the Challenge (Best CV Results)
     • JRip
     • JRip with AdaBoost
     • J48 with AdaBoost

  16. Intuitive Task Performance
     • JRip and JRip-AdaBoost
       - Precision: .93 (micro), .95 (macro)
       - Recall:    .93 (micro), .60 (macro)
       - F-Score:   .93 (micro), .61 (macro)
     • J48-AdaBoost
       - Precision: .92 (micro), .95 (macro)
       - Recall:    .92 (micro), .60 (macro)
       - F-Score:   .92 (micro), .60 (macro)
     • Mean
       - Precision: .91 (micro), .78 (macro)
       - Recall:    .90 (micro), .60 (macro)
       - F-Score:   .90 (micro), .60 (macro)

  17. Textual Task Performance
     • JRip, JRip-AdaBoost, and J48
       - Precision: .93 (micro), .96 (macro)
       - Recall:    .93 (micro), .45 (macro)
       - F-Score:   .93 (micro), .46 (macro)
     • Mean
       - Precision: .91 (micro), .75 (macro)
       - Recall:    .91 (micro), .56 (macro)
       - F-Score:   .91 (micro), .56 (macro)

  18. Post Challenge
     • Re-ran the experiments using all of the training data
     • Results were essentially the same
     • The machine learning algorithms removed the noise (as they should have)
     • The test data had distributional characteristics very similar to the training data (so our defensive strategy was not strictly necessary)

  19. JRip Rules for Asthma (intuitive)
     • (asthma = 1) => class=Y (70/6)
     • (lente = 1) => class=Y (3/1)
     • => class=N (471/2)
       - (X/Y) means there are X training examples covered by this rule, of which all but Y were correctly classified

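For illustration only: a JRip rule list like the one above behaves as an ordered if/else chain over binary unigram features, with the last rule supplying the default class. A hypothetical hand translation:

```java
public class AsthmaIntuitiveRules {
    // Rules fire in order; the final rule is the default class.
    static String classify(boolean asthma, boolean lente) {
        if (asthma) return "Y";   // (asthma = 1) => class=Y (70/6)
        if (lente)  return "Y";   // (lente = 1)  => class=Y (3/1)
        return "N";               // default rule => class=N (471/2)
    }
}
```
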
  20. JRip Rules for Asthma (textual)
     • (asthma = 1) => class=Y (97/4)
     • => class=U (626/0)

  21. JRip Rules for Depression (intuitive)
     • (depression = 1) and (minutes = 0) => class=Y (79/8)
     • (prozac = 1) => class=Y (7/0)
     • (celexa = 1) => class=Y (8/0)
     • (paxil = 1) => class=Y (7/1)
     • (zoloft = 1) => class=Y (10/1)
     • (depression = 1) and (order = 1) => class=Y (7/2)
     • (wellbutrin = 1) => class=Y (2/0)
     • (citalopram = 1) => class=Y (2/0)
     • => class=N (434/3)

  22. JRip Rules for Depression (textual)
     • (depression = 1) => class=Y (134/31)
     • => class=U (594/1)

  23. JRip Rules for Gallstones (intuitive)
     • (cholecystectomy = 1) => class=Y (71/0)
     • (gallstones = 1) => class=Y (18/3)
     • (chole = 1) => class=Y (6/1)
     • (gallbladder = 1) => class=Y (9/4)
     • => class=N (606/5)

  24. JRip Rules for Gallstones (textual)
     • (cholecystectomy = 1) => class=Y (77/1)
     • (gallstones = 1) => class=Y (17/1)
     • (chole = 1) => class=Y (6/1)
     • (codeine = 1) and (monitored = 1) and (physical = 1) => class=Y (5/0)
     • (card = 1) and (obese = 1) => class=Y (2/0)
     • => class=U (617/5)

  25. JRip Rules for Obesity (intuitive)
     • (obese = 1) => class=Y (162/1)
     • (obesity = 1) => class=Y (50/0)
     • (incisions = 1) and (positive = 0) and (service = 1) => class=Y (9/2)
     • => class=N (309/5)

  26. JRip Rules for Obesity (textual)
     • (obese = 1) => class=Y (222/0)
     • (obesity = 1) => class=Y (70/0)
     • => class=U (430/6)

  27. Conclusions
     • Rule learners performed with high precision in the i2b2 Obesity Challenge
     • The rules learned were relatively simple, but descriptive
     • There was no advantage or disadvantage to removing the minority classes
       - We got them wrong no matter what
     • The model fitters struggled; the combination of a large feature space and a smaller number of instances may have introduced a lot of noise

  28. Software Used
     • WSD-Shell (Duluth)
       - Also used in Senseval (http://senseval.org/)
       - http://www.d.umn.edu/~tpederse/wsdshell.html
     • Ngram Statistics Package (Duluth)
       - http://ngram.sourceforge.net
     • Weka (Waikato)
       - http://www.cs.waikato.ac.nz/ml/weka/
