I2 B2 2006 Pedersen

Transcript

  • 1. November 10, 2006, I2B2 - Smoker Status Challenge
       Determining Smoker Status using Supervised and Unsupervised Learning with Lexical Features
       Ted Pedersen, University of Minnesota, Duluth
       tpederse@d.umn.edu
       http://www.d.umn.edu/~tpederse
  • 2. Approaches
       • Smoking Status as Text Classification
           – supervised learning
           – lexical features
           – techniques used to good effect in word sense disambiguation
       • Smoking Status as Text Clustering
           – unsupervised learning
           – lexical features
           – techniques used to good effect in word sense discrimination
  • 3. Objectives
       • How well do WSD techniques generalize to related but different problems?
           – smoking status as "meaning" of record?
           – not quite the same problem…
       • How well do WSD features generalize?
           – bag of words, unigrams
           – bigrams
           – collocations
       • How well do learning algorithms generalize?
           – supervised and unsupervised
  • 4. Experimental Variations: Supervised Learning
       • Learning Algorithm
           – naïve Bayesian classifier
           – J48 decision tree
           – support vector machine (SMO)
       • Feature Sets (also used in unsupervised)
           – unigrams, bigrams, trigrams
           – various frequency and measure of association cutoffs
           – Stop List of 472 words
               • 392 function words
               • 80 words that occurred in more than half the records
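The feature-set choices above (a stop list that includes words found in more than half the records, plus a frequency cutoff on unigrams) can be sketched as follows. This is a minimal illustration, not the SenseTools implementation: whitespace tokenization, the toy records, and the function names `frequent_in_records`, `unigram_features`, and `binary_vector` are all assumptions made for the example.

```python
from collections import Counter

def frequent_in_records(records, fraction=0.5):
    """Words that appear in more than `fraction` of the records."""
    doc_freq = Counter()
    for text in records:
        doc_freq.update(set(text.lower().split()))  # count each word once per record
    return {w for w, n in doc_freq.items() if n > fraction * len(records)}

def unigram_features(records, stop_words, min_count=5):
    """Sorted unigrams that survive the stop list and occur at least min_count times."""
    counts = Counter()
    for text in records:
        counts.update(t for t in text.lower().split() if t not in stop_words)
    return sorted(w for w, n in counts.items() if n >= min_count)

def binary_vector(text, vocabulary):
    """1 if the feature word occurs in the record, else 0."""
    present = set(text.lower().split())
    return [1 if w in present else 0 for w in vocabulary]

# Toy records (hypothetical, not challenge data).
records = [
    "the patient quit smoking",
    "the patient denies smoking",
    "the patient quit tobacco",
    "no tobacco use",
]
# A hand-picked function-word list stands in for the 392 function words.
stop_words = {"the", "no"} | frequent_in_records(records)
vocabulary = unigram_features(records, stop_words, min_count=2)
vector = binary_vector("the patient quit tobacco", vocabulary)
print(vocabulary, vector)
```

The cutoff of 2 is used only because the toy corpus is tiny; the slides report a cutoff of 5 for the best J48 run.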
  • 5. Decision Tree
       • J48 most accurate when using unigram features that occurred 5 or more times in the training data
           – over 3,600 unigrams as candidate features
           – decision tree has 47 nodes and 24 leaves
           – accuracy of 82% (327/398)
  • 6. Decision Tree (unigrams occurring 5 or more times)
       [Figure: decision tree learned from unigrams occurring 5 or more times]
  • 7. 82% accuracy (327/398), 10-fold cross validation on training data
          a    b    c    d    e   <-- classified as
         20    5    1    7    3 |  a = PAST-SMOKER
          8   46    3    8    1 |  b = NON-SMOKER
          8    2  240    2    0 |  c = UNKNOWN
          7    5    1   21    1 |  d = CURR-SMOKER
          1    3    1    4    0 |  e = SMOKER
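The accuracy figure can be checked directly from the confusion matrix: correct decisions lie on the diagonal, and each row sums to the number of records with that true class. A small sketch, with the matrix copied from the slide and helper names (`accuracy`, `recall_per_class`) chosen for illustration:

```python
labels = ["PAST-SMOKER", "NON-SMOKER", "UNKNOWN", "CURR-SMOKER", "SMOKER"]
# Rows are true classes, columns are predicted classes (Weka layout).
matrix = [
    [20, 5, 1, 7, 3],
    [8, 46, 3, 8, 1],
    [8, 2, 240, 2, 0],
    [7, 5, 1, 21, 1],
    [1, 3, 1, 4, 0],
]

def accuracy(m):
    """Return (correct, total, correct / total) for a square confusion matrix."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct, total, correct / total

def recall_per_class(m, names):
    """Fraction of each true class that was labeled correctly."""
    return {names[i]: m[i][i] / sum(m[i]) for i in range(len(m))}

correct, total, acc = accuracy(matrix)
recall = recall_per_class(matrix, labels)
print(f"{correct}/{total} = {acc:.1%}")
print({name: round(r, 2) for name, r in recall.items()})
```

Per-class recall makes the imbalance visible: UNKNOWN is recovered almost perfectly, while none of the 9 SMOKER records is labeled correctly.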
  • 8. Manual Inspection
       • From the decision tree learned from the more than 3,600 unigram features, we selected the following nine features for a second experiment:
           – cigarette, drinks, quit, smoke, smoked, smoker, smokes, smoking, tobacco
  • 9. 9-feature Decision Tree, selected from unigram tree
       quit = 0
       |   smoking = 0
       |   |   smoker = 0
       |   |   |   tobacco = 0
       |   |   |   |   smoke = 0
       |   |   |   |   |   drinks = 0
       |   |   |   |   |   |   cigarette = 0
       |   |   |   |   |   |   |   smoked = 0: UNKNOWN (253.0/3.0)
       |   |   |   |   |   |   |   smoked = 1: PAST-SMOKER (2.0/1.0)
       |   |   |   |   |   |   cigarette = 1: NON-SMOKER (3.0/1.0)
       |   |   |   |   |   drinks = 1: NON-SMOKER (6.0/3.0)
       |   |   |   |   smoke = 1: NON-SMOKER (16.0)
       |   |   |   tobacco = 1
       |   |   |   |   smokes = 0: NON-SMOKER (39.0/7.0)
       |   |   |   |   smokes = 1: CURRENT-SMOKER (2.0)
       |   |   smoker = 1: CURRENT-SMOKER (11.0/5.0)
       |   smoking = 1: CURRENT-SMOKER (42.0/22.0)
       quit = 1: PAST-SMOKER (24.0/4.0)
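Read as a rule list, the nine-feature tree checks one word at a time and stops at the first word that is present. The sketch below transcribes the tree into a Python function; the regular-expression tokenizer in `features_from_text` is an assumption for illustration and may differ from how the original features were extracted.

```python
import re

FEATURES = ["cigarette", "drinks", "quit", "smoke", "smoked",
            "smoker", "smokes", "smoking", "tobacco"]

def features_from_text(text):
    """Binary presence of each feature word (tokenization is an assumption)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {w: int(w in tokens) for w in FEATURES}

def classify(f):
    """Follow the 9-feature tree from the root; f maps feature word -> 0/1."""
    if f["quit"]:
        return "PAST-SMOKER"
    if f["smoking"] or f["smoker"]:
        return "CURRENT-SMOKER"
    if f["tobacco"]:
        return "CURRENT-SMOKER" if f["smokes"] else "NON-SMOKER"
    if f["smoke"] or f["drinks"] or f["cigarette"]:
        return "NON-SMOKER"
    return "PAST-SMOKER" if f["smoked"] else "UNKNOWN"

# Hypothetical record snippets, not challenge data.
examples = [
    "She quit smoking ten years ago.",
    "Denies tobacco or alcohol use.",
    "Admitted with chest pain.",
]
predictions = [classify(features_from_text(t)) for t in examples]
print(predictions)
```

Note that the tree has no leaf for SMOKER, which is consistent with the empty SMOKER column in the confusion matrices for this tree.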
  • 10. 9-feature Decision Tree: 87% accuracy (345/398), 10-fold cross validation on training data
          a    b    c    d    e   <-- classified as
         20    5    1   10    0 |  a = PAST-SMOKER
          0   51    2   13    0 |  b = NON-SMOKER
          0    1  250    1    0 |  c = UNKNOWN
          5    4    2   24    0 |  d = CURR-SMOKER
          0    3    1    5    0 |  e = SMOKER
  • 11. 9-feature Decision Tree: 82% accuracy (85/104), evaluation data
          a    b    c    d    e   <-- classified as
         62    0    1    0    0 |  a = UNKNOWN
          1   10    1    0    4 |  b = NON-SMOKER
          0    2    4    0    5 |  c = PAST-SMOKER
          0    0    0    0    3 |  d = SMOKER
          0    1    1    0    9 |  e = CURR-SMOKER
  • 12. 9-feature Decision Tree: 90% accuracy (94/104), evaluation data
          a    b    f   <-- classified as
         62    0    1 |  a = UNKNOWN
          1   10    5 |  b = NON-SMOKER
          0    3   22 |  f = ALL-SMOKER
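The 3-class matrix follows from the 5-class evaluation matrix by merging PAST-SMOKER, SMOKER, and CURR-SMOKER into ALL-SMOKER on both the true and predicted axes. A short sketch of that collapse, using the 5-class evaluation counts reported for the 9-feature tree (the `collapse` helper is illustrative):

```python
labels = ["UNKNOWN", "NON-SMOKER", "PAST-SMOKER", "SMOKER", "CURR-SMOKER"]
five_class = [
    [62, 0, 1, 0, 0],
    [1, 10, 1, 0, 4],
    [0, 2, 4, 0, 5],
    [0, 0, 0, 0, 3],
    [0, 1, 1, 0, 9],
]
merge = {
    "UNKNOWN": "UNKNOWN",
    "NON-SMOKER": "NON-SMOKER",
    "PAST-SMOKER": "ALL-SMOKER",
    "SMOKER": "ALL-SMOKER",
    "CURR-SMOKER": "ALL-SMOKER",
}

def collapse(matrix, names, mapping):
    """Sum rows and columns of a confusion matrix according to a label mapping."""
    new_names = []
    for name in names:
        if mapping[name] not in new_names:
            new_names.append(mapping[name])
    index = {name: i for i, name in enumerate(new_names)}
    out = [[0] * len(new_names) for _ in new_names]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            out[index[mapping[names[i]]]][index[mapping[names[j]]]] += count
    return new_names, out

new_labels, three_class = collapse(five_class, labels, merge)
correct = sum(three_class[i][i] for i in range(len(three_class)))
total = sum(map(sum, three_class))
print(new_labels, three_class, f"{correct}/{total}")
```

Confusions among the three smoker classes move onto the diagonal, which accounts for the rise from 85 to 94 correct out of 104.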
  • 13. Unsupervised Experiments
       • Bigram Features
           – allow up to 5 intervening words
           – occur 2 or more times in training data
           – limit to those that include "smok" --> 96 features
           – social smoking, pack smoking, smoking alcohol, smoking family, smoke drink, cigarette smoking, allergies smoking, allergies smoked, smoking quit, quit smoking, smoker drinks, former smoker, social smoke, denies smoking, habits smoking, ...
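A bigram "with up to 5 intervening words" is an ordered pair of words separated by at most five other tokens. The following is a rough sketch of that selection, not the SenseClusters code: whitespace tokenization, counting total occurrences rather than records, and the names `windowed_bigrams` and `smoking_bigrams` are assumptions.

```python
from collections import Counter

def windowed_bigrams(tokens, max_gap=5):
    """Yield ordered pairs (first, second) with at most max_gap tokens between them."""
    for i, first in enumerate(tokens):
        # j - i - 1 is the number of intervening tokens, so j can reach i + max_gap + 1.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            yield (first, tokens[j])

def smoking_bigrams(records, max_gap=5, min_count=2, stem="smok"):
    """Bigrams seen at least min_count times in which either word contains `stem`."""
    counts = Counter()
    for text in records:
        counts.update(windowed_bigrams(text.lower().split(), max_gap))
    return sorted(pair for pair, n in counts.items()
                  if n >= min_count and any(stem in word for word in pair))

# Toy records (hypothetical, not challenge data).
records = [
    "she denies any smoking",
    "denies smoking or alcohol",
    "quit smoking",
]
features = smoking_bigrams(records)
print(features)
```

In the toy data only "denies smoking" occurs twice, once adjacent and once with an intervening word, so the window is what lets it pass the frequency cutoff.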
  • 14. Unsupervised Context Representations
       • 2nd order Context Representations
           – Latent Semantic Analysis, native SenseClusters
           – each record represented by a vector that is the average of vectors that represent the individual features:
               • LSA – each bigram is replaced by a vector showing the records in which it occurs
               • native SenseClusters – each word is replaced by a vector showing the second words it occurs with as a bigram
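The native SenseClusters variant can be pictured as two steps: build one vector per first word from the bigram counts, then average the vectors of the words found in a record. The sketch below uses raw co-occurrence counts and hypothetical bigram data; the actual cell values (counts versus association scores) and the helper names are assumptions.

```python
def word_vectors(bigram_counts):
    """Map each first word to a vector of counts over the second words it precedes."""
    second_words = sorted({second for _, second in bigram_counts})
    column = {word: i for i, word in enumerate(second_words)}
    vectors = {}
    for (first, second), n in bigram_counts.items():
        vectors.setdefault(first, [0.0] * len(second_words))[column[second]] += n
    return second_words, vectors

def record_vector(tokens, vectors, dim):
    """Second-order representation: average the vectors of the words in the record."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return [0.0] * dim  # no known feature words in this record
    return [sum(col) / len(rows) for col in zip(*rows)]

# Hypothetical bigram counts for illustration.
bigram_counts = {
    ("quit", "smoking"): 3,
    ("denies", "smoking"): 2,
    ("denies", "alcohol"): 1,
}
second_words, vectors = word_vectors(bigram_counts)
context = record_vector(["patient", "denies", "quit"], vectors, len(second_words))
print(second_words, context)
```

For the LSA variant described on the slide, the same averaging would apply, but each bigram's vector would index the records it occurs in rather than the second words.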
  • 15. Unsupervised Clustering
       • Once vectors for all records are created, they are clustered using a partitional method similar to k-means
       • The number of clusters is automatically discovered using the PK2 measure, which compares successive values of the clustering criterion function
       • Clusters are assigned to categories based on their distribution in the training data
           – unknown, non-smoker, past-smoker, current-smoker, smoker
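One simple way to assign clusters to categories from a training distribution is to give each cluster the label held by most of its training records. The slides do not spell out the exact assignment rule, so the majority vote below is an assumption, and the cluster ids and labels are made-up illustration data.

```python
from collections import Counter, defaultdict

def map_clusters(cluster_ids, gold_labels):
    """Assign each cluster the most frequent gold label among its training records."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in by_cluster.items()}

# Hypothetical clustering of six training records into two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
gold_labels = ["UNKNOWN", "UNKNOWN", "NON-SMOKER",
               "CURR-SMOKER", "PAST-SMOKER", "CURR-SMOKER"]
mapping = map_clusters(cluster_ids, gold_labels)

# New records inherit the category of the cluster they fall into.
predicted = [mapping[c] for c in [1, 0]]
print(mapping, predicted)
```

With fewer clusters than categories, some categories can never be predicted under this rule, which matches the all-zero columns in the unsupervised confusion matrices.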
  • 16. SenseClusters: 69% accuracy (72/104), evaluation data
          a    b    c    d    e   <-- classified as
         63    0    0    0    0 |  a = UNKNOWN
         10    0    0    0    6 |  b = NON-SMOKER
          2    1    0    0    8 |  c = PAST-SMOKER
          1    0    0    0    2 |  d = SMOKER
          2    0    0    0    9 |  e = CURR-SMOKER
  • 17. SenseClusters: 79% accuracy (82/104), evaluation data
          a    b    f   <-- classified as
         63    0    0 |  a = UNKNOWN
         10    0    6 |  b = NON-SMOKER
          5    1   19 |  f = ALL-SMOKER
  • 18. Latent Semantic Analysis: 68% accuracy (71/104), evaluation data
          a    b    c    d    e   <-- classified as
         63    0    0    0    0 |  a = UNKNOWN
         10    0    0    0    6 |  b = NON-SMOKER
          1    3    0    0    7 |  c = PAST-SMOKER
          1    0    0    0    2 |  d = SMOKER
          2    1    0    0    8 |  e = CURR-SMOKER
  • 19. Latent Semantic Analysis: 77% accuracy (80/104), evaluation data
          a    b    f   <-- classified as
         63    0    0 |  a = UNKNOWN
         10    0    6 |  b = NON-SMOKER
          4    4   17 |  f = ALL-SMOKER
  • 20. Conclusions
       • Results dominated by UNKNOWN
           – sets lower bound of 61%
       • Errors dominated by confusion within ALL-SMOKER
           – reduction to 3 classes improves results significantly
       • Decision tree aided feature selection
       • Manual tuning of feature sets was performed, since records focus well beyond smoking status
       • Unsupervised clustering perhaps found the "right" number of clusters, and did well in that light
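The 61% lower bound is the accuracy of always answering UNKNOWN on the evaluation data. It can be reproduced from the per-class totals, which are the row sums of the 5-class evaluation confusion matrices in this deck (variable names below are illustrative):

```python
# Evaluation-set class sizes, taken from the row sums of the 5-class matrices.
class_counts = {
    "UNKNOWN": 63,
    "NON-SMOKER": 16,
    "PAST-SMOKER": 11,
    "SMOKER": 3,
    "CURR-SMOKER": 11,
}

majority_class, majority_count = max(class_counts.items(), key=lambda kv: kv[1])
total = sum(class_counts.values())
baseline = majority_count / total
print(f"{majority_class}: {majority_count}/{total} = {baseline:.1%}")
```

Against this baseline, the 5-class unsupervised results (68% to 69%) are a modest gain, while the 9-feature tree (82%, or 90% with 3 classes) is a substantial one.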
  • 21. Software Resources
       • Supervised Experiments
           – SenseTools (free, from Duluth) http://www.d.umn.edu/~tpederse/sensetools.html
           – Weka (free, from Waikato) http://www.cs.waikato.ac.nz/ml/weka/
       • Unsupervised Experiments
           – SenseClusters (free, from Duluth) http://senseclusters.sourceforge.net
