Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses


Presented at HISB 2012, La Jolla, CA Sep 27-28, 2012


  1. 1. Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses
     Son Doan (1), Lucila Ohno-Machado (1), Nigel Collier (2)
     (1) Division of Biomedical Informatics, University of California San Diego
     (2) National Institute of Informatics, Japan
     IEEE HISB 2012, UCSD, La Jolla, CA, Sep 27-28, 2012
  2. 2. [Figure: surveillance information sources. Axes: Time, Certainty. Sources: rumors, sentinel networks, PCP reports, field workers, laboratory reports]
     Twitter> “Ahh! Really bad throat.”
     Twitter> “I’m sick with a chest infection”
     Twitter> “Still getting worse. Staying at home temp is up to 39.5.”
     News report> “Mystery illness causes concern.”
     News report> “Influenza starts early this year.”
  3. 3. Social media in event tracking
     • Event tracking/predicting:
       – Predicting elections, gasoline price: O’Connor et al. (2010)
       – Predicting the stock market: Bollen et al. (2011)
       – Earthquake warning: Sakaki et al. (2010), Guy et al. (2010)
       – Public mood tracking: Golder and Macy (2011), Doan and Collier (2011)
     • Predicting the Influenza-Like Illness rate:
       – Google Flu Trends: Ginsberg et al. (2009), Valdivia et al. (2010), now extended to dengue tracking (Chan et al. (2012)): used query logs, but the query data is closed
       – Culotta (2009), Lampos and Cristianini (2010), Signorini et al. (2011), Chew and Eysenbach (2011), Doan et al. (2012): used Twitter
  4. 4. Twitter characteristics
     • Twitter posts (tweets) are limited to 140 characters
       – High use of abbreviations and aliases
       – Dynamic lexicon of semantic tags (hashtags)
     • Very high volume of data: 430 million tweets generated per day
     • High number of users: over 500 million active users
     • Metadata: geo-tagging, time stamping, user profile
     • Event reports are sometimes ahead of newswire, e.g. Iranian presidential protests, swine flu outbreak reports from CDC, deaths of famous people (Petrovic et al. 2010)
  5. 5. Twitter corpus
     Timeline: 36 weeks for the US 2009 influenza season (Aug 30, 2009 to May 8, 2010), ‘Gardenhose’ data sampling method (~5% sampling rate from the whole data)
     Name        Total
     Tweets      587,290,394
     Users       23,571,765
     URL         136,034,309
     HashTags    96,399,587
     Thanks to Brendan O’Connor (CMU) and Twitter Inc.
  6. 6. Existing methods: empirical approach for predicting the ILI rate
     • Pipeline: Twitter corpus -> ILI-related keywords filtering -> ILI-related tweets
     • Case definition from CDC: Influenza-like Illness (ILI) = fever (> 100°F)* AND cough and/or sore throat (in the absence of a known cause other than influenza)
       – *Temperature can be measured in the office or at home
     • Every year: 3 to 5 million cases of severe illness, 250 000 to 500 000 deaths (WHO 2009)
     • Keyword sets:
       – Culotta4: flu, cough, headache, sore throat
       – Signorini3: swine, flu, influenza
       – Chew3: h1n1, swine flu, swineflu
     • Gold standard from laboratory data reported by the US Outpatient Influenza-Like Illness Surveillance Network (ILINet) (CDC)
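The CDC case definition on this slide is a simple boolean rule, and a minimal sketch may make it concrete. The function name and parameters below are hypothetical and chosen for illustration only; the threshold and the AND/OR structure come from the slide.

```python
def meets_ili_case_definition(temperature_f, has_cough, has_sore_throat,
                              known_other_cause=False):
    """Return True if the inputs satisfy the ILI case definition quoted on the slide.

    ILI = fever (> 100 degrees F) AND (cough and/or sore throat),
    in the absence of a known cause other than influenza.
    """
    has_fever = temperature_f > 100.0
    has_respiratory_symptom = has_cough or has_sore_throat
    return has_fever and has_respiratory_symptom and not known_other_cause


if __name__ == "__main__":
    # A 100.9 F fever with a cough, as in one of the sample tweets.
    print(meets_ili_case_definition(100.9, has_cough=True, has_sore_throat=False))
```

Note that tweets rarely state all of these facts explicitly, which is why the methods on this slide fall back to keyword matching.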
  7. 7. Our approach: two-step filtering
     • Twitter corpus -> Step 1: syndrome-related filtering -> Step 2: semantic filtering
     • Step 1 (knowledge-based approach): Syndrome only; Syndrome + “flu”; Syndrome + “flu” - URL
     • Step 2 (semantic level): Negation, Emoticon, HashTags, Humor, Geo
  8. 8. Knowledge-based approach
     If the tweeter is referring to someone else’s symptom then filter out. Only retain if the tweeter is referring to their own symptoms.
     • Syndrome only: tweets containing syndrome keywords
       – Example: Barber just coughed on me in the chair.
     • Syndrome + “flu”: tweets containing syndrome keywords and “flu”
       – Example: I got flu n coughed a lot.
     • Syndrome + “flu” - URL: tweets containing syndrome keywords and “flu”, remove links
       – Example: 7-year-old boy dies of flu, pneumonia <URL>
  9. 9. Snapshot of BioCaster ontology
  10. 10. Extract syndrome-related keywords from BioCaster ontology
      We extracted keywords only from the respiratory syndrome (37 respiratory syndrome keywords):
      achy chest                cold symptom       respiratory failure
      apnea                     cough              runny nose
      asthma                    dyspnea            short of breath
      asthmatic                 dyspnoea           shortness of breath
      blocked nose              gasping for air    sinusitis
      breathing difficulties    lung sounds        sore throat
      breathing trouble         pneumonia          stop breathing
      bronchitis                rales              stuffy nose
      …                         …                  …
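As a rough illustration of how the three knowledge-based variants (Syndrome only, Syndrome + “flu”, Syndrome + “flu” - URL) could be applied with these keywords, here is a minimal sketch. It is not the authors' implementation. Several things are assumptions: the function names, the use of plain substring matching, the keyword subset, and the reading of “remove links” as dropping tweets that contain a link.

```python
import re

# Illustrative subset of the respiratory syndrome keywords from the slide;
# the full list has 37 entries, some of which are elided in the source.
SYNDROME_KEYWORDS = (
    "cough", "sore throat", "runny nose", "stuffy nose",
    "pneumonia", "bronchitis", "shortness of breath",
)

# Assumption: a link is either a literal http(s) URL or a "<URL>" placeholder.
URL_PATTERN = re.compile(r"https?://\S+|<\s*URL\s*>", re.IGNORECASE)


def has_syndrome(text):
    """True if any keyword occurs as a substring (so "coughed" matches "cough")."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SYNDROME_KEYWORDS)


def has_flu(text):
    """True if "flu" occurs anywhere, including inside "swineflu" or "influenza"."""
    return "flu" in text.lower()


def has_url(text):
    return URL_PATTERN.search(text) is not None


def knowledge_filter(text, variant):
    """Return True if the tweet is retained under the given (hypothetical) variant name."""
    if variant == "syndrome":
        return has_syndrome(text)
    if variant == "syndrome+flu":
        return has_syndrome(text) and has_flu(text)
    if variant == "syndrome+flu-url":
        # Assumption: tweets that contain a link are dropped entirely.
        return has_syndrome(text) and has_flu(text) and not has_url(text)
    raise ValueError(f"unknown variant: {variant}")


if __name__ == "__main__":
    for tweet in ("Barber just coughed on me in the chair.",
                  "I got flu n coughed a lot.",
                  "7-year-old boy dies of flu, pneumonia <URL>"):
        print(tweet, knowledge_filter(tweet, "syndrome+flu-url"))
```

Substring matching is deliberately crude; a tokenizer or stemmer would be a natural refinement, but the slides do not say which matching strategy was used.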
  11. 11. Semantic level filtering
      • Negation: remove negation in tweets
        – Example: I don’t have flu
      • Emoticon: remove tweets containing smiley emoticons, e.g., :-), :D
        – Example: Glad to hear that you’re beating the flu. :-) Hope you don’t get the nasty cough that everyone’s getting this year
      • HashTags: keep tweets containing keyword “flu”
        – Example: Still coughing smh #swineflu #h1n1
      • Humor: remove humor features in tweets, e.g., “haha”, “hihi”, “***cough … cough***”
        – Example: Hm Im kinda wanting to go to NYC really soon ***cough … cough*** @Ctmomofsix =)
      • Geo: tweets from geographical locations (e.g., US)
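The semantic-level filters can be sketched as a chain of simple checks. This is a hedged illustration, not the authors' code. The regular expressions, the `text` and `country` field names, and the function name are all assumptions. The negation check is a plain word list, not the part-of-speech rule from the slides, and the HashTags filter is left out because the slide does not specify how it combines with the others.

```python
import re

# Assumption: a small word list stands in for the slide's POS-based negation rule.
NEGATION = re.compile(r"\b(?:don['’]?t|doesn['’]?t|didn['’]?t|not|no|never)\b",
                      re.IGNORECASE)

# Smileys such as :-) :) :D :-D (the slide lists ":-)" and ":D" as examples).
SMILEY = re.compile(r":-?[)D]")

# Humor markers from the slide: "haha", "hihi", and "***cough ... cough***".
HUMOR = re.compile(r"\b(?:ha){2,}\b|\b(?:hi){2,}\b|\*+\s*cough\b.*?\bcough\s*\*+",
                   re.IGNORECASE)


def passes_semantic_filters(tweet, allowed_countries=("US",)):
    """Return True if a tweet survives the negation, emoticon, humor and geo checks.

    `tweet` is a dict with hypothetical keys "text" and "country".
    """
    text = tweet["text"]
    if NEGATION.search(text):
        return False
    if SMILEY.search(text):
        return False
    if HUMOR.search(text):
        return False
    return tweet.get("country") in allowed_countries


if __name__ == "__main__":
    sample = {"text": "I got flu n coughed a lot.", "country": "US"}
    print(passes_semantic_filters(sample))
```

A word-list negation check is blunt: it would also discard a tweet such as “Swine flu not ruled out”, which is one reason a tag-based rule may be preferable.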
  12. 12. Detecting negation in Twitter
      [Figure: semantic tags for an example tweet]
      Rule A: If VBZ is followed by XX then that sentence is negative
  13. 13. Correlation to the CDC data
      Approach                    Method                                           Pearson corr (%)
      Empirical approach          Culotta4                                         94.85
                                  Signorini4                                       94.73
                                  Chew3                                            94.48
      Knowledge-based approach    Syndrome only                                    88.60
                                  Syndrome + “flu”                                 97.13
                                  Syndrome + “flu” - URL                           97.52* (p=0.06)
      Semantic-based level        Negation                                         97.65
                                  Emoticon                                         97.52
                                  HashTags                                         97.61
                                  Humor                                            97.65
                                  Geo                                              98.39
                                  Negation + Emoticon + HashTags + Humor + Geo     98.46* (p=0.007)
      Note: Google Flu Trends got 99.12%!!! (using whole Google query logs)
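For readers unfamiliar with the metric, the sketch below shows how a Pearson correlation between a weekly count of filtered tweets and a weekly ILI rate could be computed. The function is a standard textbook formula; the two short series are made-up toy values for illustration only and are not taken from the study or from CDC data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Toy, made-up weekly values (NOT data from the study).
toy_weekly_tweet_counts = [120, 150, 310, 480, 400, 220]
toy_weekly_ili_rate = [1.1, 1.4, 2.9, 4.6, 3.8, 2.0]

if __name__ == "__main__":
    r = pearson(toy_weekly_tweet_counts, toy_weekly_ili_rate)
    # The slide reports correlations as percentages, i.e. 100 * r.
    print(f"Pearson corr: {100 * r:.2f}%")
```

On Python 3.10 or later, `statistics.correlation` from the standard library computes the same quantity.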
  14. 14. Correlation to the CDC data (cont’d)
  15. 15. Semantic-level filtered tweets (types and tweet samples)
      • Influenza confirmation: I got flu n coughed a lot. Now my voice is like monster’s voice. Rrr
      • Influenza symptoms: My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). Swine flu not ruled out. #H1N1
      • Flu shots: I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia
      • Self protection: Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community from #H1N1
      • Medication: Wondering why I didn’t take the flu shot, laying in bed with cough drops, medicine, and the remote
  16. 16. Challenges
      • Technical issues:
        – Data sampling: only ~5% sampling rate
      • Semantic issues:
        – Metaphoric symptoms: Cabin fever setting in right now.
        – Interrogative sentences: wonder how long u get off work with swine flu?
        – Hypothetical sentences: I can ignore this sore throat no longer. And, um, maybe I should have gotten that H1N1 vaccine.
        – Others: Too much lemonade. My throat is burning.
  17. 17. Summary
      • We proposed a general and extendable approach for tweet filtering based on an ontology of infectious diseases (BioCaster Ontology)
        – This methodology can be applied to other languages, e.g., Spanish, Japanese
      • Our best results showed significant improvement in comparison to state-of-the-art keyword filtering methods
      • Using simple semantic filtering in Twitter can improve correlation with CDC data
  18. 18. DIZIE: system for syndromic surveillance on Twitter
      • Syndromes: Gastrointestinal, Respiratory, Neurological, Dermatological, Haemorrhagic, Musculoskeletal
      • 40 main world cities
      Collier and Doan. eHealth 2012;186-95
  19. 19. Acknowledgements
      • Assoc. Prof. Wendy W. Chapman, PhD, DBMI, UCSD
      • Mike Conway, PhD, DBMI, UCSD
      • Grant-in-aid funding from the National Institute of Informatics, Japan