Timely, reliable information helps governments take the right actions to reduce the length and severity of an infectious disease outbreak. This information is important not only for pandemic influenza but also for many other diseases such as measles and mumps, as well as more exotic diseases like chikungunya. Governments in advanced countries like Japan have access to many sources of information within their own borders. These range from the very reliable, such as laboratory reports, to indirect indicators such as statistics on drug sales. However, the quickest source of information is often rumours. These can be individual messages published on Web sites like Twitter or news reports published in the media.
Twitter is an example of a microblogging service. Users post messages (tweets) of up to 140 characters. This enables them to post personal information on the go from mobile SMS devices wherever they happen to be. Hand in hand with the short messaging style goes a highly abbreviated vocabulary. We often see special abbreviations and semantic tags called hashtags that are developed on the fly to describe new concepts such as H1N1 influenza. Volumes also tend to be very high. Although official statistics are hard to find, the Twitter developers’ conference mentioned 106 million users in 2010 and the BBC mentioned over 200 million users in 2011. Although this is a fraction of the total world population, it might still be possible to use Twitter messages for alerting in major cities where there is a high density of users.
Talk here about the difficult cases – how they are classified and how we might overcome them in the future.
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses
Son Doan (1), Lucila Ohno-Machado (1), Nigel Collier (2)
(1) Division of Biomedical Informatics, University of California San Diego
(2) National Institute of Informatics, Japan
IEEE HISB 2012, UCSD, La Jolla, CA, Sep 27-28, 2012
[Figure: information sources arranged by time and certainty: rumors, sentinel networks, PCP reports, field workers, laboratory reports]
Example rumors:
• Twitter> “I’m sick with a chest infection”
• Twitter> “Ahh! Really bad throat.”
• Twitter> “Still getting worse. Staying at home temp is up to 39.5.”
• News report> “Mystery illness causes concern.”
• News report> “Influenza starts early this year.”
Social media in event tracking
• Event tracking/predicting:
  – Predicting elections and gasoline prices: O’Connor et al. (2010)
  – Predicting the stock market: Bollen et al. (2011)
  – Earthquake warning: Sakaki et al. (2010), Guy et al. (2010)
  – Public mood tracking: Golder and Macy (2011), Doan and Collier (2011)
• Predicting the influenza-like illness (ILI) rate:
  – Google Flu Trends: Ginsberg et al. (2009), Valdivia et al. (2010), now extended to dengue tracking (Chan et al. (2012)), used query logs, but the query data is closed
  – Culotta (2009), Lampos and Cristianini (2010), Signorini et al. (2011), Chew and Eysenbach (2011), and Doan et al. (2012) used Twitter
Twitter characteristics
• Twitter posts (tweets) are limited to 140 characters
  – High use of abbreviations and aliases
  – Dynamic lexicon of semantic tags (hashtags)
• Very high volume of data: 430 million tweets generated per day
• High number of users: over 500 million active users
• Metadata: geo-tagging, time stamping, user profile
• Event reports are sometimes ahead of newswire, e.g. Iranian presidential protests, swine flu outbreak reports from CDC, deaths of famous people (Petrovic et al. 2010)
Twitter corpus
Timeline: 36 weeks of the US 2009 influenza season (Aug 30, 2009 to May 8, 2010), ‘Gardenhose’ data sampling method (~5% sampling rate from the whole data)
Name         Total
Tweets       587,290,394
Users        23,571,765
URL          136,034,309
Hash Tags    96,399,587
[Figure: chart with axis ticks at 5, 10, 15, 20, and 25 mil]
Thanks to Brendan O’Connor (CMU) and Twitter Inc.
Existing methods: empirical approach for predicting the ILI rate
Pipeline: Twitter corpus → ILI-related keywords filtering → ILI-related tweets
Case definition from CDC: Influenza-like Illness (ILI) = fever (> 100°F)* AND cough and/or sore throat (in the absence of a known cause other than influenza)
*Temperature can be measured in the office or at home
Every year: 3~5 million cases of severe illness, 250 000 – 500 000 deaths (WHO 2009)
ILI-related keywords used in earlier work:
• Culotta (4): flu, cough, headache, sore throat
• Signorini (3): swine, flu, influenza
• Chew (3): h1n1, swine flu, swineflu
Gold standard from laboratory data reported by the US Outpatient Influenza-Like Illness Surveillance Network (ILINet) (CDC)
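The keyword filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in any of the cited studies: the keyword sets are the ones listed on the slide, while the function names and the choice of case-insensitive whole-word matching are assumptions made here.

```python
import re

# Keyword sets as listed on the slide for three earlier studies.
KEYWORD_SETS = {
    "Culotta": ["flu", "cough", "headache", "sore throat"],
    "Signorini": ["swine", "flu", "influenza"],
    "Chew": ["h1n1", "swine flu", "swineflu"],
}


def compile_keywords(keywords):
    """Build one regex matching any keyword as a whole word or phrase.

    Whole-word, case-insensitive matching is an assumption of this sketch.
    """
    # Longer keywords first so that phrases win over their sub-words.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def filter_tweets(tweets, keywords):
    """Return the tweets that mention at least one of the keywords."""
    pattern = compile_keywords(keywords)
    return [t for t in tweets if pattern.search(t)]


if __name__ == "__main__":
    sample = [
        "I got flu n coughed a lot.",
        "Cabin fever setting in right now.",
        "Still coughing smh #swineflu #h1n1",
    ]
    for name, keywords in KEYWORD_SETS.items():
        print(name, filter_tweets(sample, keywords))
```

Counting the retained tweets per week would then give the time series that is compared against the ILINet rate.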
Knowledge-based approach
If the tweeter is referring to someone else’s symptom then filter out. Only retain if the tweeter is referring to their own symptoms.
• Syndrome: only tweets containing syndrome keywords. Example: “Barber just coughed on me in the chair.”
• Syndrome + “flu”: tweets containing syndrome keywords and “flu”. Example: “I got flu n coughed a lot.”
• Syndrome + “flu” - URL: tweets containing syndrome keywords and “flu”, remove links. Example: “7-year-old boy dies of flu, pneumonia <URL>”
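A rough Python sketch of the three filters follows. It is an illustration under stated assumptions rather than the authors' implementation: the keyword list is only a handful of the respiratory terms shown in the slides, keywords are allowed to carry a suffix so that "coughed" matches "cough", and the URL pattern and function names are hypothetical.

```python
import re

# A few of the respiratory syndrome keywords shown in the slides.
# The full BioCaster-derived list has 37 entries and is not reproduced here.
SYNDROME_KEYWORDS = [
    "cough", "sore throat", "runny nose",
    "pneumonia", "bronchitis", "short of breath",
]

# Assumption: a keyword may be followed by a suffix ("cough" -> "coughed").
SYNDROME_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SYNDROME_KEYWORDS) + r")\w*",
    re.IGNORECASE,
)
FLU_RE = re.compile(r"\bflu\b", re.IGNORECASE)
# Assumption: links are raw URLs or an anonymised "<URL>" placeholder.
URL_RE = re.compile(r"https?://\S+|www\.\S+|<\s*URL\s*>", re.IGNORECASE)


def syndrome(tweet):
    """Filter 1: the tweet contains a syndrome keyword."""
    return bool(SYNDROME_RE.search(tweet))


def syndrome_flu(tweet):
    """Filter 2: the tweet contains a syndrome keyword and 'flu'."""
    return syndrome(tweet) and bool(FLU_RE.search(tweet))


def syndrome_flu_no_url(tweet):
    """Filter 3: as filter 2, but tweets containing links are dropped."""
    return syndrome_flu(tweet) and not URL_RE.search(tweet)


if __name__ == "__main__":
    for text in [
        "Barber just coughed on me in the chair.",
        "I got flu n coughed a lot.",
        "7-year-old boy dies of flu, pneumonia <URL>",
    ]:
        print(syndrome(text), syndrome_flu(text), syndrome_flu_no_url(text), text)
```

Each filter is strictly narrower than the previous one, so the three can be compared on the same corpus. None of them decides whether the tweeter is describing their own symptoms; that step is not covered by this sketch.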
Extract syndrome-related keywords from BioCaster ontology
We extracted keywords only from the respiratory syndrome (37 respiratory syndrome keywords):
• achy chest, apnea, asthma, asthmatic, blocked nose, breathing difficulties, breathing trouble, bronchitis, …
• cold symptom, cough, dyspnea, dyspnoea, gasping for air, lung sounds, pneumonia, rales, …
• respiratory failure, runny nose, short of breath, shortness of breath, sinusitis, sore throat, stop breathing, stuffy nose, …
Semantic level filtering
• Negation: remove negation in tweets. Example: “I don’t have flu”
• Emoticon: remove tweets containing smiley emoticons, e.g., :-), :D. Example: “Glad to hear that you’re beating the flu. :-) Hope you don’t get the nasty cough that everyone’s getting this year”
• HashTags: keep tweets containing keyword “flu”. Example: “Still coughing smh #swineflu #h1n1”
• Humor: remove humor features in tweets, e.g., “haha”, “hihi”, “***cough … cough***”. Example: “Hm Im kinda wanting to go to NYC really soon ***cough … cough*** @Ctmomofsix =)”
• Geo: tweets from geographical locations (e.g., US)
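The negation, emoticon, and humor filters can be approximated with regular expressions, as in the hedged sketch below. The patterns and the names `semantic_flags` and `keep_tweet` are assumptions chosen for illustration; the slides do not give the actual rules, and the hashtag and geo filters are left out because they depend on details not stated here.

```python
import re

# Assumption: a short list of English negation cues, not the authors' list.
NEGATION_RE = re.compile(
    r"\b(?:not|no|never|don'?t|doesn'?t|didn'?t|haven'?t|isn'?t)\b",
    re.IGNORECASE,
)
# Smiley emoticons such as :-) :) :D ;) =)
EMOTICON_RE = re.compile(r"[:;=]-?[)D]")
# Humor markers named on the slide: "haha", "hihi", "***cough ... cough***".
HUMOR_RE = re.compile(
    r"\b(?:ha){2,}\b|\b(?:hi){2,}\b|\*+\s*cough.*?cough\s*\*+",
    re.IGNORECASE,
)


def semantic_flags(tweet):
    """Return the set of semantic features found in the tweet."""
    text = tweet.replace("\u2019", "'")  # normalise curly apostrophes
    flags = set()
    if NEGATION_RE.search(text):
        flags.add("negation")
    if EMOTICON_RE.search(text):
        flags.add("emoticon")
    if HUMOR_RE.search(text):
        flags.add("humor")
    return flags


def keep_tweet(tweet):
    """Keep a tweet only if none of the removal features is present."""
    return not semantic_flags(tweet)


if __name__ == "__main__":
    for text in [
        "I don't have flu",
        "Still coughing smh #swineflu #h1n1",
        "really soon ***cough ... cough*** @Ctmomofsix =)",
    ]:
        print(sorted(semantic_flags(text)), keep_tweet(text), text)
```

Returning the set of flags instead of a single boolean makes it easy to switch individual filters on and off when comparing their effect.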
Detecting negation in Twitter
[Table: semantic tags and example; contents not recovered]
Rule A: If VBZ is followed by XX then that sentence is negative
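Rule A can be read as a check over a tagged token sequence. The sketch below assumes the tweet has already been tagged by some external tagger and that "followed by" means immediately followed by. The tag XX is kept as the slide's placeholder, and the tagged examples are hand-written for illustration only; which tokens actually receive that tag is not stated in the slides.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)


def tag_followed_by(tagged: List[TaggedToken], first: str, second: str) -> bool:
    """True if a token tagged `first` is immediately followed by one tagged `second`."""
    return any(
        left[1] == first and right[1] == second
        for left, right in zip(tagged, tagged[1:])
    )


def is_negative(tagged: List[TaggedToken]) -> bool:
    """Rule A: VBZ followed by XX marks the sentence as negative."""
    return tag_followed_by(tagged, "VBZ", "XX")


if __name__ == "__main__":
    # Hypothetical tagging: assigning XX to "no" is an assumption of this example.
    negated = [("She", "PRP"), ("has", "VBZ"), ("no", "XX"), ("flu", "NN")]
    plain = [("She", "PRP"), ("has", "VBZ"), ("flu", "NN")]
    print(is_negative(negated), is_negative(plain))
```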
Semantic-level filtered tweets
Types and tweet samples:
• Influenza confirmation: “I got flu n coughed a lot. Now my voice is like monster’s voice. Rrr”
• Influenza symptoms: “My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). Swine flu not ruled out. #H1N1”
• Flu shots: “I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia”
• Self protection: “Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community from #H1N1”
• Medication: “Wondering why I didn’t take the flu shot, laying in bed with cough drops, medicine, and the remote”
Challenges
• Technical issues:
  – Data sampling: only ~5% sampling rate
• Semantic issues:
  – Metaphoric symptoms: “Cabin fever setting in right now.”
  – Interrogative sentences: “wonder how long u get off work with swine flu?”
  – Hypothetical sentences: “I can ignore this sore throat no longer. And, um, maybe I should have gotten that H1N1 vaccine.”
  – Others: “Too much lemonade. My throat is burning.”
Summary
• We proposed a general and extensible approach for tweet filtering based on an ontology of infectious diseases (BioCaster Ontology)
  – This methodology can be applied to other languages, e.g., Spanish, Japanese
• Our best results showed a significant improvement over state-of-the-art keyword filtering methods
• Using simple semantic filtering in Twitter can improve correlation with CDC data
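To make "correlation with CDC data" concrete, the sketch below computes a Pearson correlation between a weekly series of filtered tweet counts and a weekly ILI rate. Pearson's coefficient is an assumption here, since the slides do not name the measure, and the numbers in the usage example are made up for illustration; they are not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy data: filtered tweets per week and ILI rate (%) per week.
    weekly_tweet_counts = [120, 340, 560, 410, 150]
    weekly_ili_percent = [1.1, 2.9, 5.2, 3.8, 1.4]
    print(round(pearson(weekly_tweet_counts, weekly_ili_percent), 3))
```

Running the same comparison for each filter variant gives one coefficient per method, which is how a keyword baseline and a semantically filtered series can be ranked against the same gold standard.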
DIZIE: system for syndromic surveillance on Twitter
http://born.nii.ac.jp/dizie/
• Syndromes: Gastrointestinal, Respiratory, Neurological, Dermatological, Haemorrhagic, Musculoskeletal
• Coverage: 40 main world cities
Collier and Doan. eHealth 2012;186-95
Acknowledgements
• Assoc. Prof. Wendy W. Chapman, PhD, DBMI, UCSD
• Mike Conway, PhD, DBMI, UCSD
• Grant-in-aid funding from the National Institute of Informatics, Japan