Text mining in action: early detection of disease outbreaks from online media
Nigel Collier, Associate Professor
National Institute of Informatics, Tokyo, and Japan Science and Technology Agency SAKIGAKE
email@example.com
sites.google.com/site/nhcollier/
PI of the “BioCaster” project (JST, Sakigake grant-in-aid)
AAAS Annual Meeting, Vancouver, Saturday 19th February 2012 (13:30-16:30)
[Figure: reporting sources plotted against two axes, Time and Certainty. Sources shown: rumours, sentinel networks, GP reports, field workers, laboratory reports.]
Example signals from online media:
• Blog rumour> “I’m sick with a chest infection”
• Blog rumour> “Ahh! Really bad throat.”
• Blog rumour> “Still getting worse. Staying at home, temp is up to 39.5.”
• News report> “Mystery illness causes concern.”
• News report> “Influenza starts early this year.”
http://born.nii.ac.jp
• Ontology browsing
• Trend graphs
• Email/GeoRSS alerting
• Watchboard, etc.
• Event database search
• Up to date news in 12 languages
• Event summaries
• Event alerts
• Real time Twitter analysis
GHSAG partners: WHO, US, IT, UK, JP, FR, CA, DE
Technical challenges: REAL TIME SCALING
• X0,000 news providers
• 30,000-40,000 news items/day
• 900 on topic/day
• 200 events/day
• 4 alerts/day
Technical challenges: MULTILINGUALITY
• The same disease name across languages: 鳥インフルエンザ (Japanese), Avian Flu (English), Influenza aviaire (French), Cúm gia cầm (Vietnamese), 조류인플루엔자 (Korean)
[Figure: percentage of news by language, and news event counts for the porcine foot-and-mouth outbreak in South Korea, 2010-2011. Legend: English, Chinese, German, Russian, Korean, French, Vietnamese, Portuguese, Other.]
• Increased sensitivity and timeliness from multilingual news
Technical challenges: AMBIGUITY
• Temporal identification: “The Spanish flu outbreak…”
• Entity identification: “Obama fever builds as Americans await a new era”
• Toponym grounding: “Equine influenza in Camden” could refer to Camden (UK), Camden (AU), Camden (CA) + 19 others
• Variant transliterations: Tajoura, Tajura, Tajoora…
• Coreference: “Two British holidaymakers fell ill…” and “Two male pensioners died…”: 2 or 4 victims?
A snapshot of the BioCaster ontology
Kawazoe, A., Chanlekha, H., Shigematsu, M. and Collier, N. (2008), “Structuring an event ontology for disease outbreak detection”, in BMC Bioinformatics, 9 (Suppl 3):S8.
Collier, N., Kawazoe, A., Jin, L., Shigematsu, M., Dien, D., Barrero, R., Takeuchi, K. and Kawtrakul, A. (2007), “A multilingual ontology for infectious disease surveillance: rationale, design and challenges”, Language Resources and Evaluation, Springer, DOI: 10.1007/s10579-007-9019-7.
Extant technology gaps
– How can we understand ‘norms’ and detect their violations?
• Time series analysis and summarization
– How do we integrate event features?
• Across languages
• Across media types
• Across ontologies/granularities
– How do we rapidly adapt surveillance systems to new vocabulary/event types/domains?
5 detection algorithms
1. Early Aberration Reporting System (EARS) C2 algorithm
– captures the number of standard deviations by which the current count exceeds the history mean;
– $S_t = \max(0, (C_t - (\mu_t + k\sigma_t)) / \sigma_t)$, where $C_t$ is the current count and $\mu_t$, $\sigma_t$ are the mean and standard deviation of the history window
2. EARS C3 algorithm
– similar to C2, except that C3 uses a weighted sum of the previous 3 days for the current period;
3. W2 algorithm
– a modified version of C2 that ignores history counts on Saturdays and Sundays to compensate for day-of-week effects;
4. F statistic
– compares the variance in the history window ($\sigma_b^2$) to the variance in the current window ($\sigma_t^2$);
– $S_t = \sigma_t^2 / \sigma_b^2$
5. Exponentially Weighted Moving Average (EWMA)
– gives less weight to days in the history that are further from the test day;
– $S_t = (Y_t - \mu_t) / [\sigma_t (\lambda / (2 - \lambda))^{1/2}]$, where $Y_1 = C_1$ and $Y_t = \lambda C_t + (1 - \lambda) Y_{t-1}$
Model parameters were estimated based on an additional 5 epidemic data sets from ProMED-mail (data not shown).
Burkom, H. S. (2005), “Accessible Alerting Algorithms for Biosurveillance”, National Syndromic Surveillance Conference.
Jackson, M. L. et al. (2007), “A simulation study comparing aberration detection algorithms for syndromic surveillance”, BMC Medical Informatics and Decision Making, 7(6), DOI: 10.1186/1472-6947-7-6.
Madoff, L. (2004), “ProMED-mail: An early warning system for emerging diseases”, Clin Infect Dis, 39(2): 227–232.
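The C2 and EWMA formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the BioCaster implementation: the 7-day history window, the 2-day guard band between history and test day, `k = 1.0`, `lam = 0.4`, the `MIN_SIGMA` floor and the example counts are all assumptions chosen for the sketch, since the slide does not give parameter values.

```python
from math import sqrt
from statistics import mean, stdev

# Assumption: floor on sigma to avoid division by zero for flat histories.
MIN_SIGMA = 0.2


def history_stats(counts, t, baseline=7, guard=2):
    """Mean and standard deviation of the history window before day t.

    The window covers `baseline` days and ends `guard` days before t
    (both values are assumptions). Returns None if history is too short.
    """
    start = t - guard - baseline
    if start < 0:
        return None
    window = counts[start:t - guard]
    return mean(window), max(stdev(window), MIN_SIGMA)


def c2_scores(counts, baseline=7, guard=2, k=1.0):
    """S_t = max(0, (C_t - (mu_t + k*sigma_t)) / sigma_t) for each day."""
    scores = []
    for t, c in enumerate(counts):
        stats = history_stats(counts, t, baseline, guard)
        if stats is None:
            scores.append(None)  # not enough history yet
            continue
        mu, sigma = stats
        scores.append(max(0.0, (c - (mu + k * sigma)) / sigma))
    return scores


def ewma_scores(counts, baseline=7, guard=2, lam=0.4):
    """S_t = (Y_t - mu_t) / (sigma_t * sqrt(lam / (2 - lam))).

    Y_1 = C_1 and Y_t = lam*C_t + (1 - lam)*Y_{t-1}.
    """
    scores = []
    y = None
    for t, c in enumerate(counts):
        y = c if y is None else lam * c + (1 - lam) * y
        stats = history_stats(counts, t, baseline, guard)
        if stats is None:
            scores.append(None)
            continue
        mu, sigma = stats
        scores.append((y - mu) / (sigma * sqrt(lam / (2 - lam))))
    return scores


# Hypothetical daily news event counts with a surge on the last day.
counts = [2, 3, 2, 4, 3, 2, 3, 3, 2, 3, 20]
c2 = c2_scores(counts)
ewma = ewma_scores(counts)
```

An alarm would be raised on any day whose score exceeds a chosen threshold; the threshold itself is a tuning parameter that the slide leaves unspecified.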
Test Data
#    Disease            Country       ProMED-alerts
1    Hand, foot, mouth  PR China      9
2    Ebola              Congo         17
3    Yellow fever       Brazil        28
4    Influenza          USA           21
5    Cholera            Iraq          5
6    Chikungunya        Singapore     8
7    Anthrax            USA           15
8    Yellow fever       Argentina     5
9    Ebola Reston       Philippines   15
10   Influenza          Egypt         49
11   Plague             USA           8
12   Dengue             Brazil        27
13   Dengue             Indonesia     14
14   Measles            UK            13
15   Chikungunya        Malaysia      15
16   Yellow fever       Senegal       0
17   Influenza          Indonesia     35
18   Influenza          Bangladesh    3
• 14 countries and 11 infectious disease types
• 366 days of news data was collected from BioCaster for each disease and country
• The study period is 17th June 2008 to 17th June 2009
Evaluation of time series algorithms
                  C3                 C2                 W2                 F-statistic        EWMA
Sensitivity       0.74 (0.69-0.78)   0.66 (0.61-0.72)   0.66 (0.60-0.71)   0.78 (0.74-0.82)   0.73 (0.68-0.78)
Specificity       0.96 (0.95-0.96)   0.98 (0.98-0.98)   0.98 (0.98-0.99)   0.92 (0.91-0.92)   0.95 (0.94-0.96)
PPV               0.55 (0.98-0.99)   0.64 (0.98-0.99)   0.65 (0.98-0.99)   0.46 (0.98-0.99)   0.47 (0.98-0.99)
NPV               0.98 (0.98-0.99)   0.98 (0.98-0.99)   0.98 (0.98-0.99)   0.98 (0.98-0.98)   0.98 (0.98-0.99)
Alarms/100 days   6.48               4.52               4.17               12.34              7.85
F-measure         0.63               0.65               0.66               0.58               0.58
Results in parentheses show 95% confidence intervals.
Collier, N. (2009), “What’s unusual in online disease outbreak news?”, in BMC Biomedical Semantics, 1(2).
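For readers less familiar with the measures in the evaluation table, the sketch below shows how they are conventionally derived from day-level confusion counts. The function names and the example counts are hypothetical; the slide does not say how alarm days were matched to ProMED alert days, so that step is left out. The F-measure is taken here as the harmonic mean of PPV and sensitivity, which is consistent with the C3 column (PPV 0.55, sensitivity 0.74, F-measure 0.63).

```python
def f_measure(ppv, sensitivity):
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative day counts."""
    sensitivity = tp / (tp + fn)   # share of outbreak days that raised an alarm
    specificity = tn / (tn + fp)   # share of quiet days with no alarm
    ppv = tp / (tp + fp)           # share of alarms that were correct
    npv = tn / (tn + fn)           # share of non-alarm days that were quiet
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_measure": f_measure(ppv, sensitivity),
    }


def alarms_per_100_days(tp, fp, n_days):
    """Alarm rate, counting every alarm whether correct or not."""
    return 100 * (tp + fp) / n_days


# Hypothetical counts for a 100-day series (not taken from the study).
metrics = diagnostic_metrics(tp=8, fp=2, tn=88, fn=2)
rate = alarms_per_100_days(tp=8, fp=2, n_days=100)
```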
Time from outbreak news to outbreak detection
• Outbreak characteristics: early surge vs multi-modal transmission
[Figure: news event frequency over time. Source: BioCaster.]
• Testing data sets for a range of diseases used in Collier, N. (2010), “Towards cross-lingual alerting for bursty epidemic events”, J. Biomed. Semantics, 2 (Suppl 5):S10.
• Best performance using EARS C3 algorithm on multilingual news event counts: 4 days earlier than ProMED with an F-measure of 0.56 and 12.0 alarms/100 days.
The landscape of Web sensing for public health
[Figure: systems positioned by the media types they monitor. Media types shown: newswire, radio, SMS/microblog, query, online signals, social networks, lifestream, share, discuss, livecast.]
Systems shown:
• GPHIN
• EpiSpider (Tolentino et al. 2007)
• MiTaP (Damianos et al. 2002)
• BioCaster (Collier et al. 2008)
• Argus (Wilson et al. 2008)
• Medisys (Yangarber et al. 2007)
• HealthMap (Freifeld et al. 2008)
• ProMED-mail (Madoff 2004)
• Ushahidi (Okolloh et al. 2009)
• Twitter Earthquake Detector (Guy et al. 2010)
• Google Flu Trends (Ginsberg et al. 2009)
Classification scheme
• Disease spread can be strongly influenced by behavioural changes
• After surveying Twitter messages we conflated Jones and Salathe’s groupings into three categories and added two new ones:
– (A) Avoiding behaviour
• Avoid people who cough/sneeze, avoid large gatherings of people, avoid public transportation, avoid travel to infected areas
– (I) Increased sanitation
• Wash hands more often, use disinfectant
– (W) Wearing a mask
– (P) Pharmaceutical intervention
• Seeking clinical advice or using medicines or vaccines to prevent disease
– (S) Self reported diagnosis
• User reports that they have the flu
Jones, J. and Salathe, M. (2009), “Early assessment of anxiety and behavioral response to novel swine-origin influenza A(H1N1)”, PLoS ONE, 4(12):e8032.
Collier, N. (2009), “OMG U got flu? Analysis of shared health messages for bio-surveillance”, in Proc. 4th Symposium on Semantic Mining in Biomedicine (SMBM’10).
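To make the five categories concrete, here is a deliberately naive keyword tagger. It is only an illustration of the label scheme: the keyword lists are hypothetical, and simple substring matching is not the classification method used in the cited work, which the slide does not describe.

```python
# Hypothetical keyword lists, one per category code from the scheme above.
KEYWORDS = {
    "A": ("avoid", "stay away", "staying away"),            # avoiding behaviour
    "I": ("wash my hands", "washing my hands",
          "hand sanitizer", "disinfectant"),                # increased sanitation
    "W": ("wear a mask", "wearing a mask", "face mask"),    # wearing a mask
    "P": ("vaccine", "flu shot", "saw a doctor"),           # pharmaceutical intervention
    "S": ("i have the flu", "i've got the flu",
          "i got the flu"),                                 # self reported diagnosis
}


def tag_message(text):
    """Return the category codes whose keywords occur in the message.

    A message can receive several codes, or none. Codes are returned in
    the order A, I, W, P, S.
    """
    lowered = text.lower()
    return [code for code, words in KEYWORDS.items()
            if any(word in lowered for word in words)]


# Hypothetical messages, written for this example.
examples = [
    "Got my flu shot today and washing my hands constantly",
    "I have the flu, staying home",
    "Nice weather in Vancouver",
]
tags = [tag_message(message) for message in examples]
```

A rule list like this misses negation, sarcasm and retweeted news, which is one reason a trained classifier evaluated against a gold standard is usually preferred.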
Anxiety indicators have moderately strong correlation with CDC A(H1N1) lab data 2009-2010
[Figure: weekly time series for CDC, A, S, I, P, A+I+P and A+I+P+S over weeks 46-52 and 1-5, plotted against two vertical axes (0-3000 and 0-450).]
Category    Spearman’s Rho   P-value
A           0.66             0.020
S           0.66             0.021
I           0.58             0.048
P           0.67             0.017
A+I+P       0.68             0.008
A+I+P+S     0.67             0.017
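Spearman’s rho, the statistic reported in the table, is the Pearson correlation of the two series after each has been converted to ranks. The sketch below computes it in plain Python with tied values given their average rank. The two weekly series are made-up numbers for illustration, not the study data, and the P-value calculation is omitted.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither series is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation between two equal-length series."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 12-week series: lab-confirmed cases and tagged tweet counts.
lab_cases = [2600, 2100, 1700, 1300, 900, 700, 500, 450, 400, 300, 280, 250]
tweet_counts = [400, 310, 330, 250, 180, 200, 120, 90, 110, 70, 60, 65]
rho = spearman_rho(lab_cases, tweet_counts)
```

With only 12 weekly points, as on the slide’s time axis, a rho of this kind carries wide uncertainty, which is why the P-values are reported alongside it.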
DIZIE: Text mining from personal health reports on Twitter
Syndromic surveillance for gastrointestinal, respiratory, neurological, dermatological, haemorrhagic and musculoskeletal syndromes from Tweets in 40 world cities.
Significance and connections
• PH analysis is a highly skilled human task made easier by text mining from open sources
• Value in transparent evaluation of core technologies using gold standards
– Good understanding now of intrinsic components
– More extrinsic evaluations needed to broaden uptake among PH community
– Community discussion needed on utility of evaluation strategies.
• Power of integrating sources needs to be explored
[Figure: heat map showing lowest ranked countries by number of reports per ’000 population gathered by BioCaster.]
Special thanks
• Funding
– Japan Science and Technology Agency’s SAKIGAKE fund
– JSPS Young Researcher type A fund
• Postdoctoral students:
– Son Doan, PhD., Mike Conway, PhD. (now at UCSD), Reiko Goodwin, PhD. (Fordham U.), Ai Kawazoe, PhD. (now at Tsuda U.)
• Ph.D. students
– John McCrae, PhD. (now at Bielefeld U.), Hutchatai Chanlekha, PhD. (now at Kasetsart U.)
• Intern students
– Wita Ratsameetip (Chulalongkorn University, Thailand)
– Nguyen Trurong Son (Vietnam National University, Ho Chi Minh City, Vietnam)
– Nguyen Thi Ngoc Mai (Vietnam National University, Ho Chi Minh City, Vietnam)
– Aurelie Chabord (ENSIMAG-Grenoble INP, France)
– Therawat Tooumnauy (Kasetsart University, Thailand)
– Nam Xuan Cao (Vietnam National University, Ho Chi Minh City, Vietnam)
– Hoang Cong Duy Vu (Vietnam National University, Ho Chi Minh City, Vietnam)
– Nghiem Quoc Minh (Vietnam National University, Ho Chi Minh City, Vietnam)
– Van Chi Nam (Vietnam National University, Ho Chi Minh City, Vietnam)
– Nguyen Thi Hong Nhung (Vietnam National University, Ho Chi Minh City, Vietnam)
– Pham Thao Thi Xuan (Vietnam National University, Ho Chi Minh City, Vietnam)
– Ngo Quoc Hung (Vietnam National University, Ho Chi Minh City, Vietnam)
– Tran Tri Quoc (Vietnam National University, Ho Chi Minh City, Vietnam)