After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information

The International Conference on Emerging Databases (EDB)
EDB2016 Runner-up Paper Award

Published in: Science

  1. 1. After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information
     Shoko Wakamiya (1), Yukiko Kawai (2), Eiji Aramaki (1)
     (1) Nara Institute of Science and Technology, Japan
     (2) Kyoto Sangyo University, Japan
     Oct. 18, 2016
  2. 2. Exploiting Tweeting Users as Social Sensors [Sakaki2010, Lee2011, Aramaki2011]
     • Various real-world phenomena can be observed, e.g., disasters, local events, and infectious diseases
     • Social sensing is expected to outperform traditional means of medical reporting
     [Figure: a target event observed by physical sensors and social sensors; previous social-sensor-based work uses direct information, the proposed approach uses direct and indirect information]
     Sakaki et al.: Earthquake Shakes Twitter Users. WWW (2010)
     Lee, Wakamiya, Sumiya: Discovery of Unusual Regional Social Activities using Geo-tagged Microblogs. World Wide Web Special Issue on Mobile Services on the Web (2011)
     Aramaki et al.: Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter. EMNLP (2011)
  3. 3. Related Work on Twitter-based Influenza Surveillance
     Study             Target (# of areas)   Data size (million tweets)
     Aramaki [16]      Japan (1 area)        300
     Achrekar [27]     US (10 areas)         1.9 *
     Culotta [28]      US (1 area)           0.5
     Kanouch [29]      Japan (1 area)        300
     De Quincy [30]    Europe (1 area)       0.14
     Doan [31]         US (1 area)           24 *
     Szomszor [32]     Europe (1 area)       3
     • Many Twitter-based disease detection/prediction systems have been developed
     • Most of these systems perform only low-resolution (country-level) geographic analysis
  4. 4. Problem (1): Imbalance of Social Sensor Distribution
     • Most social sensors are in large cities (Tokyo, Osaka, etc.)
     • Other cities suffer from a shortage of data
     [Figure: Geographic distribution of influenza-related tweets in Japan; labeled locations: Sapporo (Hokkaido), Tokyo]
  5. 5. Problem (2): Gap between Social Sensors and Patients
     • Apart from a few highly populated prefectures, most areas have few tweets
     • Some of these areas nevertheless have numerous influenza patients
     [Figure: Relation between the numbers of influenza-related tweets and patients in each prefecture. Horizontal axis: prefectures (areas), from TOKYO (AREA13) to SHIMANE (AREA32). Vertical axes: # of tweets (0 to 4000) and # of patients (0 to 300000)]
  7. 7. Exploiting Indirect Info.
     Pro) Covering wider areas
     Con)
     • Unreliability (too noisy or too old)
       (1) "My grandma in Kyoto is in bed with flu"
       (2) "NEWS: classes in Osaka have been closed because of the flu"
     • Complex pattern (When? Already spread)
     [Figure: a target event observed by physical sensors and social sensors; existing work uses direct information, the proposed approach uses direct and indirect information]
     [Figure: the amount of tweets containing direct info., the amount of tweets containing indirect info., and the amount of patients]
  8. 8. Our Goal & Approach
     To estimate the number of patients in each area based on the relation between human motivation to tweet and information propagation
     • h1) People prefer reporting new info. and are insensitive to already-propagated info.
     • h2) The degree of propagation (popularity) is correlated with the amount of indirect info.
     [Figure: the amount of tweets containing direct info., the amount of tweets containing indirect info., and the amount of patients]
     [Figure: positive, negative, and trapped sensors with direct and indirect information, (a) before epidemics and (b) after epidemics]
  9. 9. Outline
     • Background
     • Goal and approach
     • Construction of Twitter-based Influenza Surveillance System
     • Experimental evaluation
     • Discussion
     • Conclusions
  10. 10. Twitter-based Influenza Surveillance
      1. NLP-based Classification: patient (positive) or not (negative)
      2. Location Detection: direct info. or indirect info.
      3. Data Aggregation: LINEAR model or TRAP model
      [Figure: system pipeline. Tweets enter the NLP MODULE (P/N classifier: positive or negative), then the LOCATION DETECTION MODULE (checks in turn whether GPS info., profile info., or indirect info. is available; tweets with none go to trash), then the AGGREGATION MODULE (LINEAR model, TRAP model), which outputs the # of flu patients. GPS and profile info. are direct information; location names in tweet text are indirect information]
  11. 11. 1. NLP-based Classification
      To judge whether a given tweet is written by a patient or not
      • Building the training set
        A human annotator assigned one of two labels (positive/negative) to 1,000 influenza-related tweets
        Ex) “My mother got flu today”: positive
            “I got influenza shot today”: negative
      • Classifying the test set
        • SVM-based classifier
        • Bag-of-words representation
        • Polynomial kernel (d=2)
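
As a rough illustration of the representation named on this slide, the sketch below builds bag-of-words count vectors and evaluates a degree-2 polynomial kernel between the two example tweets. The tokenizer (lowercase whitespace split), the kernel form (x·y + 1)^d, and the function names `bag_of_words` and `poly_kernel` are assumptions made for illustration; the slides do not specify them, and an actual SVM would still have to be trained on top of such kernel values.

```python
from collections import Counter


def bag_of_words(text):
    """Map a tweet to word counts (assumed tokenizer: lowercase whitespace split)."""
    return Counter(text.lower().split())


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree on sparse count vectors.

    coef0=1.0 is an assumed constant; the slide only states d=2.
    """
    dot = sum(count * y[word] for word, count in x.items() if word in y)
    return (dot + coef0) ** degree


positive = bag_of_words("My mother got flu today")
negative = bag_of_words("I got influenza shot today")

# The two tweets share the words "got" and "today", so their dot product is 2.
similarity = poly_kernel(positive, negative)
```

The degree-2 kernel implicitly accounts for word pairs as well as single words, which is one plausible reason for choosing it over a linear kernel on short texts.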
  12. 12. 2. Location Detection
      To cover wider areas by extracting indirect info. as well as direct info.
      • Direct info.
        • GPS info. (GPS)
        • Profile info. (PROF)
      • Indirect info. (IND)
        Location names in tweets’ contents, extracted using a list of prefecture names and famous landmarks
        Ex) “My friend in Osaka caught flu”
      [Figure: Percentage of tweets with direct/indirect info. (7,666,201 tweets): GPS 0.5%, PROF 26.2%, IND 4.7%, the remainder with no location info.]
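
The location detection step can be sketched as a simple cascade over the three sources. Everything below is an assumption for illustration: the tiny `GAZETTEER`, the tweet record layout (a dict with optional `gps_area`, `profile`, and `text` keys), the substring matching, and the GPS, then PROF, then IND priority order. The slides only state that a list of prefecture names and famous landmarks was used.

```python
# Hypothetical mini gazetteer; the real list of prefecture names and
# landmarks is not given in the slides.
GAZETTEER = {
    "osaka": "OSAKA",
    "kyoto": "KYOTO",
    "tokyo": "TOKYO",
    "tokyo tower": "TOKYO",
}


def find_area(text):
    """Return the first gazetteer area mentioned in text, or None."""
    lowered = text.lower()
    for name, area in GAZETTEER.items():
        if name in lowered:
            return area
    return None


def detect_location(tweet):
    """Return (source, area), where source is "GPS", "PROF", "IND", or None.

    The priority order GPS > PROF > IND is an assumption.
    """
    if tweet.get("gps_area"):
        return "GPS", tweet["gps_area"]
    area = find_area(tweet.get("profile") or "")
    if area:
        return "PROF", area
    area = find_area(tweet.get("text") or "")
    if area:
        return "IND", area
    return None, None


example = detect_location({"text": "My friend in Osaka caught flu"})
```

Here `example` is an indirect (IND) detection for Osaka, since the tweet carries no GPS or profile location.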
  13. 13. 3. Data Aggregation
      To estimate the number of patients using two types of info.: direct info. and indirect info.
      i) LINEAR model: a simple model that sums up direct info. and indirect info.
      The number of patients I_LINEAR(a, t) in area a at day t:
      \[ I_{\mathrm{LINEAR}}(a,t) = w_{\mathrm{GPS}} \cdot \mathit{GPS}(a,t) + w_{\mathrm{PROF}} \cdot \mathit{PROF}(a,t) + w_{\mathrm{IND}} \sum_{b \in A} \mathit{IND}(a,b,t) \]
      ii) TRAP model: a model based on the LINEAR model and people's tendency to tweet
      The number of patients I_TRAP(a, t) in area a at day t:
      \[ I_{\mathrm{TRAP}}(a,t) = \frac{I_{\mathrm{LINEAR}}(a,t)}{w_{\mathrm{USERS}} \cdot N_a - w_{\mathrm{TRAP}} \cdot \log(\mathit{pop}(a,t) + 1)}, \qquad \mathit{pop}(a,t) = \sum_{c=1}^{t} \mathit{IND}(a,c) \]
  14. 14. Concept of TRAP Model
      1. People prefer a new event and are insensitive to an already propagated event
      2. The degree of propagation (popularity) is correlated with the amount of indirect info.
      [Figure: positive, negative, and trapped sensors with direct and indirect information]
      (a) Before epidemics: people actively report the flu
      (b) After epidemics: most people lose interest in sharing direct info.
  15. 15. 3. Data Aggregation
      To estimate the number of patients using two types of info.: direct info. and indirect info.
      i) LINEAR model: a simple model that sums up direct info. and indirect info.
      The number of patients I_LINEAR(a, t) in area a at day t:
      \[ I_{\mathrm{LINEAR}}(a,t) = w_{\mathrm{GPS}} \cdot \mathit{GPS}(a,t) + w_{\mathrm{PROF}} \cdot \mathit{PROF}(a,t) + w_{\mathrm{IND}} \sum_{b \in A} \mathit{IND}(a,b,t) \]
      ii) TRAP model: a model based on the LINEAR model and people's tendency to tweet
      The number of patients I_TRAP(a, t) in area a at day t:
      \[ I_{\mathrm{TRAP}}(a,t) = \frac{I_{\mathrm{LINEAR}}(a,t)}{w_{\mathrm{USERS}} \cdot N_a - w_{\mathrm{TRAP}} \cdot \log(\mathit{pop}(a,t) + 1)}, \qquad \mathit{pop}(a,t) = \sum_{c=1}^{t} \mathit{IND}(a,c) \]
      • pop(a, t): the degree of information propagation in area a during t days
      • w_TRAP · log(pop(a, t) + 1): the amount of trapped sensors
      • N_a: the amount of social sensors in area a
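
A minimal sketch of the two aggregation models, assuming the TRAP estimate is the LINEAR estimate divided by the number of still-active sensors (social sensors minus trapped sensors). The function names, the unit default weights for the LINEAR model, the natural logarithm, and the guard against a non-positive denominator are assumptions; the defaults 0.05 and 0.2 are the values shown on the Experiments slide.

```python
import math


def linear_estimate(gps, prof, ind_from_areas, w_gps=1.0, w_prof=1.0, w_ind=1.0):
    """I_LINEAR(a, t): weighted sum of GPS, PROF, and summed IND counts.

    ind_from_areas holds IND(a, b, t) for every area b in A.
    """
    return w_gps * gps + w_prof * prof + w_ind * sum(ind_from_areas)


def trap_estimate(i_linear, n_sensors, daily_ind_counts, w_users=0.05, w_trap=0.2):
    """I_TRAP(a, t): I_LINEAR divided by the estimated number of active sensors.

    daily_ind_counts holds IND(a, c) for days c = 1..t; their sum is pop(a, t).
    The natural log and the ValueError guard are assumptions of this sketch.
    """
    pop = sum(daily_ind_counts)
    active = w_users * n_sensors - w_trap * math.log(pop + 1)
    if active <= 0:
        raise ValueError("estimated number of active sensors is not positive")
    return i_linear / active


i_linear = linear_estimate(gps=2, prof=10, ind_from_areas=[3, 1])
before_boom = trap_estimate(i_linear, n_sensors=1000, daily_ind_counts=[0])
after_boom = trap_estimate(i_linear, n_sensors=1000, daily_ind_counts=[50, 50])
```

With the same raw tweet count, `after_boom` is larger than `before_boom`: once indirect information has propagated, fewer sensors are assumed to be reporting, so each observed tweet counts for more.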
  16. 16. Experimental Datasets
      • Tweet data
        • A collection of tweets containing the keyword “I-N-FU-RU”
      • Gold standard data
        • The number of patients per week for every prefecture (47 areas)
        • The data is available from the Infectious Disease Surveillance Center (IDSC)
      Dataset       Duration                 # of tweets (Size)
      ALL           2012/08/02-2016/01/03    7,666,201 (2.275 GB)
      SEASON2012    2012/11/01-2013/05/31    1,959,610 (729.4 MB)
      SEASON2013    2013/11/01-2014/05/31    501,542 (143.7 MB)*
      SEASON2014    2014/11/01-2015/05/31    2,736,685 (808.2 MB)
      [Figure: A sample of the weekly report from IDSC, http://www.nih.go.jp/niid/ja/diseases/a/flu.html]
  17. 17. Experiments
      • Methods: BASELINE, BASELINE+PROF, LINEAR, TRAP
      • Evaluation metric: Pearson correlation coefficient (high: |r|>0.7, medium: 0.4<|r|≤0.7, low: |r|≤0.4)
      Group            Method                             NLP   GPS   PROF   IND
      TRAP             TRAP+NLP                           ✓     ✓     ✓      ✓
                       TRAP                                     ✓     ✓      ✓
      LINEAR           LINEAR+NLP                         ✓     ✓     ✓      ✓
                       LINEAR                                   ✓     ✓      ✓
      BASELINE+PROF    BASELINE+PROF+NLP (EMNLP2011)      ✓     ✓     ✓
                       BASELINE+PROF                            ✓     ✓
      BASELINE         BASELINE+NLP                       ✓     ✓
                       BASELINE                                 ✓
      \[ I_{\mathrm{BASE}}(a,t) = \mathit{GPS}(a,t) \]
      \[ I_{\mathrm{BASE+PROF}}(a,t) = \mathit{GPS}(a,t) + \mathit{PROF}(a,t) \]
      \[ I_{\mathrm{LINEAR}}(a,t) = \mathit{GPS}(a,t) + \mathit{PROF}(a,t) + \sum_{b \in A} \mathit{IND}(a,b,t) \]
      \[ I_{\mathrm{TRAP}}(a,t) = \frac{I_{\mathrm{LINEAR}}(a,t)}{0.05 \cdot N_a - 0.2 \cdot \log(\mathit{pop}(a,t) + 1)} \]
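
The evaluation metric can be sketched in a few lines. This is an illustrative implementation of the Pearson correlation coefficient together with the high/medium/low bands stated on this slide; the function names and the toy weekly series are assumptions, not the authors' evaluation code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_band(r):
    """Band from the slide: high |r|>0.7, medium 0.4<|r|<=0.7, low |r|<=0.4."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "high"
    if magnitude > 0.4:
        return "medium"
    return "low"


# Toy data (made up): estimated vs. reported weekly patient counts.
estimated = [10, 20, 30, 40]
reported = [12, 18, 33, 41]
band = correlation_band(pearson_r(estimated, reported))
```

Note that under these bands a coefficient of exactly 0.70 falls in the medium band, since the high band requires |r| to be strictly greater than 0.7.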
  18. 18. Results (1/3) Contribution of NLP-based Classification
      • TRAP+NLP (r=0.70) is higher than TRAP (r=0.64)
      • NLP classification in this domain (flu) is not hard
      (a) With NLP-based classification
      Method            SEASON2012   SEASON2013   SEASON2014   TOTAL
      All areas
        TRAP+NLP        0.76         0.70         0.69         0.70
        LINEAR+NLP      0.70         0.55         0.53         0.50
        EMNLP2011       0.74         0.68         0.67         0.69
        BASELINE+NLP    0.33         0.37         0.48         0.36
      High population areas (Top 10)
        TRAP+NLP        0.80         0.77         0.72         0.75
        LINEAR+NLP      0.78         0.65         0.64         0.64
        EMNLP2011       0.80         0.77         0.71         0.75
        BASELINE+NLP    0.55         0.60         0.63         0.53
      Low population areas (Top 10)
        TRAP+NLP        0.75         0.66         0.71         0.69
        LINEAR+NLP      0.62         0.46         0.48         0.43
        EMNLP2011       0.70         0.61         0.65         0.64
        BASELINE+NLP    0.21         0.26         0.35         0.25
      (b) Without NLP-based classification
      Method            SEASON2012   SEASON2013   SEASON2014   TOTAL
      All areas
        TRAP            0.72         0.63         0.64         0.64
        LINEAR          0.65         0.48         0.53         0.48
        BASELINE+PROF   0.69         0.59         0.66         0.64
        BASELINE        0.29         0.34         0.48         0.35
      High population areas (Top 10)
        TRAP            0.75         0.69         0.70         0.70
        LINEAR          0.72         0.60         0.63         0.61
        BASELINE+PROF   0.75         0.69         0.70         0.70
        BASELINE        0.44         0.56         0.63         0.50
      Low population areas (Top 10)
        TRAP            0.71         0.61         0.53         0.57
        LINEAR          0.58         0.41         0.46         0.40
        BASELINE+PROF   0.65         0.52         0.65         0.59
        BASELINE        0.20         0.23         0.35         0.25
  19. 19. Results (2/3) Contribution of Indirect Info. in LINEAR Model
      • LINEAR+NLP (r=0.50) is lower than BASELINE+PROF+NLP (r=0.69; listed as EMNLP2011 in the table)
      • It is difficult to detect influenza epidemics by simply adding indirect info. in a naïve manner
      (a) With NLP-based classification
      Method            SEASON2012   SEASON2013   SEASON2014   TOTAL
      All areas
        TRAP+NLP        0.76         0.70         0.69         0.70
        LINEAR+NLP      0.70         0.55         0.53         0.50
        EMNLP2011       0.74         0.68         0.67         0.69
        BASELINE+NLP    0.33         0.37         0.48         0.36
      High population areas (Top 10)
        TRAP+NLP        0.80         0.77         0.72         0.75
        LINEAR+NLP      0.78         0.65         0.64         0.64
        EMNLP2011       0.80         0.77         0.71         0.75
        BASELINE+NLP    0.55         0.60         0.63         0.53
      Low population areas (Top 10)
        TRAP+NLP        0.75         0.66         0.71         0.69
        LINEAR+NLP      0.62         0.46         0.48         0.43
        EMNLP2011       0.70         0.61         0.65         0.64
        BASELINE+NLP    0.21         0.26         0.35         0.25
  20. 20. Results (3/3) Contribution of Indirect Info. in TRAP Model
      • TRAP+NLP achieved the best performance (r=0.70)
      • The TRAP model effectively contributes to the exploitation of both direct and indirect info.
      (a) With NLP-based classification
      Method            SEASON2012   SEASON2013   SEASON2014   TOTAL
      All areas
        TRAP+NLP        0.76         0.70         0.69         0.70
        LINEAR+NLP      0.70         0.55         0.53         0.50
        EMNLP2011       0.74         0.68         0.67         0.69
        BASELINE+NLP    0.33         0.37         0.48         0.36
      High population areas (Top 10)
        TRAP+NLP        0.80         0.77         0.72         0.75
        LINEAR+NLP      0.78         0.65         0.64         0.64
        EMNLP2011       0.80         0.77         0.71         0.75
        BASELINE+NLP    0.55         0.60         0.63         0.53
      Low population areas (Top 10)
        TRAP+NLP        0.75         0.66         0.71         0.69
        LINEAR+NLP      0.62         0.46         0.48         0.43
        EMNLP2011       0.70         0.61         0.65         0.64
        BASELINE+NLP    0.21         0.26         0.35         0.25
  21. 21. Discussion: Relation between Volume of Tweets and Performance (1/2)
      High population areas
      • TRAP+NLP was higher than EMNLP2011
      • The top 17 high population areas exhibited high correlation (r>0.7)
      [Figure: # of tweets (0 to 4000) and correlation coefficient (0.6 to 0.8) of TRAP+NLP and EMNLP2011 per prefecture (area); highlighted: TOKYO (AREA13), OSAKA (AREA27)]
  22. 22. Discussion: Relation between Volume of Tweets and Performance (2/2)
      Other areas
      • Performance varies widely across areas
      • TRAP+NLP mostly outperforms EMNLP2011
      [Figure: # of tweets (0 to 4000) and correlation coefficient (0.6 to 0.8) of TRAP+NLP and EMNLP2011 per prefecture (area); highlighted: FUKUI (AREA20), AOMORI (AREA2)]
  23. 23. Discussion: After the Boom No One Tweets
      • The TRAP model outperformed the LINEAR model
        Once influenza becomes a hot topic, people stop talking about it
      • Similar phenomena have been reported from a psychological viewpoint
        Most studies showed the rapid propagation of rumors (especially bad news) and their short life
      • This study attempts to handle human nature using a statistical model
        This model leaves sufficient room for application in additional studies
  24. 24. Conclusions
      • Twitter-based influenza surveillance
        • Utilized indirect info. that mentions other places to cover wider areas
        • Developed the TRAP model based on information propagation and people’s motivation to tweet
      • Future work
        • To examine worldwide influenza surveillance
        • To establish a novel method that integrates various models for more accurate prediction
        • To consider various effects related to geographic relations among areas
