D-sieve : A Novel Data Processing Engine for Crises Related Social Messages

  1. D-Sieve: A Novel Data Processing Engine for Crises Related Social Messages • Soudip Roy Chowdhury (INRIA & UPSUD, France), Hemant Purohit (Wright State Uni., USA), Muhammad Imran (QCRI, Doha) • SWDM 2015, May 18, 2015
  2. Social Media During Crises Courtesy: ake
  3. And it contains… Timely Information
  4. And it contains… Actionable Information
  5. And it contains… Tactical Information
  6. Social Data Processing • Real-time analysis of crises related social data is necessary – Manual analysis (not scalable) • Standby task force [N. Morrow et al. 2011] – System mediated analysis • Unsupervised learning [T. Sakaki et al. 2010] (noisy / ambiguous data) • Supervised learning [S. Roy Chowdhury et al. 2013] (scarce training data / non-uniform class distribution)
  7. D-Sieve • Post-classification data processing engine in a supervised classification setting • [Figure: Raw Data (Noise, Class A, Class B) → Supervised Classifier → D-Sieve → Correctly Classified Data]
  8. Data Model • A message contains TEXT and Date
  9. Data Model • A message contains TEXT (hashtags, content, URL) and Date
  10. Our Approach • D-Sieve extracts two content features to classify messages correctly – Stable hashtag association – Stable named entity association
  11. Stable Hashtag Association Relative frequencies of stable hashtags for an event remain constant over time….
  12. Stable Hashtag Association • A hashtag is stable if the moving average score of its frequency distribution (after k posts) becomes steady [X.S. Yang et al. 2013] • The stable hashtag set ($f$) contains all stable hashtags for an incident • Stable hashtag association metric for an input message with hashtag set $H_i$: $V_H = |(f_{TP} - f_{TN}) \cap H_i| - |(f_{TN} - f_{TP}) \cap H_i|$ • $f_{TP}$ and $f_{TN}$ are the sets of stable hashtags for the true positive and true negative classes
  13. Stable Named Entity Association • Named entities are extracted from the message content + URL content • Stable named entity association metric for an input message with named entity set $I_E$: $V_E = |(E_{TP} - E_{TN}) \cap I_E| - |(E_{TN} - E_{TP}) \cap I_E|$ • We calculate the final feature association metric: $S_C = W_H \cdot V_H + W_E \cdot V_E$ • Feature weights ($W_H$, $W_E$) are chosen such that $S_C$ ranges over $[-1, 1]$ • As a rule of thumb, the weights are set to 0.5
  14. Functional Architecture
  15. In our experiments… • CrisisLexT6 [A. Olteanu et al. 2014]: Hurricane Sandy (10k), Queensland Flood (10k) • Random Forest [A. Liaw et al. 2002] • UB = 0.8, LB = 0.7 • 50% of the data was used for model creation and the rest for testing
  16. A Few Observations • The stable hashtag set contained – 24 and 21 tags for the TP class of the Sandy and Queensland datasets, respectively – 0 and 7 tags for the TN class of the Sandy and Queensland datasets, respectively • 68% and 70% of the whole data for the Sandy and Queensland datasets, respectively, satisfy the UB and LB constraints
  17. Experimental Results
      Sandy 2012 (%)
      Setting   Precision   Recall
      OC        96.09       81.42
      OC+HT     96.23       84.51
      OC+NE     96.31       86.49
      OC++      96.35       87.55

      Queensland 2013 (%)
      Setting   Precision   Recall
      OC        99.53       36.75
      OC+HT     99.8        58.47
      OC+NE     99.75       68.9
      OC++      99.8        74.24

      D-Sieve improves recall substantially: by about 6 and 37 percentage points (OC vs. OC++) on the Sandy and Queensland datasets, respectively.
  18. Future Work • Investigate the efficiency of D-Sieve in both online and offline classification settings • Address the latency of feature extraction (URL content crawling and entity spotting) using parallel computing • Extend our experiments with Facebook and G+ data
  19. References • A. Liaw and M. Wiener. “Classification and regression by randomForest”. R News, 2(3):18–22, 2002. • N. Morrow, N. Mock, A. Papendieck, and N. Kocmich. “Independent evaluation of the Ushahidi Haiti project”. Development Information Systems International, 2011. • A. Olteanu, C. Castillo, F. Diaz, and S. Vieweg. “CrisisLex: A lexicon for collecting and filtering microblogged communications in crises”. In ICWSM, 2014. • S. Roy Chowdhury, M. Imran, M. R. Asghar, S. Amer-Yahia, and C. Castillo. “Tweet4act: Using incident-specific profiles for classifying crisis-related messages”. In ISCRAM, 2013. • T. Sakaki, M. Okazaki, and Y. Matsuo. “Earthquake shakes Twitter users: real-time event detection by social sensors”. In WWW, 2010. • X. S. Yang, D. W. Cheung, L. Mo, R. Cheng, and B. Kao. “On incentive-based tagging”. In ICDE, 2013.
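
The stability test and the association metrics from slides 12 and 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the moving-average window, and the tolerance `tol` are hypothetical, and the stability rule used here (two consecutive moving averages of a tag's relative frequency differing by less than `tol`) is a simplified stand-in for the criterion cited from Yang et al. Only the set expressions for $V_H$ and $V_E$, the weighted sum $S_C$, and the 0.5 weights come from the slides.

```python
from collections import Counter


def relative_frequency_series(posts, tag):
    """Relative frequency of `tag` among all hashtags seen after each post.

    `posts` is a sequence of hashtag lists, one list per message.
    """
    counts = Counter()
    total = 0
    series = []
    for hashtags in posts:
        for h in hashtags:
            counts[h] += 1
            total += 1
        series.append(counts[tag] / total if total else 0.0)
    return series


def is_stable(series, window=3, tol=0.05):
    """Assumed criterion (hypothetical): the two most recent moving
    averages over `window` points differ by less than `tol`."""
    if len(series) < window + 1:
        return False
    last = sum(series[-window:]) / window
    prev = sum(series[-window - 1:-1]) / window
    return abs(last - prev) < tol


def association(pos_set, neg_set, observed):
    """|(pos - neg) ∩ observed| - |(neg - pos) ∩ observed|.

    Used for V_H (stable hashtags) and V_E (stable named entities).
    """
    return len((pos_set - neg_set) & observed) - len((neg_set - pos_set) & observed)


def final_score(v_h, v_e, w_h=0.5, w_e=0.5):
    """S_C = W_H * V_H + W_E * V_E, with the rule-of-thumb weights of 0.5."""
    return w_h * v_h + w_e * v_e


# Made-up example sets, for illustration only.
f_tp = {"#sandy", "#nyc"}
f_tn = {"#nyc", "#halloween"}
e_tp = {"FEMA", "New Jersey"}
e_tn = {"Broadway"}

v_h = association(f_tp, f_tn, {"#sandy", "#nyc"})  # hashtags of one message
v_e = association(e_tp, e_tn, {"FEMA"})            # entities of one message
print(v_h, v_e, final_score(v_h, v_e))
```

Tags that occur in both class sets (here `#nyc`) cancel out of both set differences, so only class-specific tags and entities move the score.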

Editor's Notes

  1. We used InfoGain with the Ranker algorithm to select the top 800 content-based features (unigrams and bigrams), and used Random Forest [A. Liaw et al. 2002] as our classification algorithm to learn the best-fit models. The AUC score is 78% for Sandy and 83% for Queensland.
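
The feature-selection step in the note above can be illustrated with a plain-Python information-gain ranking. This is a minimal sketch under assumptions: the note does not name the toolkit, so the helper names (`ngrams`, `info_gain`, `top_features`), the whitespace tokenizer, and the binary "term occurs in message" feature encoding are all hypothetical. The Random Forest training step is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def ngrams(text, n_max=2):
    """Set of unigrams and bigrams from a whitespace-tokenized message."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def info_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the message'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_term, without) if part
    )
    return entropy(labels) - conditional


def top_features(texts, labels, k=800):
    """Rank all n-grams by information gain and keep the top k.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    docs = [ngrams(t) for t in texts]
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-info_gain(docs, labels, t), t))
    return ranked[:k]


# Made-up toy data: 1 = crisis-related, 0 = not related.
texts = ["flood water rising", "flood rescue needed",
         "nice sunny day", "sunny beach day"]
labels = [1, 1, 0, 0]
print(top_features(texts, labels, k=3))
```

On real data, `k=800` would match the number of features mentioned in the note; the toy call uses `k=3` only to keep the output short.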