Successfully reported this slideshow.

D-sieve : A Novel Data Processing Engine for Crises Related Social Messages

0

Share

1 of 20
1 of 20

D-sieve : A Novel Data Processing Engine for Crises Related Social Messages

0

Share

Download to read offline

Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of labeled samples that are provided during the learning phase. Other factors such as class distribution, term distribution among the training set also play an important role on classifier's accuracy. However, due to several reasons (money / time constraints, limited number of skilled labelers etc.), a large sample of labeled messages is often not available immediately for learning an efficient classification model. Consequently, classifier trained on a poor model often mis-classifies data and hence, the applicability of such learning techniques (especially for the online setting) during ongoing crisis response remains limited.
In this paper, we propose a post-classification processing step leveraging upon two additional content features- stable hashtag association and stable named entity association, to improve the classification accuracy for a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013), and compared our results against the results produced by a ``best-in-class'' baseline online classifier. By showing the consistent better quality results than the baseline algorithm i.e., by correctly classifying the misclassified data points from the prior step (false negative and false positive to true positive and true negative classes, respectively), we demonstrate the applicability of our approach in practice.

Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of labeled samples that are provided during the learning phase. Other factors such as class distribution, term distribution among the training set also play an important role on classifier's accuracy. However, due to several reasons (money / time constraints, limited number of skilled labelers etc.), a large sample of labeled messages is often not available immediately for learning an efficient classification model. Consequently, classifier trained on a poor model often mis-classifies data and hence, the applicability of such learning techniques (especially for the online setting) during ongoing crisis response remains limited.
In this paper, we propose a post-classification processing step leveraging upon two additional content features- stable hashtag association and stable named entity association, to improve the classification accuracy for a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013), and compared our results against the results produced by a ``best-in-class'' baseline online classifier. By showing the consistent better quality results than the baseline algorithm i.e., by correctly classifying the misclassified data points from the prior step (false negative and false positive to true positive and true negative classes, respectively), we demonstrate the applicability of our approach in practice.

More Related Content

Similar to D-sieve : A Novel Data Processing Engine for Crises Related Social Messages

Related Books

Free with a 14 day trial from Scribd

See all

Related Audiobooks

Free with a 14 day trial from Scribd

See all

D-sieve : A Novel Data Processing Engine for Crises Related Social Messages

  1. 1. D-Sieve: A Novel Data Processing Engine for Crises Related Social Messages Soudip Roy Chowdhury (INRIA & UPSUD, France) Hemant Purohit (Wright State Uni., USA) Muhammad Imran (QCRI, Doha) SWDM 2015 May18, 2015
  2. 2. Social Media During Crises Courtesy: http://worldmap.harvard.edu/maps/nepalqu ake
  3. 3. And it contains… Timely Information
  4. 4. And it contains… Actionable Information
  5. 5. And it contains… Tactical Information
  6. 6. Social Data Processing • Real-time analysis of crises related social data is necessary – Manual analysis • Standby task force [N. Morrow et al. 2011] – System mediated analysis • Unsupervised learning [T. Sakaki et al. 2010] • Supervised learning [S. Roy Chowdhury et al. 2013] (not scalable) (noisy / ambiguous data) (scarce training data / non-uniform class distribution)
  7. 7. D-Sieve • Post-classification data processing engine in a supervised classification setting • Noise ClassA ClassB • Nois e ClassAClassB Supervised Classifier D-SieveRaw Data Correctly Classified Data
  8. 8. Data Model • A message contains TEXT Date
  9. 9. Data Model • A message contains TEXT Date hashtags content URL
  10. 10. Our Approach • D-Sieve extracts two content features to classify messages correctly – Stable hashtag association – Stable named entity association
  11. 11. Stable Hashtag Association Relative frequencies of stable hashtags for an event remain constant over time….
  12. 12. Stable Hashtag Association • A hashtag is stable, if the moving average score of its frequency distribution (after k posts) becomes steady [X.S. Yang et al. 2013] • Stable hashtag set contains all stable hashtags for an incident • Stable hashtag association metric for an input message • and are set of stable hashtags for true positive and true negative classes (f) VH = | (fTP -fTN )Ç Hi | - | (fTN -fTP )Ç Hi | Hi fTP fTN
  13. 13. Stable Named Entity Association • Named entities are extracted from the message content + URL content • Stable named entity association metric for an input message VE = | (ETP - ETN )Ç IE | - | (ETN - ETP )Ç IE | IE • We calculate the final feature association metrics • Feature weights ( ) are chosen such that ranges [-1,1] • Rule of thumb weights are set as 0.5 SC =WH.VH +WE.VE WH ,WE SC
  14. 14. Functional Architecture
  15. 15. In our experiments.. CrisisLexT6 [A.Oleanu et al. 2014] Hurricane Sandy (10k) , Queensland Flood (10k) Random Forest[A. Liaw et al. 2002] UB = 0.8 LB = 0.7 50% of the data used during model creation and rest used for testing
  16. 16. Few Observations • Stable hashtag set contained – 24 and 21 tags for TP class of Sandy and Queensland dataset respectively – 0 and 7 tags for TN class of Sandy and Queensland dataset respectively – 68% and 70% of the whole data for Sandy and Queensland data satisfy the UB and LB constraints
  17. 17. Experimental Results precision recall OC 96.09 81.42 OC+HT 96.23 84.51 OC+NE 96.31 86.49 OC++ 96.35 87.55 70 75 80 85 90 95 100 #percentage Sandy 2012 precesion recall OC 99.53 36.75 OC+HT 99.8 58.47 OC+NE 99.75 68.9 OC++ 99.8 74.24 0 20 40 60 80 100 120 #percentage Queensland 2013 D-sieve improves recall values substantially (6% and 37% improvements for two datasets).
  18. 18. Future Work • Investigate efficiency of D-sieve both for online and offline classification settings • Address the latency issue for feature extraction (URL content crawling and entitiy spotting) using parallel computing • Extend our experiments with Facebook and G+ data
  19. 19. References • A. Liaw and M. Wiener. Classification and regression by randomforest. R news, 2(3):18–22, 2002. • A. Olteanu, C. Castillo, F. Diaz, and S. Vieweg. “Crisislex: A lexicon for collecting and filtering microblogged communications in crises”. In ICWSM, 2014. • N. Morrow, N. Mock, A. Papendieck, and N. Kocmich. “Independent evaluation of the ushahidi haiti project”. Development Information Systems International, 2011. • S. Roy Chowdhury, M. Imran, M. R. Asghar, S. Amer-Yahia, and C. Castillo. “Tweet4act: Using incident-specific profiles for classifying crisis-related messages”. In ISCRAM 2013. • Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. “Earthquake shakes Twitter users: real-time event detection by social sensors”. In WWW 2010. • X. S. Yang, D. W. Cheung, L. Mo, R. Cheng, and B. Kao. “On incentive-based tagging”. In ICDE 2013.

Editor's Notes

  • We use InfoGain with Ranker algorithm to select top 800 content based features (uni-gram and bi-gram), and used the Ran- dom Forest [7] as our classification algorithm to learn the best-fit models. The AUC score for Sandy is 78% and 83% for Queensland.

  • ×