D-sieve : A Novel Data Processing Engine for Crises Related Social Messages

Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracy of these algorithms depends largely on the number of labeled samples provided during the learning phase. Other factors, such as the class distribution and the term distribution within the training set, also play an important role in the classifier's accuracy. However, for several reasons (money and time constraints, a limited number of skilled labelers, etc.), a large sample of labeled messages is often not immediately available for learning an efficient classification model. Consequently, a classifier built on a poor model often misclassifies data, and hence the applicability of such learning techniques (especially in the online setting) during ongoing crisis response remains limited.
In this paper, we propose a post-classification processing step that leverages two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy of a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013) and compared our results against those produced by a "best-in-class" baseline online classifier. By showing consistently better results than the baseline algorithm, i.e., by correctly reclassifying the data points misclassified in the prior step (false negatives to true positives and false positives to true negatives), we demonstrate the applicability of our approach in practice.
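
The effect described in the abstract, moving false negatives to true positives and false positives to true negatives, can be illustrated with a small worked example. This is only a sketch: the counts are hypothetical and the helper names (`precision`, `recall`, `apply_corrections`) are assumptions made for illustration; none of them come from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truly positive items that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def apply_corrections(counts: dict, fn_fixed: int, fp_fixed: int) -> dict:
    """Move fn_fixed items from FN to TP and fp_fixed items from FP to TN."""
    if fn_fixed > counts["fn"] or fp_fixed > counts["fp"]:
        raise ValueError("cannot correct more errors than exist")
    return {
        "tp": counts["tp"] + fn_fixed,
        "fn": counts["fn"] - fn_fixed,
        "fp": counts["fp"] - fp_fixed,
        "tn": counts["tn"] + fp_fixed,
    }


# Hypothetical confusion counts for a baseline classifier (not from the paper).
baseline = {"tp": 80, "fn": 20, "fp": 5, "tn": 95}

# Hypothetical post-classification step that fixes 10 FNs and 2 FPs.
corrected = apply_corrections(baseline, fn_fixed=10, fp_fixed=2)

for name, c in (("baseline", baseline), ("corrected", corrected)):
    print(name,
          f"precision={precision(c['tp'], c['fp']):.2f}",
          f"recall={recall(c['tp'], c['fn']):.2f}")
```

Because each corrected false negative adds to TP and removes from FN, recall rises directly with the number of recovered messages, while corrected false positives raise precision.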

Published in: Education

  1. 1. D-Sieve: A Novel Data Processing Engine for Crises Related Social Messages. Soudip Roy Chowdhury (INRIA & UPSUD, France), Hemant Purohit (Wright State Uni., USA), Muhammad Imran (QCRI, Doha). SWDM 2015, May 18, 2015
  2. 2. Social Media During Crises. Courtesy: http://worldmap.harvard.edu/maps/nepalquake
  3. 3. And it contains… Timely Information
  4. 4. And it contains… Actionable Information
  5. 5. And it contains… Tactical Information
  6. 6. Social Data Processing • Real-time analysis of crisis-related social data is necessary – Manual analysis • Standby task force [N. Morrow et al. 2011] (not scalable) – System-mediated analysis • Unsupervised learning [T. Sakaki et al. 2010] (noisy / ambiguous data) • Supervised learning [S. Roy Chowdhury et al. 2013] (scarce training data / non-uniform class distribution)
  7. 7. D-Sieve • Post-classification data processing engine in a supervised classification setting [Diagram: Raw Data → Supervised Classifier → D-Sieve → Correctly Classified Data; class labels shown: Noise, ClassA, ClassB]
  8. 8. Data Model • A message contains TEXT Date
  9. 9. Data Model • A message contains TEXT and Date • TEXT comprises hashtags, content, and URL
  10. 10. Our Approach • D-Sieve extracts two content features to classify messages correctly – Stable hashtag association – Stable named entity association
  11. 11. Stable Hashtag Association • Relative frequencies of stable hashtags for an event remain constant over time.
  12. 12. Stable Hashtag Association • A hashtag is stable if the moving average score of its frequency distribution (after k posts) becomes steady [X.S. Yang et al. 2013] • The stable hashtag set ($f$) contains all stable hashtags for an incident • Stable hashtag association metric for an input message $H_i$: $V_H = |(f_{TP} \setminus f_{TN}) \cap H_i| - |(f_{TN} \setminus f_{TP}) \cap H_i|$ • $f_{TP}$ and $f_{TN}$ are the sets of stable hashtags for the true positive and true negative classes
  13. 13. Stable Named Entity Association • Named entities are extracted from the message content + URL content • Stable named entity association metric for an input message $I_E$: $V_E = |(E_{TP} \setminus E_{TN}) \cap I_E| - |(E_{TN} \setminus E_{TP}) \cap I_E|$ • We calculate the final feature association metric: $S_C = W_H \cdot V_H + W_E \cdot V_E$ • Feature weights ($W_H$, $W_E$) are chosen such that $S_C$ ranges over [-1, 1] • As a rule of thumb, the weights are set to 0.5
  14. 14. Functional Architecture
  15. 15. In our experiments • Dataset: CrisisLexT6 [A. Olteanu et al. 2014]: Hurricane Sandy (10k), Queensland Flood (10k) • Classifier: Random Forest [A. Liaw et al. 2002] • UB = 0.8, LB = 0.7 • 50% of the data used during model creation and the rest used for testing
  16. 16. A Few Observations • The stable hashtag set contained – 24 and 21 tags for the TP class of the Sandy and Queensland datasets, respectively – 0 and 7 tags for the TN class of the Sandy and Queensland datasets, respectively – 68% and 70% of the whole data for the Sandy and Queensland datasets satisfy the UB and LB constraints
  17. 17. Experimental Results • Sandy 2012 (precision / recall, %): OC 96.09 / 81.42; OC+HT 96.23 / 84.51; OC+NE 96.31 / 86.49; OC++ 96.35 / 87.55 • Queensland 2013 (precision / recall, %): OC 99.53 / 36.75; OC+HT 99.8 / 58.47; OC+NE 99.75 / 68.9; OC++ 99.8 / 74.24 • D-Sieve improves recall substantially (roughly 6 and 37 percentage points for the two datasets, comparing OC++ with OC).
  18. 18. Future Work • Investigate the efficiency of D-Sieve in both online and offline classification settings • Address the latency issue for feature extraction (URL content crawling and entity spotting) using parallel computing • Extend our experiments with Facebook and G+ data
  19. 19. References • A. Liaw and M. Wiener. “Classification and regression by randomForest”. R News, 2(3):18–22, 2002. • A. Olteanu, C. Castillo, F. Diaz, and S. Vieweg. “CrisisLex: A lexicon for collecting and filtering microblogged communications in crises”. In ICWSM, 2014. • N. Morrow, N. Mock, A. Papendieck, and N. Kocmich. “Independent evaluation of the Ushahidi Haiti project”. Development Information Systems International, 2011. • S. Roy Chowdhury, M. Imran, M. R. Asghar, S. Amer-Yahia, and C. Castillo. “Tweet4act: Using incident-specific profiles for classifying crisis-related messages”. In ISCRAM, 2013. • T. Sakaki, M. Okazaki, and Y. Matsuo. “Earthquake shakes Twitter users: Real-time event detection by social sensors”. In WWW, 2010. • X. S. Yang, D. W. Cheung, L. Mo, R. Cheng, and B. Kao. “On incentive-based tagging”. In ICDE, 2013.
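
The association metrics on slides 12 and 13 can be sketched with plain set operations. This is a minimal illustration, not the authors' implementation: the function names, the example hashtags, and the example entities are all hypothetical. The slides do not say how the raw counts are normalized, so this sketch returns unnormalized counts, and the combined score stays within [-1, 1] only for small inputs like the ones shown.

```python
def association(tp_items: set, tn_items: set, message_items: set) -> int:
    """Count message items found only in the TP stable set, minus those
    found only in the TN stable set (the form of V_H and V_E on the slides)."""
    only_tp = tp_items - tn_items  # stable for the true positive class only
    only_tn = tn_items - tp_items  # stable for the true negative class only
    return len(only_tp & message_items) - len(only_tn & message_items)


def combined_score(v_h: float, v_e: float,
                   w_h: float = 0.5, w_e: float = 0.5) -> float:
    """S_C = W_H * V_H + W_E * V_E, with the rule-of-thumb weights of 0.5."""
    return w_h * v_h + w_e * v_e


# Hypothetical stable hashtag sets for the two classes (not from the paper).
f_tp = {"sandy", "hurricane", "nyc"}
f_tn = {"nyc", "halloween"}

# Hypothetical stable named entity sets (not from the paper).
e_tp = {"new jersey", "fema"}
e_tn = {"broadway"}

# Hypothetical input message: its hashtags and its extracted named entities.
message_hashtags = {"sandy", "nyc"}
message_entities = {"fema"}

v_h = association(f_tp, f_tn, message_hashtags)  # "nyc" is in both sets, so it is ignored
v_e = association(e_tp, e_tn, message_entities)
s_c = combined_score(v_h, v_e)
print(v_h, v_e, s_c)
```

Items that are stable for both classes cancel out through the set differences, so only class-specific hashtags and entities move the score toward the positive or the negative class.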
