Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams

While categorizing any type of user-generated content online is a challenging problem, categorizing social media messages during a crisis situation adds an additional layer of complexity, due to the volume and variability of information, and to the fact that these messages must be classified as soon as they arrive. Current approaches involve the use of automatic classification, human classification, or a mixture of both. In these types of approaches, there are several reasons to keep the number of information categories small and updated, which we examine in this article. This means at the onset of a crisis an expert must select a handful of information categories into which information will be categorized. The next step, as the crisis unfolds, is to dynamically change the initial set as new information is posted online. In this paper, we propose an effective way to dynamically extract emerging, potentially interesting, new categories from social media data.

  1. Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams
     Muhammad Imran (@mimran15) and Carlos Castillo (@ChaToX)
     Qatar Computing Research Institute, Doha, Qatar
     SWDM’15 : WWW’15, May 18th, 2015
  2. Information Variability on Social Media
     • Different events present different information categories
     • Even for recurring events, category proportions change
  7. Different Classification Approaches
     • Various classification approaches exist:
       – Manual classification by human experts
       – Automatic classification using unsupervised or supervised approaches (needs training data)
       – Hybrid: automatic + manual
     • Retrospective vs. real-time classification
       – Batch processing (offline, training data availability)
       – Stream processing (real-time, scarce training data)
  8. Real-time Stream Classification (Supervised)
     • Fewer categories are better
       – Decreases worker dropout
       – More training data per category, higher accuracy
       – “7 plus/minus 2” rule [G. A. Miller, 1956]
     • Categories need to be defined carefully
       – Empty categories waste space and workers’ effort
       – Categories that are too large introduce heterogeneity
  9. Problem Statement
     • How can we classify items arriving as a data stream into a small number of categories, if we cannot anticipate exactly which categories will be the most frequent?
     Our research improves crowdsourcing-based and supervised learning-based systems (e.g., AIDR) by finding latent categories in fast data streams.
  10. Our Approach (top-down + bottom-up)
     1. An expert defines information categories (top-down)
     2. Messages are categorized into the initial set plus an extra “Miscellaneous” category
     3. Relevant and prevalent categories are identified from the messages in the “Miscellaneous” category (bottom-up):
        1. Generate candidate categories
        2. Learn characteristics of good categories
        3. Rank categories on those characteristics
     How do we identify relevant categories?
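
A minimal Python sketch of this top-down + bottom-up loop: the classifier, candidate generator, and ranker are passed in as placeholder callables standing for the components on the following slides; none of these names come from the paper or from AIDR.

# Sketch only: the three callables are placeholders for the components described
# on the next slides, not functions from the paper or from AIDR.
def propose_new_categories(messages, initial_categories, classifier,
                           generate_candidates, rank_candidates,
                           misc_label="Miscellaneous"):
    buckets = {c: [] for c in list(initial_categories) + [misc_label]}

    # Top-down: route each message to an expert-defined category, or to Miscellaneous.
    for msg in messages:
        label = classifier(msg)  # returns a category name, or None if nothing fits
        buckets[label if label in buckets else misc_label].append(msg)

    # Bottom-up: mine the Miscellaneous bucket for candidate categories and rank them.
    candidates = generate_candidates(buckets[misc_label])  # e.g. LDA topics (next slide)
    ranked = rank_candidates(candidates)                   # volume, novelty, cohesiveness
    return buckets, ranked                                 # ranked candidates go to an expert
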
  11. Candidate Generation
     We propose to apply Latent Dirichlet Allocation (LDA) on the Miscellaneous category:
     • Input: a set of n documents (all messages in the Misc. category) and a number m (# of topics to be generated)
     • Output: an n x m matrix in which cell (i, j) indicates the extent to which document i corresponds to topic j
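
A minimal sketch of this step using scikit-learn; the example messages, vectorizer settings, and number of topics are illustrative and not taken from the paper.

# Candidate generation with LDA over the Miscellaneous messages (illustrative data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

misc_messages = [
    "power outage reported downtown after the storm",
    "volunteers gathering at the community center tonight",
    "school closures announced for tomorrow morning",
]

# Bag-of-words representation of the n Miscellaneous messages.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(misc_messages)

# Fit LDA with m topics; doc_topic[i, j] is the extent to which document i
# corresponds to topic j (the n x m matrix described on this slide).
m = 2
lda = LatentDirichletAllocation(n_components=m, random_state=0)
doc_topic = lda.fit_transform(X)

# Top words per topic, as a human-readable summary of each candidate category.
terms = vectorizer.get_feature_names_out()
for j, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {j}: {', '.join(top)}")
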
  12. Candidate Evaluation
     To reduce the experts’ workload in deciding which categories to pick, we propose the following criteria:
     • Volume: a category shouldn’t be too small
     • Novelty: a category must not overlap with or be too similar to the existing categories
     • Cohesiveness (intra- and inter-similarity): a category should be cohesive (should have small intra-topic and large inter-topic values)
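
A sketch of how these criteria could be computed; the paper's exact formulas may differ, so this shows only one plausible reading, based on cosine distances over TF-IDF vectors.

# One possible reading of the volume/novelty/cohesiveness criteria (not the paper's formulas).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def topic_scores(topic_msgs, other_topic_msgs, existing_category_msgs):
    """Score one candidate topic against the other candidates and the existing categories."""
    corpus = topic_msgs + other_topic_msgs + existing_category_msgs
    X = TfidfVectorizer(stop_words="english").fit_transform(corpus).toarray()

    n, k = len(topic_msgs), len(other_topic_msgs)
    topic_vecs, other_vecs, existing_vecs = X[:n], X[n:n + k], X[n + k:]
    topic_centroid = topic_vecs.mean(axis=0, keepdims=True)

    # Volume: a category shouldn't be too small.
    volume = n
    # Novelty: distance to the existing expert-defined categories (larger = more novel).
    novelty = float(cosine_distances(topic_centroid, existing_vecs.mean(axis=0, keepdims=True)))
    # Cohesiveness: small intra-topic distance, large inter-topic distance.
    intra = float(cosine_distances(topic_vecs, topic_centroid).mean())
    inter = float(cosine_distances(topic_centroid, other_vecs.mean(axis=0, keepdims=True)))
    return {"volume": volume, "novelty": novelty, "intra": intra, "inter": inter}
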
  13. Experimental Testing
     • We used Twitter data from 17 crises (from the CrisisLexT26 dataset at crisislex.org)
     A. Affected individuals: deaths, injuries, missing, found.
     B. Infrastructure and utilities: buildings, roads, services damage.
     C. Donation and volunteering: needs and requests for food, shelter, supplies.
     D. Caution and advice: warnings issued or lifted, guidance and tips.
     E. Sympathy and emotional support: thoughts, prayers, gratitude, etc.
     Z. Other useful information not covered by any of the above categories.
  14. Candidate Generation Setup
     • Applied LDA to the messages in the “Z” category of each crisis
     • 5 topics were generated for each crisis
     • Considered messages with an LDA score > 0.06 in each topic
     • Presented the LDA-generated topics to experts in random order
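
A sketch of the thresholding used in this setup, assuming the n x m document-topic matrix from the candidate-generation sketch above; the only number taken from the slide is the 0.06 cutoff.

# doc_topic is the n x m document-topic matrix produced by LDA (see the earlier sketch);
# a message is attached to every topic for which its score exceeds the cutoff.
LDA_SCORE_CUTOFF = 0.06  # the threshold mentioned on this slide

def messages_per_topic(doc_topic, messages, cutoff=LDA_SCORE_CUTOFF):
    """Group messages by topic, keeping only those whose LDA score for that topic > cutoff."""
    buckets = {j: [] for j in range(doc_topic.shape[1])}
    for i, msg in enumerate(messages):
        for j, score in enumerate(doc_topic[i]):
            if score > cutoff:
                buckets[j].append(msg)
    return buckets
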
  15. Candidate Annotation Setup
     Recruited two experts from two international humanitarian organizations in the crisis response domain
  16. Results
     • Topics with an avg. score <= 2.5 are considered bad topics
     • Topics with an avg. score >= 3.5 are considered good topics
     • Hit: the metric value of good topics > that of bad topics
     A crisis is not considered for evaluation if all of its topics receive an average score either below or above 3.0.
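
A sketch of one way to compute the "hit" described above; the slide does not spell out whether good and bad topics are compared individually or as groups, so comparing group averages here is an assumption.

# For one crisis and one metric: a hit means good topics (avg. expert score >= 3.5)
# score higher on the metric than bad topics (avg. expert score <= 2.5).
# Comparing group averages is an assumption, not stated on the slide.
def is_hit(topics):
    """topics: list of (avg_expert_score, metric_value) pairs for one crisis."""
    scores = [s for s, _ in topics]
    # A crisis is skipped if all of its topics fall on the same side of 3.0.
    if all(s < 3.0 for s in scores) or all(s > 3.0 for s in scores):
        return None
    good = [m for s, m in topics if s >= 3.5]
    bad = [m for s, m in topics if s <= 2.5]
    if not good or not bad:
        return None
    return sum(good) / len(good) > sum(bad) / len(bad)
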
  17. Conclusion
     • Novelty, intra-similarity, and cohesiveness are useful in identifying good topics
     • Our approach combines top-down (manual) and bottom-up (automatic) elements
     • Learned important characteristics of good topics
     • Future work includes candidate ranking, including recommendations for adding, merging, or dropping new, unseen categories
  18. Data used in this study can be requested.
     Contact: Muhammad Imran at mimran@qf.org.qa or @mimran15
  19. Thank you!
     Authors’ contact: Muhammad Imran @mimran15, Carlos Castillo @ChaToX
