A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

An emerging challenge in the online classification of social media data streams is to keep the categories used for classification up-to-date. In this paper, we propose an innovative framework based on an Expert-Machine-Crowd (EMC) triad to help categorize items by continuously identifying novel concepts in heterogeneous data streams often riddled with outliers. We unify constrained clustering and outlier detection by formulating a novel optimization problem: COD-Means. We design an algorithm to solve the COD-Means problem and show that COD-Means will not only help detect novel categories but also seamlessly discover human annotation errors and improve the overall quality of the categorization process. Experiments on diverse real data sets demonstrate that our approach is both effective and efficient.

Published in: Science

  1. 1. A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting Muhammad Imran*, Sanjay Chawla*, Carlos Castillo** *Qatar Computing Research Institute, Doha, Qatar **Eurecat, Barcelona, Spain
  2. 2. Data Stream Processing Challenges: 1. Infinite length 2. Concept drift (change in data distributions) 3. Concept evolution (new categories emerge) 4. Limited labeled data. Data streams: credit card fraud detection, sensor data classification, social media stream mining.
  3. 3. Social Media Stream Processing in Time-Critical Situations. Examples: 2013 Pakistan Earthquake (September 28 at 07:34 UTC); 2010 Haiti Earthquake (January 12 at 21:53 UTC); disease outbreaks. Availability of immense data on social media platforms: around 16 thousand tweets per minute were posted during Hurricane Sandy in the US. Opportunities: early warning and event detection; situational awareness; actionable information extraction; rapid crisis response; post-disaster analysis.
  4. 4. Social Media Data Streams Classification. We address two issues in the supervised classification of social media streams: 1. How to keep the categories used for classification up to date? 2. While adding new categories, how to maintain high classification accuracy?
  5. 5. Input and Output. Input: Category A, Category B, Category C, Miscellaneous Z. Output: Category A’, Category B’, Category C’, Z1, Z2, Z’.
  6. 6. Problem Definition Given as input a data set of documents: Categorized into a taxonomy: containing Partitioning of documents into taxonomy: Our task is to produce a new taxonomy: With the following characteristics: • There are N new categories: • Pre-existing categories are slightly modified: • New categories are different from the old ones: • The size of the miscellaneous category is reduced:
  7. 7. Expert-Machine-Crowd Setting. Constraints Outlier Detection (COD-Means): 1. Constraint formation using classified items 2. Clustering using COD-Means 3. Identification of labeling errors (using outlier detection)
  8. 8. Expert-Machine-Crowd Setting. Constraints Outlier Detection (COD-Means): 1. Constraint formation using classified items 2. Clustering using COD-Means 3. Identification of labeling errors (using outlier detection)
  9. 9. Constraints Formation 1. Items in the same category have Must-link constraints 2. Items belonging to different categories have Cannot-link constraints. Figure: Category A, Category B, Category C, Category Z, with Must-link and Cannot-link edges. Note: Items in Z do not have any constraints.
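The two constraint rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_constraints`, the label `"Z"` for the miscellaneous category, and the choice to represent constraints as sets of index pairs are all assumptions made for the example.

```python
from itertools import combinations

def build_constraints(labels, misc_label="Z"):
    """Return (must_link, cannot_link) as sets of index pairs (i, j), i < j.

    Hypothetical helper: same-category pairs become must-link, pairs from
    different categories become cannot-link, and any pair involving the
    miscellaneous category gets no constraint at all.
    """
    must_link, cannot_link = set(), set()
    for i, j in combinations(range(len(labels)), 2):
        if misc_label in (labels[i], labels[j]):
            continue  # items in Z do not have any constraints
        if labels[i] == labels[j]:
            must_link.add((i, j))
        else:
            cannot_link.add((i, j))
    return must_link, cannot_link

# Toy labels: two items in A, one in B, one miscellaneous, one in C.
labels = ["A", "A", "B", "Z", "C"]
ml, cl = build_constraints(labels)
```

Enumerating all pairs is quadratic in the number of labeled items, so a real stream setting would presumably sample pairs or work on small batches; the slides do not say which.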
  10. 10. Objective Function. The first term is the standard distortion error. If an ML constraint is violated, then the cost of the violation is equal to the distance between the two centroids that contain the instances. If a CL constraint is violated, then the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
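The three cost terms described on this slide can be put together in a small Python sketch. The formula itself is not preserved in the transcript, so this is only an illustration under stated assumptions: squared Euclidean distance is used for every term, all terms are weighted equally, and the names `sq_dist`, `nearest_other`, and `constrained_error` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_other(c, centroids):
    """Index h(c) of the centroid closest to centroid c, excluding c itself.

    Requires at least two centroids.
    """
    others = (j for j in range(len(centroids)) if j != c)
    return min(others, key=lambda j: sq_dist(centroids[c], centroids[j]))

def constrained_error(points, assign, centroids, must_link, cannot_link):
    # Term 1: standard distortion error of each point to its own centroid.
    err = sum(sq_dist(p, centroids[assign[i]]) for i, p in enumerate(points))
    # Term 2: a must-link pair split across clusters costs the distance
    # between the two centroids that contain the instances.
    for i, j in must_link:
        if assign[i] != assign[j]:
            err += sq_dist(centroids[assign[i]], centroids[assign[j]])
    # Term 3: a cannot-link pair placed in the same cluster c costs the
    # distance between centroid c and its nearest centroid h(c).
    for i, j in cannot_link:
        if assign[i] == assign[j]:
            c = assign[i]
            err += sq_dist(centroids[c], centroids[nearest_other(c, centroids)])
    return err

# Toy data: two clusters on a line, one violated constraint of each kind.
points = [(0, 0), (0, 2), (10, 0)]
centroids = [(0, 1), (10, 0)]
assign = [0, 0, 1]
base = constrained_error(points, assign, centroids, set(), set())
total = constrained_error(points, assign, centroids, {(0, 2)}, {(0, 1)})
```

In the toy data the distortion alone is 2, and each violated constraint adds the squared distance between the two centroids (101), which shows how constraint violations can dominate the error when clusters are far apart.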
  11. 11. Assignment and Update Rules. Rule 1: For items without any constraints, the cost is the standard distortion error. Rule 2: For items with Must-link constraints, the cost of a violation is the distance between their centroids. Rule 3: For items with Cannot-link constraints, the cost is the distance between centroid c and its nearest centroid. The Kronecker delta function is 1 if x = y and 0 if x ≠ y. Update rule: The update rule computes a modified average of all points that belong to a cluster.
  12. 12. COD-Means Algorithm: 1. Initialization (e.g., random pick of k centroids). 2. Assignment of items based on the three assignment rules, considering ML and CL constraints. 3. Points in each cluster are sorted by their distance to the centroid, and the top l are removed and inserted into L.
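The outlier step at the end of this slide (sort by distance to the centroid, move the top l into L) might look like the following Python sketch. The slide does not say whether the top l points are taken per cluster or across all clusters; this example assumes per cluster. The function name `remove_outliers` and the use of plain Euclidean distance are likewise assumptions.

```python
import math
from collections import defaultdict

def remove_outliers(points, assign, centroids, l):
    """Split point indices into (kept, outliers).

    Assumption: within each cluster, the l points farthest from the
    centroid are treated as outliers and collected into the set L.
    """
    by_cluster = defaultdict(list)
    for i, c in enumerate(assign):
        by_cluster[c].append(i)

    outliers = set()
    for c, members in by_cluster.items():
        # Farthest-first ordering by distance to the cluster centroid.
        members.sort(key=lambda i: math.dist(points[i], centroids[c]),
                     reverse=True)
        outliers.update(members[:l])

    kept = [i for i in range(len(points)) if i not in outliers]
    return kept, sorted(outliers)

# Toy data: one far-away point in the first cluster.
points = [(0, 0), (1, 0), (9, 0), (20, 0), (21, 0)]
centroids = [(0, 0), (20, 0)]
assign = [0, 0, 0, 1, 1]
kept, L = remove_outliers(points, assign, centroids, l=1)
```

In a full loop, the centroids would then be recomputed from the kept points only, so that the removed items (candidate labeling errors) no longer pull the cluster centers.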
  13. 13. Dataset and Experiments 1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel? 2. What is the nature of the outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers? 3. What is the impact of outliers on the quality of the clusters generated by COD-Means? 4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve? Eight disaster-related datasets from Twitter were used.
  14. 14. Cluster Novelty and Coherence: K-Means vs. COD-Means • The proposed approach generates more cohesive and novel clusters by removing outliers. • As the value of L increases, tighter and more coherent clusters are observed.
  15. 15. Data Improvements Evaluation 1. Labeling errors in non-miscellaneous categories 2. Items incorrectly labeled as miscellaneous
  16. 16. Impact on Classification Performance
  17. 17. Conclusion • Our setting: supervised stream classification • We presented COD-Means to learn novel categories and labeling errors from live streams • We used real-world Twitter datasets and performed extensive experimentation • We showed that COD-Means is able to identify new categories and labeling errors efficiently
  18. 18. Thank you for your attention!