A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

Muhammad Imran, Scientist at the Qatar Computing Research Institute, Lead of the Crisis Computing group at QCRI
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
Data Stream Processing
Challenges
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Data stream examples: credit card fraud detection, sensor data classification, social media stream mining
Social Media Stream Processing in Time-Critical Situations
2013 Pakistan Earthquake: September 28 at 07:34 UTC
2010 Haiti Earthquake: January 12 at 21:53 UTC
Social Media Platforms
Availability of Immense Data:
Around 16 thousand tweets per minute were posted during Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
Social Media Data Streams Classification
We address two issues in the (supervised) classification of social media streams:
1. How to keep the categories used for classification up-to-date?
2. While adding new categories, how to maintain high classification accuracy?
Input and Output
INPUT: Category A, Category B, Category C, Miscellaneous Z
OUTPUT: Category A’, Category B’, Category C’, Z1, Z2, Z’
Problem Definition
Given as input a data set of documents:
Categorized into a taxonomy: containing
Partitioning of documents into taxonomy:
Our task is to produce a new taxonomy:
With the following characteristics:
• There are N new categories:
• Pre-existing categories are slightly modified:
• New categories are different than the old:
• The size of the miscellaneous category is reduced:
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)
Constraints Formation
1. Items in the same category have Must-link constraints
2. Items belonging to different categories have Cannot-link constraints
Figure: Category A, Category B, Category C, Category Z, connected by Must-link and Cannot-link edges
Note: Items in Z do not have any constraints
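The constraint formation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_constraints`, the pair representation, and the use of the label "Z" for the miscellaneous category are assumptions made for the example.

```python
from itertools import combinations

def build_constraints(labels, misc_label="Z"):
    """Return (must_link, cannot_link) as sets of index pairs (i, j) with i < j.

    Hypothetical helper: items labeled misc_label receive no constraints.
    """
    must_link, cannot_link = set(), set()
    # Only items outside the miscellaneous category take part in constraints.
    constrained = [i for i, lab in enumerate(labels) if lab != misc_label]
    for i, j in combinations(constrained, 2):
        if labels[i] == labels[j]:
            must_link.add((i, j))    # same category
        else:
            cannot_link.add((i, j))  # different categories
    return must_link, cannot_link

labels = ["A", "A", "B", "Z", "C"]
ml, cl = build_constraints(labels)
```

Note that the number of pairs grows quadratically with the number of labeled items, so a practical system would likely sample pairs rather than enumerate all of them.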
Objective Function
1. Standard distortion error.
2. If an ML constraint is violated, then the cost of the violation is equal to the distance between the two centroids that contain the instances.
3. If a CL constraint is violated, then the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
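The formula itself did not survive in this transcript. As a hedged sketch only, the three terms described on this slide could be combined as below, in the style of a constrained vector quantization error. The symbols are assumptions introduced here: C_j is cluster j with centroid c_j, g(x') is the index of the cluster containing x', h(c_j) is the centroid nearest to c_j, and distances are taken as squared Euclidean. The snippet needs the amsmath package.

```latex
\[
\mathcal{E} \;=\; \sum_{j=1}^{k} \Bigg[
  \sum_{x \in C_j} \lVert x - c_j \rVert^2
  \;+\; \sum_{\substack{(x, x') \in \mathit{ML} \\ x \in C_j,\ x' \notin C_j}}
        \lVert c_j - c_{g(x')} \rVert^2
  \;+\; \sum_{\substack{(x, x') \in \mathit{CL} \\ x \in C_j,\ x' \in C_j}}
        \lVert c_j - h(c_j) \rVert^2
\Bigg]
\]
```

The first sum is the standard distortion error, the second charges each violated Must-link pair the distance between the two centroids involved, and the third charges each violated Cannot-link pair the distance from the shared centroid to its nearest neighbor.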
Assignment and Update Rules
Rule 1: For items without any constraints (standard distortion error)
Rule 2: For items with Must-link constraints; the cost of a violation is the distance between their centroids
Rule 3: For items with Cannot-link constraints; the cost is the distance between centroid c and its nearest centroid
δ is the Kronecker delta function, i.e. δ(x, y) is 1 if x = y and 0 if x ≠ y
Update rule: The update rule computes a modified average of all points that belong to a cluster.
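The three assignment rules can be illustrated with a simplified, point-wise Python sketch. It assumes squared Euclidean distance, at least two centroids, and that the clusters of a point's Must-link and Cannot-link partners are already known; the function names and this point-by-point formulation are assumptions for illustration and may differ from the pairwise rules on the slide.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def delta(x, y):
    """Kronecker delta: 1 if x == y, else 0."""
    return 1 if x == y else 0

def nearest_other(j, centroids):
    """Index of the centroid closest to centroid j, excluding j itself."""
    others = [m for m in range(len(centroids)) if m != j]
    return min(others, key=lambda m: sq_dist(centroids[j], centroids[m]))

def assignment_cost(x, j, centroids, ml_partners, cl_partners):
    """Cost of placing x in cluster j.

    ml_partners / cl_partners hold the cluster indices already assigned
    to x's Must-link / Cannot-link partners (hypothetical inputs).
    """
    cost = sq_dist(x, centroids[j])            # Rule 1: distortion
    for g in ml_partners:                      # Rule 2: ML violated if g != j
        cost += (1 - delta(g, j)) * sq_dist(centroids[j], centroids[g])
    for g in cl_partners:                      # Rule 3: CL violated if g == j
        if delta(g, j):
            h = nearest_other(j, centroids)
            cost += sq_dist(centroids[j], centroids[h])
    return cost

def assign(x, centroids, ml_partners=(), cl_partners=()):
    """Pick the cluster index with the lowest total cost for x."""
    return min(range(len(centroids)),
               key=lambda j: assignment_cost(x, j, centroids,
                                             ml_partners, cl_partners))

centroids = [(0, 0), (10, 0), (0, 3)]
free_choice = assign((4, 1), centroids)
cl_choice = assign((4, 1), centroids, cl_partners=[0])
ml_choice = assign((4, 1), centroids, ml_partners=[1])
```

In this toy setup the unconstrained point goes to its closest centroid, a Cannot-link partner in that cluster pushes it to the next best one, and a Must-link partner in a distant cluster pulls it there.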
COD-Means Algorithm
1. Initialization (e.g. random pick of k centroids)
2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints
3. Points in each cluster are sorted based on their distance to the centroid, and the top l are removed and inserted into L
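Step 3 can be sketched as follows. This minimal Python example assumes that "top l" means the l points farthest from the centroid in each cluster and that squared Euclidean distance is used; the function name `remove_outliers` and the return format are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def remove_outliers(points, assignment, centroids, l):
    """Return (kept, outliers) as sorted lists of point indices.

    For every cluster, members are sorted by distance to the centroid
    and the l farthest are moved to the outlier list L.
    """
    outliers = []
    for j in range(len(centroids)):
        members = [i for i, a in enumerate(assignment) if a == j]
        members.sort(key=lambda i: sq_dist(points[i], centroids[j]),
                     reverse=True)
        outliers.extend(members[:l])  # farthest l points of cluster j
    outlier_set = set(outliers)
    kept = [i for i in range(len(points)) if i not in outlier_set]
    return kept, sorted(outliers)

points = [(0, 0), (1, 0), (9, 0), (10, 10), (11, 10), (10, 11), (30, 30)]
assignment = [0, 0, 0, 1, 1, 1, 1]
centroids = [(1, 0), (10, 10)]
kept, L = remove_outliers(points, assignment, centroids, l=1)
```

Centroids would then be recomputed from the kept points only, so that suspected labeling errors do not distort the clusters in the next iteration.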
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel?
2. What is the nature of the outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of the clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve?
Eight disaster-related datasets from Twitter were used.
Clusters Novelty and Coherence
K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
Impact on Classification Performance
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel
categories and labeling errors from live streams
• We used real-world Twitter datasets and
performed extensive experimentation
• We showed that COD-Means is able to identify
new categories and labeling errors efficiently
Thank you for your attention!

Editor's Notes

  1. Data stream processing, such as credit card fraud detection, sensor data classification, and social media stream mining, has a number of challenges, such as infinite length, concept drift, concept evolution, and limited labeled data.
  2. In this work, we focus on social media stream classification. This is important during all types of disasters and emergencies. Our interest is to classify millions of messages that people post on social media after disasters.
  3. So, our setting is supervised classification of social media streams. We have humans who provide labels.
  4. This shows what our input is and what we expect as output.
  5. The COD-Means problem is NP-hard for k > 1 and L ≥ 0.
  6. The update rule of c_j computes a modified average of all points that belong to a cluster. The modification captures the number of elements in the cluster that violated the ML and CL constraints.
  7. Cohesiveness: intra/inter, where intra is the average distance of elements inside a cluster and inter is the average distance with respect to other clusters. Novelty: $\mathrm{maxDist}(C_i) - \mathrm{minDist}(C_i)$, where $\mathrm{maxDist}(C_i) = \max_{a \in C_i,\, b \in \bigcup_{j=1}^{k} T_j} d(a, b)$ and $\mathrm{minDist}(C_i) = \min_{a \in C_i,\, b \in \bigcup_{j=1}^{k} T_j} d(a, b)$.
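The two measures in note 7 can be sketched in Python. The example assumes Euclidean distance for d, reads the T sets as the items of the pre-existing categories, and computes intra as the mean pairwise distance within a cluster; these readings, and all function names, are assumptions rather than details confirmed by the slides.

```python
import math
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def intra(cluster):
    """Average pairwise distance inside a cluster (0.0 for a single point)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def inter(cluster, other_clusters):
    """Average distance from the cluster's points to points in other clusters."""
    dists = [dist(a, b) for a in cluster
             for other in other_clusters for b in other]
    return sum(dists) / len(dists)

def cohesiveness(cluster, other_clusters):
    """intra / inter: lower values indicate a tighter, better separated cluster."""
    return intra(cluster) / inter(cluster, other_clusters)

def novelty(cluster, existing_items):
    """maxDist - minDist over pairs (a in cluster, b in existing categories)."""
    dists = [dist(a, b) for a in cluster for b in existing_items]
    return max(dists) - min(dists)

cluster = [(0, 0), (2, 0)]
others = [[(10, 0)], [(-6, 0)]]
coh = cohesiveness(cluster, others)
nov = novelty(cluster, [(10, 0), (-6, 0)])
```

Here the cluster's intra distance is 2 and its inter distance is 8, so the cohesiveness ratio is 0.25, and the novelty value is the spread between the farthest and nearest existing item.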