A Robust Framework for Classifying Evolving Document
Streams in an Expert-Machine-Crowd Setting
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
Data Stream Processing
Challenges
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Example data streams: credit card fraud detection, sensor data classification, social media stream mining
Social Media Stream Processing in
Time-Critical Situations
2013 Pakistan Earthquake
September 28 at 07:34 UTC
2010 Haiti Earthquake
January 12 at 21:53 UTC
Social Media
Platforms
Availability of Immense Data:
Around 16 thousand tweets
per minute were posted during
Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
Social Media Data Streams
Classification
We address two issues in the classification (supervised) of
social media streams:
1. How to keep the categories used for classification up-to-date?
2. While adding new categories, how to maintain high
classification accuracy?
Input and Output
INPUT: Category A, Category B, Category C, Miscellaneous Z
OUTPUT: Category A’, Category B’, Category C’, Z1, Z2, Z’
Problem Definition
Given as input a data set of documents:
Categorized into a taxonomy: containing
Partitioning of documents into taxonomy:
Our task is to produce a new taxonomy:
With the following characteristics:
• There are N new categories:
• Pre-existing categories are slightly modified:
• New categories are different than the old:
• The size of the miscellaneous category is reduced:
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)
Constraints Formation
1. Items in same category have Must-link constraints
2. Items belonging to different categories have Cannot-link
constraints
Category A Category B Category C Category Z
Must-link
Cannot-link
Note: Items in Z do not have any constraints
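The two constraint rules above can be sketched in a few lines of Python. This is only an illustration: the function name `build_constraints`, the label `"Z"` for the miscellaneous category, and the representation of constraints as index pairs are assumptions made for the example, not details taken from the slides.

```python
from itertools import combinations

def build_constraints(labels, misc="Z"):
    """Return (must_link, cannot_link) as sets of index pairs (i, j), i < j."""
    must_link, cannot_link = set(), set()
    for i, j in combinations(range(len(labels)), 2):
        # Items in the miscellaneous category carry no constraints.
        if labels[i] == misc or labels[j] == misc:
            continue
        if labels[i] == labels[j]:
            must_link.add((i, j))    # same category
        else:
            cannot_link.add((i, j))  # different categories
    return must_link, cannot_link

# Hypothetical toy labels: two items in A, one in B, one in miscellaneous Z.
labels = ["A", "A", "B", "Z"]
ml, cl = build_constraints(labels)
print(ml)  # must-link pairs
print(cl)  # cannot-link pairs
```

Enumerating all pairs is quadratic in the number of labeled items, so a real stream setting would probably sample pairs instead; that choice is not covered by the slides.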
Objective Function
Standard distortion error
If an ML constraint is violated, the cost of the violation is equal to the distance between the two centroids that contain the instances.
If a CL constraint is violated, the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
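As a rough sketch of how the three cost terms described above could be added up, the following Python function computes a total cost for a given assignment. It assumes squared Euclidean distances for all three terms and an unweighted sum; the slides do not show the exact formula, so both choices, as well as the name `constrained_cost`, are assumptions.

```python
import numpy as np

def constrained_cost(X, assign, centroids, must_link, cannot_link):
    """Distortion plus penalties for violated ML and CL constraints."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(centroids, dtype=float)

    def sq(a, b):
        # Squared Euclidean distance (an assumption of this sketch).
        return float(np.sum((a - b) ** 2))

    # Standard distortion error: each point to its assigned centroid.
    cost = sum(sq(X[i], C[assign[i]]) for i in range(len(X)))
    # Violated ML: the pair sits in two different clusters.
    for i, j in must_link:
        if assign[i] != assign[j]:
            cost += sq(C[assign[i]], C[assign[j]])
    # Violated CL: the pair shares cluster c; penalty is the distance
    # from c to its nearest other centroid h(c). Needs at least 2 centroids.
    for i, j in cannot_link:
        if assign[i] == assign[j]:
            c = assign[i]
            cost += min(sq(C[c], C[h]) for h in range(len(C)) if h != c)
    return cost

# Hypothetical toy data: one violated ML pair and one violated CL pair.
X = [[0, 0], [1, 0], [10, 0]]
centroids = [[0, 0], [10, 0]]
assign = [0, 0, 1]
total = constrained_cost(X, assign, centroids, {(0, 2)}, {(0, 1)})
print(total)  # 201.0 = 1 (distortion) + 100 (ML) + 100 (CL)
```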
Assignment and Update Rules
Rule 1: For items without any constraints: standard distortion error.
Rule 2: For items with Must-link constraints: the cost of a violation is the distance between their centroids.
Rule 3: For items with Cannot-link constraints: the cost is the distance between centroid c and its nearest centroid.
The Kronecker delta function is 1 if x = y and 0 if x != y.
Update rule: The update rule computes a modified average of all points that belong to a cluster.
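To make Rule 2 concrete, here is a minimal sketch of assigning one Must-link pair by brute force over all centroid pairs. The way the Kronecker delta enters the cost (as `1 - delta`, switching the penalty on only when the two items land in different clusters), the use of squared Euclidean distance, and the function names are assumptions of this example rather than the formulation shown on the slide.

```python
import numpy as np
from itertools import product

def kronecker_delta(x, y):
    """1 if x equals y, otherwise 0."""
    return 1 if x == y else 0

def assign_must_link_pair(xa, xb, centroids):
    """Pick centroid indices (p, q) for a Must-link pair at minimum cost."""
    C = np.asarray(centroids, dtype=float)
    xa = np.asarray(xa, dtype=float)
    xb = np.asarray(xb, dtype=float)
    best, best_cost = None, float("inf")
    for p, q in product(range(len(C)), repeat=2):
        # Distortion of both items under this candidate assignment.
        cost = np.sum((xa - C[p]) ** 2) + np.sum((xb - C[q]) ** 2)
        # Penalty applies only when the pair is split (p != q).
        cost += (1 - kronecker_delta(p, q)) * np.sum((C[p] - C[q]) ** 2)
        if cost < best_cost:
            best, best_cost = (p, q), float(cost)
    return best, best_cost

# Hypothetical pair: xb is closer to centroid 1, but splitting costs more.
pair, cost = assign_must_link_pair([0, 0], [3, 0], [[0, 0], [4, 0]])
print(pair, cost)  # (0, 0) 9.0
```

In this toy case the violation penalty (16) outweighs the distortion saved by splitting the pair, so both items stay in the same cluster.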
COD-Means Algorithm
1. Initialization (e.g. random pick of k centroids)
2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints
3. Points in each cluster are sorted based on their distance to the centroid, and the top l are removed and inserted into L
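The outlier-removal step can be illustrated with a simplified loop. The sketch below is not COD-Means itself: it uses plain nearest-centroid assignment (no ML or CL constraints), an unmodified mean update, and removes the l farthest points globally rather than per cluster. These simplifications, along with the function name and the optional `init` parameter, are assumptions made to keep the example short.

```python
import numpy as np

def kmeans_with_outliers(X, k, l, init=None, n_iter=20, seed=0):
    """k-means that sets aside the l farthest points before each update."""
    X = np.asarray(X, dtype=float)
    if init is None:
        # Initialization, e.g. a random pick of k data points as centroids.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centroids = np.array(init, dtype=float)
    for _ in range(n_iter):
        # Squared distance of every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        dist = d[np.arange(len(X)), assign]
        # Sort by distance to the assigned centroid; the top l go into L.
        outliers = np.argsort(dist)[::-1][:l]
        keep = np.ones(len(X), dtype=bool)
        keep[outliers] = False
        # Update each centroid from its remaining (non-outlier) members.
        for c in range(k):
            members = X[keep & (assign == c)]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign, set(outliers.tolist())

# Hypothetical toy data: two tight groups and one distant point.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [100, 100]]
centroids, assign, L = kmeans_with_outliers(X, k=2, l=1,
                                            init=[[0, 0], [10, 10]])
print(L)  # {6}: the distant point is set aside as an outlier
```

Because the distant point is excluded from the update, the second centroid stays at the center of its three nearby members instead of being pulled toward (100, 100).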
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and
novel?
2. What is the nature of the outliers (labeling errors) discovered by the COD-Means
algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of the clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the
overall accuracy improve?
Eight disaster-related datasets from Twitter were used.
Clusters Novelty and Coherence
K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
Impact on Classification Performance
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel
categories and labeling errors from live streams
• We used real-world Twitter datasets and
performed extensive experimentation
• We showed that COD-Means is able to identify
new categories and labeling errors efficiently
Thank you for your attention!

A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

  • 1. A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting Muhammad Imran*, Sanjay Chawla*, Carlos Castillo** *Qatar Computing Research Institute, Doha, Qatar **Eurecat, Barcelona, Spain
  • 2. Data Stream Processing Challenges 1. Infinite length 2. Concept-drift (change in data distributions) 3. Concept-evolution (new categories emerge) 4. Limited labeled data Credit Card fraud detection Sensor data classification Social media stream mining Data stream
  • 3. Social Media Stream Processing in Time-Critical Situations 2013 Pakistan Earthquake September 28 at 07:34 UTC 2010 Haiti Earthquake January 12 at 21:53 UTC Social Media Platforms Availability of Immense Data: Around 16 thousand tweets per minute were posted during Hurricane Sandy in the US. Opportunities: - Early warning and event detection - Situational awareness - Actionable information extraction - Rapid crisis response - Post-disaster analysis Disease outbreaks
  • 4. Social Media Data Streams Classification We address two issues in the (supervised) classification of social media streams: 1. How to keep the categories used for classification up-to-date? 2. While adding new categories, how to maintain high classification accuracy?
  • 5. Input and Output Category A Category B Category C Miscellaneous Z Category A’ Category B’ Category C’ Z1 Z2 Z’ INPUT OUTPUT
  • 6. Problem Definition Given as input a data set of documents: Categorized into a taxonomy: containing Partitioning of documents into taxonomy: Our task is to produce a new taxonomy: With the following characteristics: • There are N new categories: • Pre-existing categories are slightly modified: • New categories are different than the old: • The size of the miscellaneous category is reduced:
  • 7. Expert-Machine-Crowd Setting Constraints Outlier Detection (COD-Means): 1. Constraints formation using classified items 2. Clustering using COD-Means 3. Labeling errors identification (using outlier detection)
  • 9. Constraints Formation 1. Items in the same category have Must-link (ML) constraints 2. Items belonging to different categories have Cannot-link (CL) constraints Category A Category B Category C Category Z Must-link Cannot-link Note: Items in Z do not have any constraints
  • 10. Objective Function Standard distortion error. If an ML constraint is violated, the cost of the violation is equal to the distance between the two centroids that contain the instances. If a CL constraint is violated, the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
  • 11. Assignment and Update Rules Rule 1: For items without any constraints (standard distortion error) Rule 2: For items with Must-link constraints; the cost of a violation is the distance between their centroids Rule 3: For items with Cannot-link constraints; the cost is the distance between centroid c and its nearest centroid. Here δ is the Kronecker delta function, i.e. δ(x, y) is 1 if x = y and 0 if x ≠ y. Update rule: The update rule computes a modified average of all points that belong to a cluster.
  • 12. COD-Means Algorithm 1. Initialization (e.g. random pick of k centroids) 2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints 3. Points in each cluster are sorted by their distance to the centroid, and the top l are removed and inserted into L
  • 13. Dataset and Experiments 1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel? 2. What is the nature of the outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers? 3. What is the impact of outliers on the quality of the clusters generated by COD-Means? 4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve? 8 disaster-related Twitter datasets were used
  • 14. Clusters Novelty and Coherence K-Means vs. COD-Means • The proposed approach generates more cohesive and novel clusters by removing outliers. • As the value of L increases, tighter and more coherent clusters are observed.
  • 15. Data Improvements Evaluation 1. Labeling errors in non-miscellaneous categories 2. Items incorrectly labeled as miscellaneous
  • 17. Conclusion • Our setting: supervised stream classification • We presented COD-Means to learn novel categories and labeling errors from live streams • We used real-world Twitter datasets and performed extensive experimentation • We showed that COD-Means is able to identify new categories and labeling errors efficiently
  • 18. Thank you for your attention!
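
The constraint formation, objective function, and outlier step described in slides 9 to 12 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`build_constraints`, `constrained_cost`, `farthest_points`), the use of squared Euclidean distance, and the choice to rank outliers over all points rather than per cluster are assumptions made here for the example.

```python
import itertools
import numpy as np

def build_constraints(labels, misc="Z"):
    """Form Must-link / Cannot-link index pairs from category labels.
    Items in the miscellaneous category (assumed name: "Z") get none."""
    must_link, cannot_link = [], []
    for i, j in itertools.combinations(range(len(labels)), 2):
        if misc in (labels[i], labels[j]):
            continue  # items in Z do not have any constraints
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link

def sq_dist(a, b):
    # Assumption: squared Euclidean distance; the slides only say "distance".
    return float(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def nearest_other_centroid(centroids, c):
    """Index of the centroid closest to centroid c, i.e. h(c). Needs k >= 2."""
    others = [k for k in range(len(centroids)) if k != c]
    return min(others, key=lambda k: sq_dist(centroids[c], centroids[k]))

def constrained_cost(X, centroids, assign, must_link, cannot_link):
    """Distortion plus penalties for violated ML and CL constraints."""
    cost = sum(sq_dist(X[i], centroids[assign[i]]) for i in range(len(X)))
    for i, j in must_link:
        if assign[i] != assign[j]:  # ML violated: items were split
            cost += sq_dist(centroids[assign[i]], centroids[assign[j]])
    for i, j in cannot_link:
        if assign[i] == assign[j]:  # CL violated: items share a cluster
            c = assign[i]
            h = nearest_other_centroid(centroids, c)
            cost += sq_dist(centroids[c], centroids[h])
    return cost

def farthest_points(X, centroids, assign, l):
    """Indices of the l points farthest from their assigned centroid
    (assumption: ranked over all points, not separately per cluster)."""
    d = [sq_dist(X[i], centroids[assign[i]]) for i in range(len(X))]
    return sorted(range(len(X)), key=lambda i: d[i], reverse=True)[:l]

# Toy data: two labeled categories and one miscellaneous item.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [5, 5]], dtype=float)
labels = ["A", "A", "B", "B", "Z"]
centroids = np.array([[0, 0.5], [10, 0.5]])
ml, cl = build_constraints(labels)
good = constrained_cost(X, centroids, [0, 0, 1, 1, 0], ml, cl)
bad = constrained_cost(X, centroids, [0, 1, 1, 1, 0], ml, cl)
outliers = farthest_points(X, centroids, [0, 0, 1, 1, 0], 1)
```

In this toy example the second assignment splits a Must-link pair and merges two Cannot-link pairs, so `bad` exceeds `good`, and the unconstrained miscellaneous item is the one flagged as the farthest point.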

Editor's Notes

  1. Data stream processing, in applications such as credit card fraud detection, sensor data classification, and social media stream mining, faces a number of challenges, including infinite length, concept drift, concept evolution, and limited labeled data.
  2. In this work, we focus on social media stream classification. This is important during all types of disasters and emergencies. Our interest is to classify millions of messages that people post on social media after disasters.
  3. So, our setting is supervised classification of social media streams. We have humans who provide labels.
  4. This shows our input and what we expect as output.
  5. The COD-Means problem is NP-hard for k > 1 and L ≥ 0.
  6. The update rule of c_j computes a modified average of all points that belong to a cluster. The modification captures the number of elements in the cluster that violate the ML and CL constraints.
  7. Cohesiveness: intra/inter, where intra is the average distance between elements inside a cluster and inter is the average distance with respect to other clusters. Novelty: maxDist(C_i) − minDist(C_i), where maxDist(C_i) = max_{a ∈ C_i, b ∈ ∪_{j=1..k} T_j} d(a, b) and minDist(C_i) = min_{a ∈ C_i, b ∈ ∪_{j=1..k} T_j} d(a, b).
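
The cohesiveness and novelty measures in note 7 can be computed along the following lines. This is a minimal sketch under stated assumptions: d is taken to be Euclidean distance, the sets T are read as the items of the pre-existing categories (passed in as `existing`), and the function names are hypothetical.

```python
import numpy as np

def pairwise(A, B):
    """Matrix of Euclidean distances between rows of A and rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def cohesiveness(cluster, others):
    """intra / inter: average distance inside the cluster divided by the
    average distance to items of other clusters. Needs >= 2 cluster items."""
    n = len(cluster)
    d_in = pairwise(cluster, cluster)
    intra = d_in[np.triu_indices(n, k=1)].mean()  # each unordered pair once
    inter = pairwise(cluster, others).mean()
    return intra / inter

def novelty(cluster, existing):
    """maxDist - minDist between cluster items and existing-category items
    (assumption: `existing` stands for the union of the sets T)."""
    d = pairwise(cluster, existing)
    return d.max() - d.min()

# Toy example: a two-item cluster and a single reference item.
cluster = [[0, 0], [2, 0]]
reference = [[6, 0]]
coh = cohesiveness(cluster, reference)
nov = novelty(cluster, reference)
```

With this reading, a lower intra/inter ratio indicates a tighter cluster relative to its distance from the others.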