A Robust Framework for Classifying Evolving Document
Streams in an Expert-Machine-Crowd Setting
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
Data Stream Processing
Challenges
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Example data streams: credit card fraud detection, sensor data classification, social media stream mining
Social Media Stream Processing in
Time-Critical Situations
2013 Pakistan Earthquake
September 28 at 07:34 UTC
2010 Haiti Earthquake
January 12 at 21:53 UTC
Social Media
Platforms
Availability of Immense Data:
Around 16 thousand tweets
per minute were posted during
Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
Social Media Data Streams
Classification
We address two issues in the classification (supervised) of
social media streams:
1. How to keep the categories used for classification up-to-date?
2. While adding new categories, how to maintain high
classification accuracy?
Input and Output
INPUT: Category A, Category B, Category C, Miscellaneous Z
OUTPUT: Category A’, Category B’, Category C’, Z1, Z2, Z’
Problem Definition
Given as input a data set of documents:
Categorized into a taxonomy: containing
Partitioning of documents into taxonomy:
Our task is to produce a new taxonomy:
With the following characteristics:
• There are N new categories:
• Pre-existing categories are slightly modified:
• New categories are different than the old:
• The size of the miscellaneous category is reduced:
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)
Constraints Formation
1. Items in same category have Must-link constraints
2. Items belonging to different categories have Cannot-link
constraints
Category A Category B Category C Category Z
Must-link
Cannot-link
Note: Items in Z do not have any constraints
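The two constraint rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_constraints`, the label string "Z" for the miscellaneous category, and the representation of constraints as index pairs are hypothetical choices, not taken from the slides.

```python
from itertools import combinations

MISC = "Z"  # assumed label for the miscellaneous category


def build_constraints(labels, misc=MISC):
    """Return (must_link, cannot_link) as sets of index pairs (i, j), i < j."""
    must_link, cannot_link = set(), set()
    # Items in the miscellaneous category take part in no constraint.
    constrained = [i for i, lab in enumerate(labels) if lab != misc]
    for i, j in combinations(constrained, 2):
        if labels[i] == labels[j]:
            must_link.add((i, j))    # same category: Must-link
        else:
            cannot_link.add((i, j))  # different categories: Cannot-link
    return must_link, cannot_link


labels = ["A", "A", "B", "Z", "C"]
ml, cl = build_constraints(labels)
print(sorted(ml))  # [(0, 1)]
print(sorted(cl))  # [(0, 2), (0, 4), (1, 2), (1, 4), (2, 4)]
```

Enumerating all pairs grows quadratically with the number of labeled items, so a real system would likely sample pairs; the slides do not say how this is handled.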
Objective Function
Standard distortion error
If a Must-link (ML) constraint is violated,
then the cost of the violation is
equal to the distance between
the two centroids that contain
the instances.
If a Cannot-link (CL) constraint is violated,
then the error cost is the distance
between the centroid c assigned
to the pair and its nearest
centroid h(c).
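As a rough illustration, the three error terms described above can be summed as follows. The slides do not reproduce the exact formula, so this sketch makes assumptions that should be checked against the paper: it uses squared Euclidean distances throughout, applies no weighting factors, and the names `sq_dist`, `nearest_other`, and `constrained_error` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_other(c, centroids):
    """Index of the centroid closest to centroid c, i.e. h(c).

    Requires at least two centroids.
    """
    others = [k for k in range(len(centroids)) if k != c]
    return min(others, key=lambda k: sq_dist(centroids[c], centroids[k]))


def constrained_error(X, centroids, assign, must_link, cannot_link):
    # Standard distortion: each item against its assigned centroid.
    error = sum(sq_dist(x, centroids[a]) for x, a in zip(X, assign))
    for i, j in must_link:
        if assign[i] != assign[j]:  # ML violated: pair split across clusters
            error += sq_dist(centroids[assign[i]], centroids[assign[j]])
    for i, j in cannot_link:
        if assign[i] == assign[j]:  # CL violated: pair shares a cluster
            c = assign[i]
            error += sq_dist(centroids[c], centroids[nearest_other(c, centroids)])
    return error


X = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
centroids = [[0.5, 0.0], [10.0, 0.0]]
assign = [0, 0, 1]
# Distortion 0.5, one violated ML pair (90.25), one violated CL pair (90.25).
print(constrained_error(X, centroids, assign, {(0, 2)}, {(0, 1)}))  # 181.0
```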
Assignment and Update Rules
Rule 1: For items without any constraints, use the standard distortion error.
Rule 2: For items with Must-link constraints, the cost of a violation is the distance between their centroids.
Rule 3: For items with Cannot-link constraints, the cost of a violation is the distance between centroid c and
its nearest centroid.
δ is the Kronecker delta function,
i.e., δ(x, y) is 1 if x = y and 0 if x ≠ y.
Update rule: The update rule computes a modified
average of all points that belong to a
cluster.
COD-Means Algorithm
1. Initialization (e.g., random pick of k centroids).
2. Assignment of items based on the 3 assignment
rules, considering ML and CL constraints.
3. Points in each cluster are sorted based
on their distance to the centroid, and the
top l are removed and inserted into L.
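The loop structure of the three steps above can be sketched as plain k-means with outlier removal. This is a deliberately simplified stand-in, not the COD-Means algorithm itself: the constraint-aware assignment rules are replaced by nearest-centroid assignment, the "modified average" is replaced by an ordinary mean over the kept points, "top l" is read as the l farthest points per cluster, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_with_outliers(X, init_centroids, l, iters=10):
    """Simplified sketch: ML/CL-aware assignment is NOT implemented here."""
    centroids = [list(c) for c in init_centroids]  # step 1: given initial picks
    outliers = set()
    for _ in range(iters):
        # Step 2 (simplified): assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for i, x in enumerate(X):
            k = min(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))
            clusters[k].append(i)
        # Step 3: sort each cluster farthest-first and move the top l into L.
        outliers = set()
        for k, members in enumerate(clusters):
            members.sort(key=lambda i: sq_dist(X[i], centroids[k]), reverse=True)
            outliers.update(members[:l])
            kept = members[l:]
            if kept:  # update the centroid from the remaining points only
                centroids[k] = mean([X[i] for i in kept])
    return centroids, sorted(outliers)


X = [[0, 0], [0, 1], [1, 0], [9, 9], [20, 20], [21, 20], [20, 21], [30, 30]]
centroids, L = kmeans_with_outliers(X, [[0, 0], [20, 20]], l=1)
print(L)  # [3, 7]
```

Because the farthest points are excluded before each centroid update, the two stray points (indices 3 and 7) stop pulling the centroids away from the dense groups.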
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and
novel?
2. What is the nature of outliers (labeling errors) discovered by the COD-Means
algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the
overall accuracy improve?
Eight disaster-related datasets from Twitter were used.
Clusters Novelty and Coherence
K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
Impact on Classification Performance
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel
categories and labeling errors from live streams
• We used real-world Twitter datasets and
performed extensive experimentation
• We showed that COD-Means is able to identify
new categories and labeling errors efficiently
Thank you for your attention!

A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

  • 1. A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting Muhammad Imran*, Sanjay Chawla*, Carlos Castillo** *Qatar Computing Research Institute, Doha, Qatar **Eurecat, Barcelona, Spain
  • 2. Data Stream Processing Challenges 1. Infinite length 2. Concept-drift (change in data distributions) 3. Concept-evolution (new categories emerge) 4. Limited labeled data Credit Card fraud detection Sensor data classification Social media stream mining Data stream
  • 3. Social Media Stream Processing in Time-Critical Situations 2013 Pakistan Earthquake September 28 at 07:34 UTC 2010 Haiti Earthquake January 12 at 21:53 UTC Social Media Platforms Availability of Immense Data: Around 16 thousand tweets per minute were posted during Hurricane Sandy in the US. Opportunities: - Early warning and event detection - Situational awareness - Actionable information extraction - Rapid crisis response - Post-disaster analysis Disease outbreaks
  • 4. Social Media Data Streams Classification We address two issues in the supervised classification of social media streams: 1. How to keep the categories used for classification up-to-date? 2. While adding new categories, how to maintain high classification accuracy?
  • 5. Input and Output INPUT: Category A, Category B, Category C, Miscellaneous Z. OUTPUT: Category A’, Category B’, Category C’, Z1, Z2, Z’
  • 6. Problem Definition Given as input a data set of documents: Categorized into a taxonomy: containing Partitioning of documents into taxonomy: Our task is to produce a new taxonomy: With the following characteristics: • There are N new categories: • Pre-existing categories are slightly modified: • New categories are different than the old: • The size of the miscellaneous category is reduced:
  • 7. Expert-Machine-Crowd Setting Constraints Outlier Detection (COD-Means): 1. Constraints formation using classified items 2. Clustering using COD-Means 3. Labeling errors identification (using outlier detection)
  • 8. Expert-Machine-Crowd Setting Constraints Outlier Detection (COD-Means): 1. Constraints formation using classified items 2. Clustering using COD-Means 3. Labeling errors identification (using outlier detection)
  • 9. Constraints Formation 1. Items in the same category have Must-link constraints 2. Items belonging to different categories have Cannot-link constraints Category A Category B Category C Category Z Must-link Cannot-link Note: Items in Z do not have any constraints
  • 10. Objective Function Standard distortion error. If an ML constraint is violated, then the cost of the violation is equal to the distance between the centroids of the two clusters that contain the instances. If a CL constraint is violated, then the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
  • 11. Assignment and Update Rules Rule 1: For items without any constraints (standard distortion error) Rule 2: For items with Must-link constraints; cost of violation is the distance between their centroids Rule 3: For items with Cannot-link constraints; cost is the distance between centroid c and its nearest centroid h(c). δ is the Kronecker delta function, i.e. δ(x, y) is 1 if x = y and 0 if x ≠ y. Update rule: The update rule computes a modified average of all points that belong to a cluster.
  • 12. COD-Means Algorithm 1. Initialization (e.g. random pick of k centroids) 2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints 3. Points in each cluster are sorted based on their distance to the centroid, and the top l are removed and inserted into L
  • 13. Dataset and Experiments 1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel? 2. What is the nature of the outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers? 3. What is the impact of outliers on the quality of the clusters generated by COD-Means? 4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve? 8 disaster-related datasets from Twitter were used
  • 14. Clusters Novelty and Coherence K-Means vs. COD-Means • The proposed approach generates more cohesive and novel clusters by removing outliers. • As the value of L increases, tighter and more coherent clusters are observed.
  • 15. Data Improvements Evaluation 1. Labeling errors in non-miscellaneous categories 2. Items incorrectly labeled as miscellaneous
  • 17. Conclusion • Our setting: supervised stream classification • We presented COD-Means to learn novel categories and labeling errors from live streams • We used real-world Twitter datasets and performed extensive experimentation • We showed that COD-Means is able to identify new categories and labeling errors efficiently
  • 18. Thank you for your attention!
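The constraint formation, assignment rules, and outlier step described in slides 9 to 12 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_constraints` and `cod_means_sketch`, the `misc`, `init`, `n_iter`, and `seed` parameters, the use of squared Euclidean distances for the penalties, the choice to take the `n_outliers` farthest points overall rather than per cluster, and the plain-mean centroid update (in place of the "modified average" mentioned on slide 11) are all simplifications introduced here.

```python
from collections import defaultdict

import numpy as np


def build_constraints(labels, misc="Z"):
    """Must-link pairs inside a category, cannot-link pairs across categories.

    Items labeled `misc` (hypothetical name for the miscellaneous category)
    receive no constraints, as noted on slide 9.
    """
    ml, cl = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == misc or labels[j] == misc:
                continue
            (ml if labels[i] == labels[j] else cl).append((i, j))
    return ml, cl


def cod_means_sketch(X, k, ml, cl, n_outliers, init=None, n_iter=10, seed=0):
    """Simplified constrained clustering with outlier removal (assumption-laden)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if init is None:
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        centroids = np.asarray(init, dtype=float).copy()

    # Index the constraint partners of every item.
    ml_of, cl_of = defaultdict(list), defaultdict(list)
    for a, b in ml:
        ml_of[a].append(b)
        ml_of[b].append(a)
    for a, b in cl:
        cl_of[a].append(b)
        cl_of[b].append(a)

    def sqdist(u, v):
        return float(np.sum((u - v) ** 2))

    # Start from a plain nearest-centroid assignment.
    assign = np.array([int(np.argmin([sqdist(x, c) for c in centroids])) for x in X])
    outliers = np.array([], dtype=int)

    for _ in range(n_iter):
        # cc[a, b]: squared distance between centroids; h[j]: nearest other centroid.
        cc = np.array([[sqdist(a, b) for b in centroids] for a in centroids])
        np.fill_diagonal(cc, np.inf)
        h = cc.argmin(axis=1)

        for i in range(n):
            # Rule 1: standard distortion error.
            cost = np.array([sqdist(X[i], c) for c in centroids])
            for j in range(k):
                # Rule 2: a violated must-link costs the distance between centroids.
                for p in ml_of[i]:
                    if assign[p] != j:
                        cost[j] += cc[j, assign[p]]
                # Rule 3: a violated cannot-link costs the distance to h(c).
                for p in cl_of[i]:
                    if assign[p] == j and k > 1:
                        cost[j] += cc[j, h[j]]
            assign[i] = int(cost.argmin())

        # Outlier step: the n_outliers points farthest from their centroid.
        d = np.linalg.norm(X - centroids[assign], axis=1)
        outliers = np.argsort(d)[n - n_outliers:]
        keep = np.ones(n, dtype=bool)
        keep[outliers] = False

        # Update step (plain mean over non-outlier members; a simplification).
        for j in range(k):
            members = X[keep & (assign == j)]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return assign, centroids, outliers
```

To try it, build constraints from a list of category labels with `build_constraints`, then pass the resulting pairs to `cod_means_sketch` together with a feature matrix. Supplying `init` makes a run reproducible regardless of the random initialization.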

Editor's Notes

  1. Data stream processing tasks such as credit card fraud detection, sensor data classification, and social media stream mining face a number of challenges, including infinite length, concept drift, concept evolution, and limited labeled data.
  2. In this work, we focus on social media stream classification. This is important during all types of disasters and emergencies. Our interest is to classify millions of messages that people post on social media after disasters.
  3. So, our setting is supervised classification of social media streams. We have humans who provide labels.
  4. This shows our input and the output we expect.
  5. The COD-Means problem is NP-hard for k > 1 and L ≥ 0.
  6. The update rule of cj computes a modified average of all points that belong to a cluster. The modification captures the number of elements in the cluster which violated the ML and CL constraints.
  7. Cohesiveness: the ratio intra/inter, where intra is the average distance between elements inside a cluster and inter is the average distance with respect to other clusters. Novelty: $\mathrm{maxDist}(C_i) - \mathrm{minDist}(C_i)$, where $\mathrm{maxDist}(C_i) = \max_{a \in C_i,\, b \in \bigcup_{j=1}^{k} T_j} d(a, b)$ and $\mathrm{minDist}(C_i) = \min_{a \in C_i,\, b \in \bigcup_{j=1}^{k} T_j} d(a, b)$.
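The two evaluation measures in note 7 can be computed along the following lines. This is a hedged sketch rather than the authors' evaluation code: the helper names `pairwise`, `cohesiveness`, and `novelty` are made up here, d is assumed to be Euclidean distance, "inter" is read as the mean distance from a cluster's points to all points in the other clusters, and the second argument of `novelty` is assumed to hold the items of the pre-existing categories.

```python
import numpy as np


def pairwise(A, B):
    """Euclidean distance between every row of A and every row of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)


def cohesiveness(cluster, others):
    """Ratio of average intra-cluster distance to average distance to other clusters.

    Needs at least two points in `cluster`; lower values mean a tighter cluster.
    """
    D = pairwise(cluster, cluster)
    intra = D[np.triu_indices(len(D), k=1)].mean()  # each unordered pair once
    inter = pairwise(cluster, others).mean()
    return intra / inter


def novelty(cluster, existing):
    """maxDist minus minDist between a cluster and the items of existing categories."""
    D = pairwise(cluster, existing)
    return D.max() - D.min()
```

Both functions take plain arrays of feature vectors, so they can be applied to the output of any clustering routine.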