Emre Calisir, Marco Brambilla
KDWEB2018, Cáceres, Spain
The Problem of Data Cleaning
for Knowledge Extraction
from Social Media
Knowledge Extraction
from Social Media
is a Need
Keyword- or hashtag-based filtering is insufficient
Is it possible to extract a sub-selection of content items if and only if
they are actually relevant to the topic or context of interest?
Examples of Related Studies
1. Earthquake alarm system, Sakaki et al., Proc. of the 19th Int. Conference on WWW, 2010
2. Detection of influenza-like illnesses, Culotta, Proc. of the 1st Workshop on Social Media Analytics, Washington, D.C., 2010
3. Discovering health topics, Paul & Dredze, PLoS ONE 9, e103408, 2014
4. Detection of prescription medication abuse, Sarker et al., Drug Safety, 2016
5. Tracking baseball and fashion topics, Lin et al., KDD, 2011
6. Event detection system, Kunneman & Bosch, BNAIC, 2014
7. Credibility of trend-topic hashtag usage, Castillo et al., Proc. of the 20th Int. Conf. on WWW, ACM, Hyderabad, India, 2011
8. Non-relevant tweet filtering, Hajjem & Latiri, Procedia Computer Science 112, 2017
Supervised learning
trained on annotated
data could help us
Overview
Topic Relevancy Detection Machine
Topic Relevant Dataset
Proposed Data Cleaning Method for Knowledge Extraction
Use Case
Cultural Institutions of Italy
Non-Relevant Tweet:
Best #Hotel Deals in #Pompei #HotelDegliAmiciPompei starting at EUR99.60 https://t.co/5DxkKn4o69
Relevant Tweet:
Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later https://t.co/PyR2rP1Xpe #2017Rewind #archeology #history #Pompei #rome #RomanEmpire
4 feature extraction strategies
evaluated
N-grams (unigrams, bigrams, trigrams)
Word2Vec
Word2Vec + additional tweet features
Dimensionality Reduction with PCA
Annotated Data
726 tweets.
Contains tweets with specific hashtags and keywords related to
Pompei, Colosseo and Teatro alla Scala.
The data is 50% relevant and 50% non-relevant.
Model 1: Text transformation to ngrams
Number of unigrams, bigrams, trigrams: 494, 287, 228
Vocabulary size: 1009 n-grams (494 + 287 + 228)
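A minimal sketch of the n-gram transformation, assuming whitespace tokenization and binary presence features. The slides do not state the tokenizer or the weighting that was used, so both are assumptions, and the function names `extract_ngrams`, `build_vocabulary` and `vectorize` are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_ngrams(text, max_n=3):
    """Unigrams, bigrams and trigrams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(ngrams(tokens, n))
    return grams


def build_vocabulary(texts, min_count=1):
    """Map every n-gram seen at least min_count times to a column index."""
    counts = Counter(g for t in texts for g in extract_ngrams(t))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {g: i for i, g in enumerate(kept)}


def vectorize(text, vocabulary):
    """Binary presence vector of a text over the vocabulary (assumed weighting)."""
    vec = [0] * len(vocabulary)
    for g in extract_ngrams(text):
        if g in vocabulary:
            vec[vocabulary[g]] = 1
    return vec


# Toy corpus (hypothetical, not the annotated dataset).
corpus = ["pompei hotel deals", "pompei archeology history"]
vocab = build_vocabulary(corpus)
features = vectorize("pompei hotel", vocab)
```

With this layout each tweet becomes one row whose length equals the vocabulary size, which is 1009 columns for the counts reported above.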
Model 2: Text transformation to word2vec
• Word2vec dimension is set to 25.
• Word2vec vocabulary is built from 12K unlabeled tweets.
• Preprocessing operations before building the word2vec model:
    • Convert to lowercase
    • Discard web links
    • Discard words shorter than 3 characters
    • Eliminate stopwords
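The preprocessing steps listed for Model 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the stopword list is a tiny placeholder, the URL pattern is a simple guess at what counts as a web link, and `preprocess` is a hypothetical helper name.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"the", "and", "for", "have", "been", "may", "later"}

# Anything starting with http:// or https:// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(tweet, stopwords=STOPWORDS, min_len=3):
    """Lowercase, drop web links, drop short words and stopwords; return tokens."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)
    tokens = text.split()
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]


tweet = "Best #Hotel Deals in #Pompei starting at EUR99.60 https://t.co/5DxkKn4o69"
tokens = preprocess(tweet)
```

The resulting token lists are the kind of input a Word2Vec trainer (for example the one in gensim) would consume to build the 25-dimensional vectors mentioned above.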
Model 3: word2vec + Additional Features
Example:
Tweet
• Full text: #nuovi #corsi #inglese #settembre #pompei #chiamaci per #info https://t.co/QRrXlMC0g1
• Language: en
• Source: PostPickr
• Number of Favorited: 0
• Number of Retweets: 0
Author
• Number of Friends: 4
• Number of Followers: 9
• Number of Lists: 15
• Number of Favourited Tweets: 0
• Number of Tweets: 4220
• Verified Account: False
• Geo Enabled: False
• Default Profile: False
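One way to combine word2vec output with the additional tweet and author features is to concatenate a per-tweet embedding with the numeric attributes. The sketch below assumes the tweet embedding is the mean of its word vectors, which the slides do not state, and uses hypothetical dictionary keys and 2-dimensional toy vectors instead of the 25 dimensions of Model 2.

```python
def average_vector(tokens, word_vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def metadata_features(tweet):
    """Numeric and boolean tweet/author attributes as a flat list of floats."""
    return [
        float(tweet["friends"]),
        float(tweet["followers"]),
        float(tweet["lists"]),
        float(tweet["statuses"]),
        float(tweet["retweets"]),
        float(tweet["favorites"]),
        1.0 if tweet["verified"] else 0.0,
        1.0 if tweet["geo_enabled"] else 0.0,
        1.0 if tweet["default_profile"] else 0.0,
    ]


def combined_features(tokens, tweet, word_vectors, dim):
    """Embedding part followed by metadata part, as one feature row."""
    return average_vector(tokens, word_vectors, dim) + metadata_features(tweet)


# Toy embeddings (hypothetical values).
word_vectors = {"#pompei": [1.0, 0.0], "#corsi": [0.0, 1.0]}

# Attribute values taken from the example tweet on this slide; key names are assumed.
tweet = {
    "friends": 4, "followers": 9, "lists": 15, "statuses": 4220,
    "retweets": 0, "favorites": 0,
    "verified": False, "geo_enabled": False, "default_profile": False,
}

x = combined_features(["#pompei", "#corsi", "#info"], tweet, word_vectors, dim=2)
```

Categorical attributes such as language and source are left out here; encoding them (for instance one-hot) would be a further design choice.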
Model 4: PCA applied on Model 3
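As a reminder of what PCA does to a feature matrix, here is a small dependency-free sketch that extracts only the first principal component by power iteration. The number of components kept in Model 4 is not given in the slides, and the toy data and function names are assumptions. In practice a library implementation would be used, and features with very different scales (follower counts versus embedding values) would usually be standardized first.

```python
import math


def center(rows):
    """Subtract the column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=200):
    """Leading eigenvector of a covariance matrix via power iteration.

    Starts from a uniform vector, so it assumes that vector is not
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """Coordinates of each centered row along the given component."""
    return [sum(a * b for a, b in zip(r, component)) for r in rows]


# Toy data lying exactly on the line y = 2x (hypothetical).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(data)
pc1 = top_component(covariance(centered))
scores = project(centered, pc1)
```

For this toy data the first component points along the direction (1, 2), so one coordinate per row preserves all of the variance.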
10-fold Cross-Validated Results

Model        1      2      3      4
Accuracy     0.84   0.81   0.82   0.83
Precision    0.84   0.78   0.83   0.84
Recall       0.83   0.86   0.80   0.81
F1           0.83   0.82   0.81   0.82

Model 1: ngrams
Model 2: word2vec
Model 3: word2vec + additional features
Model 4: PCA applied on Model 3
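For reference, the four reported metrics and a 10-fold split can be sketched as below. The classifier and the exact fold assignment used in the study are not stated in the slides; the contiguous split here assumes the data has already been shuffled, and all names are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# 726 annotated tweets split into 10 folds.
folds = k_fold_indices(726, k=10)

# Toy labels and predictions (hypothetical).
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Each fold serves once as the test set while the other nine are used for training, and the per-fold metrics are then averaged.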
Conclusions
Supervised machine learning techniques could help to obtain
topic-relevant social media data
Collecting more data to build a larger Word2Vec vocabulary
New Use Cases
Challenges ahead
THANKS!
QUESTIONS?
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi

LORRAINE ANDREI_LEQUIGAN_HOW TO USE TRELLO
 
SluggerPunk Final Angel Investor Proposal
SluggerPunk Final Angel Investor ProposalSluggerPunk Final Angel Investor Proposal
SluggerPunk Final Angel Investor Proposal
 
Social Media Marketing Strategies .
Social Media Marketing Strategies                     .Social Media Marketing Strategies                     .
Social Media Marketing Strategies .
 
Unlock TikTok Success with Sociocosmos..
Unlock TikTok Success with Sociocosmos..Unlock TikTok Success with Sociocosmos..
Unlock TikTok Success with Sociocosmos..
 
Surat Digital Marketing School - course curriculum
Surat Digital Marketing School - course curriculumSurat Digital Marketing School - course curriculum
Surat Digital Marketing School - course curriculum
 
Your Path to YouTube Stardom Starts Here
Your Path to YouTube Stardom Starts HereYour Path to YouTube Stardom Starts Here
Your Path to YouTube Stardom Starts Here
 
Buy Pinterest Followers, Reactions & Repins Go Viral on Pinterest with Socio...
Buy Pinterest Followers, Reactions & Repins  Go Viral on Pinterest with Socio...Buy Pinterest Followers, Reactions & Repins  Go Viral on Pinterest with Socio...
Buy Pinterest Followers, Reactions & Repins Go Viral on Pinterest with Socio...
 
Multilingual SEO Services | Multilingual Keyword Research | Filose
Multilingual SEO Services |  Multilingual Keyword Research | FiloseMultilingual SEO Services |  Multilingual Keyword Research | Filose
Multilingual SEO Services | Multilingual Keyword Research | Filose
 
SluggerPunk Angel Investor Final Proposal
SluggerPunk Angel Investor Final ProposalSluggerPunk Angel Investor Final Proposal
SluggerPunk Angel Investor Final Proposal
 
“To be integrated is to feel secure, to feel connected.” The views and experi...
“To be integrated is to feel secure, to feel connected.” The views and experi...“To be integrated is to feel secure, to feel connected.” The views and experi...
“To be integrated is to feel secure, to feel connected.” The views and experi...
 
7 Tips on Social Media Marketing strategy
7 Tips on Social Media Marketing strategy7 Tips on Social Media Marketing strategy
7 Tips on Social Media Marketing strategy
 
Grow Your Reddit Community Fast.........
Grow Your Reddit Community Fast.........Grow Your Reddit Community Fast.........
Grow Your Reddit Community Fast.........
 
Improving Workplace Safety Performance in Malaysian SMEs: The Role of Safety ...
Improving Workplace Safety Performance in Malaysian SMEs: The Role of Safety ...Improving Workplace Safety Performance in Malaysian SMEs: The Role of Safety ...
Improving Workplace Safety Performance in Malaysian SMEs: The Role of Safety ...
 

Data Cleaning for social media knowledge extraction

  • 1. Emre Calisir, Marco Brambilla KDWEB2018, Cáceres, Spain The Problem of Data Cleaning for Knowledge Extraction from Social Media
  • 3. Keyword or hash-tag based filtering is insufficient
  • 4. Is it possible to extract a sub-selection of content items if and only if they are actually relevant to the topic or context of interest ?
  • 5. Examples of Related Studies
      1. Earthquake alarm system: Sakaki et al., Proc. of the 19th Int. Conference on WWW, 2010
      2. Detection of influenza-like illnesses: Culotta, Proc. of the 1st Workshop on Social Media Analytics, Washington, D.C., 2010
      3. Discovering health topics: Paul & Dredze, PLoS ONE 9, e103408, 2014
      4. Detection of prescription medication abuse: Sarker et al., Drug Safety, 2016
      5. Tracking baseball and fashion topics: Lin et al., KDD, 2011
      6. Event detection system: Kunneman & Bosch, BNAIC, 2014
      7. Credibility of trend-topic hashtag usage: Castillo et al., Proc. of the 20th Int. Conference on WWW, ACM, Hyderabad, India, 2011
      8. Non-relevant tweet filtering: Hajjem & Latiri, Procedia Computer Science, 112, 2017
  • 6. Supervised learning trained on annotated data could help us
  • 8. Proposed Data Cleaning Method for Knowledge Extraction
  • 10. Non-Relevant Tweet: Best #Hotel Deals in #Pompei #HotelDegliAmiciPompei starting at EUR99.60 https://t.co/5DxkKn4o69
        Relevant Tweet: Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later https://t.co/PyR2rP1Xpe #2017Rewind #archeology #history #Pompei #rome #RomanEmpire
  • 11. 4 feature extraction strategies evaluated:
      - N-grams (unigrams, bigrams, trigrams)
      - Word2Vec
      - Word2Vec + additional tweet features
      - Dimensionality reduction with PCA
  • 12. Annotated Data: 726 tweets containing specific hashtags and keywords related to Pompei, Colosseo and Teatro Alla Scala. The data is 50% relevant and 50% non-relevant.
  • 13. Model 1: Text transformation to n-grams. Number of [unigrams, bigrams, trigrams]: [494, 287, 228]. Vocabulary size: 1009 words.
  • 14. Model 2: Text transformation to word2vec
      - Word2vec dimension is set to 25.
      - Word2vec vocabulary is built from 12K unlabeled tweets.
      - Preprocessing before building the word2vec model: convert to lowercase; discard web links and words shorter than 3 characters.
      - Stopwords are eliminated before model building.
  • 15. Model 2: Text transformation to word2vec
  • 16. Model 2: Text transformation to word2vec
  • 17. Model 3: word2vec + Additional Features. Example (tweet and author fields):
      - Full text: #nuovi #corsi #inglese #settembre #pompei #chiamaci per #info https://t.co/QRrXlMC0g1
      - Number of Friends: 4
      - Number of Followers: 9
      - Number of Lists: 15
      - Number of Favourited Tweets: 0
      - Language: en
      - Number of Tweets: 4220
      - Source: PostPickr
      - Verified Account: False
      - Number of Favorited: 0
      - Geo Enabled: False
      - Number of Retweets: 0
      - Default Profile: False
  • 18. Model 4: PCA applied on Model 3
  • 19. 10-fold cross-validated results

      Model       1     2     3     4
      Accuracy    0.84  0.81  0.82  0.83
      Precision   0.84  0.78  0.83  0.84
      Recall      0.83  0.86  0.80  0.81
      F1          0.83  0.82  0.81  0.82

      Model 1: n-grams. Model 2: word2vec. Model 3: word2vec + additional features. Model 4: PCA applied on Model 3.
  • 20. Conclusions Supervised Machine Learning techniques could help to obtain topic relevant social media data
  • 21. Challenges ahead: collecting more data to build a larger Word2Vec vocabulary; new use cases.
  • 22. THANKS! QUESTIONS? Emre Calisir, Marco Brambilla The Problem of Data Cleaning for Knowledge Extraction from Social Media Marco Brambilla @marcobrambi marco.brambilla@polimi.it http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi

Editor's Notes

  1. 500 million tweets each day. 100 million daily active users. People tend to share every part of their lives and opinions. Every sector wants to extract knowledge from social media: banking, health, tourism...
  2. Data is created by humans, so it is very open to errors. Words can have multiple meanings. Rule-based filtering systems fail. (By rule-based systems we mean systems that collect tweets using queries alone.) They bring related and unrelated content together.
  3. Our research question.
  4. All of them are based on Twitter data. Various use cases, various machine learning techniques.
     Supervised (1 to 7): SVM, Multinomial Bayes, Logistic Regression, Decision Trees, Bayes Networks.
     Unsupervised (last one): Latent Dirichlet Allocation.
     We benefit from these studies:
     - Early alert system. Technique: SVM classifier. Features: keywords and the context of target-event words.
     - Detection of illnesses. Technique: logistic regression. Features: bag-of-words.
     - Filtering out non-relevant content. Technique: logistic regression. Features: unigrams, bigrams, trigrams. Very similar to the first model in our study.
     - Discovery of patterns of abuse of specific medications.
     - Topic tracking on tweet streams. Again unigrams.
     - Discarding noisy data for an event detection system based on Twitter stream data.
     - Assessment of the credibility of trending topics, to check whether they are really related to the topic of the hashtag used, or are spam. They tried SVM, decision trees and Bayes.
     - Unsupervised technique: a pooling method using Information Retrieval and Latent Dirichlet Allocation.
  5. The annotated dataset is labeled as Relevant or Non-Relevant to the topic. We train an SVM-based model on the labeled data, then predict labels for new unlabeled data. We selected Support Vector Machines with a linear kernel, the recommended approach when the feature vectors are sparse.
  6. Our basic approach to obtaining a relevant dataset. We applied it to Twitter, but it is also applicable to any kind of social media data.
  7. A more detailed illustration of the general flow. First we build our model, then we make the prediction. Finally we have a clean dataset.
  8. Pompei, Colosseo and Teatro Alla Scala
  9. Imagine that we have a social media monitoring tool. We want to track tweets related to the historical value of Pompei. How do we do it? Label relevant and non-relevant data. Who will label it? Subject experts.
  10. Subject experts labeled the tweets as relevant or non-relevant. Because the classes are balanced, a random guess would reach 0.5 accuracy.
  11. N-grams are a widely used technique in text classification. This is an example of how we transform a tweet into a feature vector. We also applied tf-idf to increase accuracy, which is a best practice when using n-grams.
  12. The default word2vec dimension is 100. However, our dataset is limited. We performed an analysis and achieved better results by representing words with 25 features. The larger the word2vec vocabulary, the better the semantic relations it captures.
  13. The top 3 most similar words are shown for specific words. The value in parentheses is the vector distance between the words. For a given tweet, we calculate the average of its word vectors.
  14. Another illustration of the trained word2vec model. This graph is generated with only a few words, just to show the vector representations of words.
  15. Feature extraction strategy. Vector features: Word2Vec. Numerical features: MinMaxScaler. Categorical features: OneHotEncoder. Dimension of features after transformation: dim(Word2Vec) = 25, dim(numerical features) = 7, dim(one-hot encoding) = 68. Total number of features = 125.
  16. The target is to analyze the impact of dimensionality reduction, which could improve model accuracy. Graphic: with a desired dimension < 40 the model is underfitting; with a desired dimension > 40 the model is overfitting. We selected dimension size 40.
  17. We preferred cross-validation because the size of our labeled dataset is limited (726 tweets). All the classification models are successful. The dotted line shows a random guess (the data contains 50% relevant and 50% non-relevant tweets). Model 1 has the best performance and Model 4 the second best. With a larger word2vec vocabulary, word2vec could achieve better accuracy than n-grams. The ROC curve and AUC scores support the performance of our model.
  18. We addressed our research question. We showed that text classification is a very convenient way to obtain topic-relevant data.
  19. We still have a way to go to build a more accurate system. We are now creating a larger dataset. We will look for new use cases, and for open datasets to compare our results with the literature. We will also try other algorithms, not only SVM, and we can bring in external data by importing content from the web links in the tweets.
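
Slide 14 lists the preprocessing applied before building the word2vec model: lowercasing, discarding web links and words shorter than 3 characters, and removing stopwords. The sketch below is one possible reading of those steps, not the authors' code. The whitespace tokenization, the URL pattern, and the tiny `STOPWORDS` set are assumptions made for illustration; the slides do not say which tokenizer or stopword list was used.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, not from the slides).
STOPWORDS = {"the", "and", "for", "have", "been", "may"}

# Assumed pattern for "web links": anything starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(tweet):
    """Lowercase, drop links, drop words shorter than 3 characters, drop stopwords."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)
    tokens = text.split()  # assumption: plain whitespace tokenization
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]


tokens = preprocess(
    "Best #Hotel Deals in #Pompei starting at EUR99.60 https://t.co/5DxkKn4o69"
)
print(tokens)
```

Hashtag symbols and punctuation are left untouched here because the slides do not state how they were handled.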
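
Note 11 describes turning a tweet into an n-gram feature vector weighted with tf-idf. A minimal plain-Python sketch of that idea follows. The helper names `ngrams` and `tfidf_vectors` are hypothetical, and the idf formula `log(N / df) + 1` is just one common variant chosen for illustration; the notes do not say which weighting or library the authors used.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all unigrams, bigrams and trigrams of a token list as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs):
    """Map tokenized documents to tf-idf vectors over a shared n-gram vocabulary."""
    counts_per_doc = [Counter(ngrams(doc)) for doc in docs]
    vocab = sorted(set().union(*counts_per_doc))
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for counts in counts_per_doc for gram in counts)
    # Assumed idf variant: log(N / df) + 1.
    idf = {gram: math.log(n_docs / df[gram]) + 1.0 for gram in vocab}
    vectors = [[counts[gram] * idf[gram] for gram in vocab] for counts in counts_per_doc]
    return vocab, vectors


docs = [["pompei", "hotel", "deals"], ["pompei", "history"]]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), vectors[0][vocab.index("pompei")])
```

Most entries of each vector are zero, which is the sparsity that note 5 gives as the reason for choosing a linear-kernel SVM.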
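
Note 13 states that a tweet is represented by the average of its word vectors. The sketch below shows that averaging step with made-up 3-dimensional vectors (the study uses 25 dimensions). The function name `tweet_vector`, the toy vectors, and the choice to skip out-of-vocabulary words and return a zero vector when none are known are all assumptions for illustration.

```python
def tweet_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that appear in the word2vec vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # Assumption: tweets with no known words map to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy vectors, invented for this example only.
toy_vectors = {
    "pompei": [1.0, 0.0, 2.0],
    "history": [3.0, 2.0, 0.0],
}

avg = tweet_vector(["pompei", "history", "unknownword"], toy_vectors, dim=3)
print(avg)
```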
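
Note 15 combines three kinds of features: the word2vec vector, min-max scaled numerical features, and one-hot encoded categorical features. The note names MinMaxScaler and OneHotEncoder; the sketch below is a plain-Python stand-in for those two transformations, not the authors' pipeline. The follower counts, the second source value, and the 2-dimensional text vectors are invented for illustration.

```python
def min_max_scale(column):
    """Rescale a numeric column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def one_hot(column):
    """Encode a categorical column as one 0/1 indicator per distinct category."""
    categories = sorted(set(column))
    return [[1.0 if value == c else 0.0 for c in categories] for value in column]


# Invented example rows: one numeric and one categorical feature per tweet.
followers = [9, 109, 59]
sources = ["PostPickr", "Twitter Web Client", "PostPickr"]
text_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # stand-in for word2vec averages

scaled = min_max_scale(followers)
encoded = one_hot(sources)

# Concatenate text, numeric and categorical parts into one feature vector per tweet.
features = [vec + [s] + enc for vec, s, enc in zip(text_vectors, scaled, encoded)]
print(features[0])
```

Scaling the numeric columns keeps counts such as the number of tweets from dominating the much smaller word-vector components.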
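
Note 17 reports 10-fold cross-validated accuracy, precision, recall and F1 on 726 labeled tweets. The sketch below shows how the folds can be formed and how the four metrics are computed from predictions, under stated assumptions: folds are assigned by simple interleaving with no shuffling or stratification (the notes do not say how the authors split the data), label 1 means relevant, and the labels and predictions are invented.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


splits = list(k_fold_indices(726, k=10))
fold_sizes = [len(test) for _, test in splits]

# Invented labels and predictions, only to exercise the metric function.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(fold_sizes, metrics)
```

In a full run, a model would be trained on each `train_idx` set and scored on the matching `test_idx` set, and the per-fold metrics would then be averaged.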