Piet Daas, Marco Puts, Ali Hürriyetoglu
Extracting information from
‘messy’ social media data
Using Big Data for official statistics
– Can we, and how can we, use Big Data for the production of
official statistics?
– Statistics Netherlands produces reliable and consistent
statistical information
‐ The official statistics of the country
– These figures are based on target populations
‐ E.g. the country, its inhabitants and its companies
– We want to use as much data as is (freely) available
‐ Fewer questionnaires, more administrative and Big Data
Combining this is challenging!
It is important to know that
– Statistics Netherlands is the first organization to have
produced an official statistic based on Big Data
‐ Road sensor data based traffic intensity statistics
– Statistics Netherlands is the leading organisation in the
official statistical world regarding the use of Big Data
– We recently created a ‘Center for Big Data Statistics’
‐ With many partners involved (> 30)
Pros and cons of using Big Data
– Positive (2 of the 3 V’s)
‐ A lot of data
‐ Readily available
– Negative
‐ Variety (not that stable)
‐ Potentially biased (selective part of population)
‐ Most are event based (e.g. message oriented, not user oriented)
‐ Little information is available on the users
‐ It’s a challenging data source for producing statistics with
high quality!
Big Data studies on Social media
– Statistics oriented
‐ Social media sentiment and Consumer Confidence
‐ Social media based (un)safety monitor
– Population oriented
‐ Users (People, Companies and Others)
‐ Determining background characteristics
‐ We use twiqs.nl, Coosto and Twitter API
Social media in the Netherlands
Map by Eric Fischer (via Fast Company)
Social media sentiment
– Studied public Dutch social media collected by Coosto
‐ Not only Twitter, but also Facebook, etc.
‐ Looked at the sentiment (+/-/n) in these messages
‐ Studied the change in overall sentiment over time
‐ Around 3-4 million messages per day
‐ Overall sentiment (%) = (positive messages – negative messages) / total messages × 100
‐ Day/week/month
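The overall sentiment formula above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the study: the `+`/`-`/`n` labels follow the slide, while the function names and the sample messages are made up for the example.

```python
from collections import Counter
from datetime import date


def overall_sentiment(labels):
    """Return (positive - negative) / total as a percentage, or None if empty."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return None
    return 100.0 * (counts["+"] - counts["-"]) / total


def sentiment_by_period(messages, key):
    """Group (date, label) pairs by key(date) and score each group."""
    groups = {}
    for day, label in messages:
        groups.setdefault(key(day), []).append(label)
    return {period: overall_sentiment(labels) for period, labels in groups.items()}


# Hypothetical sample: each message already carries a sentiment label.
messages = [
    (date(2013, 1, 1), "+"),
    (date(2013, 1, 1), "-"),
    (date(2013, 1, 1), "n"),
    (date(2013, 1, 1), "+"),
    (date(2013, 1, 2), "-"),
    (date(2013, 1, 2), "n"),
]

daily = sentiment_by_period(messages, key=lambda d: d)
monthly = sentiment_by_period(messages, key=lambda d: (d.year, d.month))
```

Passing a different `key` function gives the daily, weekly or monthly aggregation without changing the scoring itself.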
Daily, weekly, monthly sentiment
Sentiment per platform
Table 1. Social media message properties for various platforms and their correlation with consumer confidence

Platform                 Messages (1)     % of total   r (2)
All platforms combined   3,153,002,327    100          0.75    0.78
Facebook                 334,854,088      10.6         0.81*   0.85*
Twitter                  2,526,481,479    80.1         0.68    0.70
Hyves                    45,182,025       1.4          0.50    0.58
News sites               56,027,686       1.8          0.37    0.26
Blogs                    48,600,987       1.5          0.25    0.22
Google+                  644,039          0.02         -0.04   -0.09
LinkedIn                 565,811          0.02         -0.23   -0.25
YouTube                  5,661,274        0.2          -0.37   -0.41
Forums                   134,98,938       4.3          -0.45   -0.49

(1) Number of social media messages; period covered: June 2010 until November 2013
(2) Correlation coefficient (r) of monthly sentiment index and consumer confidence; confirmed by visually inspecting scatterplots and additional checks (see text)
* cointegrated
Platform specific results
Combination of Facebook and Twitter is best (r > 0.9)
(association continues after that period)
Overall findings
– Correlation and cointegration
‐ Consumer confidence survey is conducted during first 2 weeks of a month
‐ Comparing various periods revealed that best correlation and cointegration
is with last 2 weeks of previous month and first 2 weeks of current month
• Highest correlation 0.93* (all Facebook * filtered Twitter)
– Granger causality
‐ Changes in Consumer confidence precede changes in Social media
sentiment
‐ For all combinations shown!
• However: social media data is available to us sooner!
– Prediction
‐ Slightly better than random chance
‐ Best for the 4th ‘week’ of month
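The comparison of aggregation periods described above comes down to correlating a monthly sentiment series with the consumer confidence series for each candidate window. Below is a minimal sketch of that comparison using a plain Pearson correlation. All numbers and the window names are invented for illustration, and the cointegration and Granger causality checks mentioned on the slide are not covered here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly consumer confidence values.
confidence = [-30.0, -28.0, -33.0, -35.0, -31.0, -27.0]

# Hypothetical sentiment series, one per candidate aggregation window.
sentiment_by_window = {
    "calendar_month": [9.0, 12.0, 8.5, 9.5, 8.0, 11.0],
    "prev_2w_plus_cur_2w": [10.0, 11.2, 8.4, 7.5, 9.6, 11.3],
}

correlations = {
    window: pearson_r(series, confidence)
    for window, series in sentiment_by_window.items()
}
best_window = max(correlations, key=correlations.get)
```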
(Un)safety feeling in social media
– Interviewed people and created a list of words associated
with feelings of (un)safety (347)
– Checked if these words are used in social media (81)
– Only included the most frequently used words (24)
– First version of indicator
‐ Need to: Check context of messages included
‐ Need to: Compare height of peaks with ‘severity’ of
event
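A word-list indicator of this kind could look roughly like the sketch below: count, per day, the messages that contain at least one word from the list. The three words used here are placeholders chosen for the example, since the actual 24 words are not given on the slides, and the matching rule (whole words, case-insensitive) is likewise an assumption.

```python
import re
from collections import Counter


def daily_unsafety_counts(messages, words):
    """Count, per day, the messages containing at least one listed word."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for day, text in messages:
        if pattern.search(text):
            counts[day] += 1  # one hit per message, however many words match
    return dict(counts)


# Placeholder word list (assumption, not the study's list).
words = ["bang", "onveilig", "aanslag"]

# Hypothetical messages as (day, text) pairs.
messages = [
    ("2016-03-22", "Ik ben bang na de aanslag"),
    ("2016-03-22", "Onveilig gevoel op het station"),
    ("2016-03-23", "Mooi weer vandaag"),
    ("2016-03-23", "Bangalore trip geboekt"),
]

counts = daily_unsafety_counts(messages, words)
```

Matching on whole words keeps "Bangalore" from counting as "bang", but it does not address the context check listed as a to-do on the slide.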
Unsafety monitor (first version)
Peaks labelled in the figure:
‐ Spain-Netherlands football match (1-5), 13-06-2014
‐ MH17 day of national mourning, 23-07-2014
‐ Charlie Hebdo, Paris, 09-01-2015
‐ Intruder at NOS, 29-01-2015
‐ Terrorist attacks, Paris, 14-11-2015
‐ Bomb at airport, Brussels, 22-03-2016
‐ Truck attack, Nice, 14-07-2016
(Un)safety feeling in social media (2)
– Interviewed people and created a list of words associated
with feelings of (un)safety (347)
– Check if these words are used in social media (81)
– Only include the frequently used words (24)
– First version of indicator
‐ Need to: Check context of messages included
‐ Want to: Compare height of peaks with other data
Population studies
– Looked at composition of the units active on Twitter
– Type of units
‐ People, companies/organizations, and others
– Tried to determine background characteristics
‐ Not many units provide such information directly
‐ E.g. gender, age, income, level of education etc.
Starting point
– Drew a sample of 1,000 user IDs from Twitter
‐ Had a list of 330,000 from a previous study
– It was found that:
‐ 844 still existed
• 691 are persons (82%)
• 119 are companies/organizations (14%)
• 34 are ‘others’ (4%)
• Tried to determine gender
Sources of gender information in a Twitter profile: 1) Name, 2) Short bio, 3) Message content, 4) Picture
Gender findings: 1) First name
• Used Dutch ‘Voornamenbank’ website (First name database)
• Score between 0 and 1 (female – male); 676 of 844 (80%) names were registered
• Unknown names scored -1 (usually companies/organizations)
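The scoring convention above (0 for female, 1 for male, -1 for unknown) can be turned into a label as sketched below. The lookup table stands in for the first name database and its values are invented. The `margin` threshold and the choice to use the first token of the display name are also assumptions, not something stated on the slide.

```python
UNKNOWN = -1.0


def first_name_score(display_name, name_scores):
    """Look up the first token of a display name; -1 if it is not registered."""
    tokens = display_name.strip().split()
    if not tokens:
        return UNKNOWN
    return name_scores.get(tokens[0].lower(), UNKNOWN)


def gender_from_score(score, margin=0.2):
    """Map a 0..1 score (female..male) to a label; None if unknown or ambiguous."""
    if score < 0:
        return None
    if score >= 1 - margin:
        return "male"
    if score <= margin:
        return "female"
    return None


# Hypothetical scores standing in for the first name database.
name_scores = {"piet": 1.0, "anna": 0.0, "robin": 0.6}

labels = {
    name: gender_from_score(first_name_score(name, name_scores))
    for name in ["Piet Daas", "Anna de Vries", "Robin Jansen", "CBS"]
}
```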
Gender findings: 2) Short bio
– If a short bio is provided
– Quite a number of people mention their ‘position’ in the family
‐ Mother, father, papa, mama, ‘son of’, etc.
– Need to check both English and Dutch texts
– 155 of 583 (27%) indicated their gender in short bio
‐ Very precise for women!!
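A simple way to pick up such family roles from a bio is a keyword match covering both English and Dutch, as in the sketch below. The terms "mother", "father", "papa", "mama" and "son of" come from the slide. The remaining terms are assumed additions for the example, and the rule of returning no label when both groups match is likewise an assumption.

```python
import re

# Terms from the slide plus assumed English/Dutch equivalents.
FEMALE_TERMS = ["mother", "mama", "moeder", "daughter of", "dochter van"]
MALE_TERMS = ["father", "papa", "vader", "son of", "zoon van"]


def _compile(terms):
    """Build a case-insensitive whole-word pattern for a list of terms."""
    return re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )


FEMALE_RE = _compile(FEMALE_TERMS)
MALE_RE = _compile(MALE_TERMS)


def gender_from_bio(bio):
    """Return 'female', 'male', or None when the bio gives no clear signal."""
    if not bio:
        return None
    is_female = bool(FEMALE_RE.search(bio))
    is_male = bool(MALE_RE.search(bio))
    if is_female == is_male:
        return None  # nothing found, or conflicting signals
    return "female" if is_female else "male"
```

For example, `gender_from_bio("Proud mother of two")` gives `"female"`, while a bio without any of the listed terms gives `None` and is left for the other sources.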
Gender findings: 3) Tweets content
– In cooperation with University of Twente (Dong Nguyen)
– Machine learning approach that checks gender specific writing style
‐ Language specific: Messages need to be Dutch!
‐ 437 of 473 (92%) persons that created tweets could be classified
Gender findings: 4) Profile picture
– Use OpenCV to process pictures
– 1) Face recognition
– 2) Standardisation of faces (resize & rotate)
– 3) Classify faces according to gender
‐ 603 of 804 (75%) profile pictures had 1 or more faces on them
Gender findings: overall results (1)
– Diagnostic Odds Ratio (DOR) = (TP/FN) / (FP/TN)
– Random guessing: log(DOR) = 0

Source            log(DOR)
First name        4.33
Short bio         2.70
Tweet content     1.96
Picture (faces)   0.57

– Multi-agent findings
‐ Need ‘clever’ ways to combine these
‐ Take processing efficiency of the ‘agent’ into consideration
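The diagnostic odds ratio on this slide can be computed directly from the four cells of a confusion matrix. The sketch below uses the natural logarithm, which is an assumption because the slide does not state the base. The counts in the usage lines are invented.

```python
import math


def diagnostic_odds_ratio(tp, fn, fp, tn):
    """DOR = (TP/FN) / (FP/TN); requires all four counts to be positive."""
    if min(tp, fn, fp, tn) <= 0:
        raise ValueError("all four counts must be positive")
    return (tp / fn) / (fp / tn)


def log_dor(tp, fn, fp, tn):
    """Natural log of the DOR (log base is an assumption)."""
    return math.log(diagnostic_odds_ratio(tp, fn, fp, tn))


# Hypothetical classifier: 90 TP, 10 FN, 20 FP, 80 TN.
example_dor = diagnostic_odds_ratio(90, 10, 20, 80)

# A classifier that guesses at random has DOR 1, so log(DOR) is 0.
random_guess = log_dor(50, 50, 50, 50)
```

Zero cells make the ratio undefined. A common workaround is to add 0.5 to every cell, which this sketch leaves out.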
Gender findings: overall results (2)
Combine results in the best possible way

Unassigned (%)   Approach used
844 (100%)       1. Use short bio scores (very precise for females)
689 (82%)        2. Use first name scores
153 (18%)        3. Use Tweet content
29 (3.4%)        4. Use picture
20 (2.4%)        5. Assign male gender

Final log(DOR) is 7.02, an accuracy of 96.5%!
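The stepwise combination on this slide behaves like a fallback cascade: each approach is tried in turn, and a unit moves on to the next only if it is still unassigned. A minimal sketch follows, assuming each approach has already produced either a label or nothing per unit. The dictionary keys, the `AGENTS` list and the sample units are hypothetical.

```python
def assign_gender(unit, agents, default="male"):
    """Try each agent in order; fall back to the default if none decides."""
    for name, classify in agents:
        gender = classify(unit)
        if gender is not None:
            return gender, name
    return default, "default"


# Order follows the slide: short bio, first name, tweet content, picture.
AGENTS = [
    ("short_bio", lambda u: u.get("bio")),
    ("first_name", lambda u: u.get("name")),
    ("tweet_content", lambda u: u.get("tweets")),
    ("picture", lambda u: u.get("picture")),
]

# Hypothetical units with precomputed per-agent labels (missing = undecided).
units = [
    {"id": 1, "bio": "female", "name": "female"},
    {"id": 2, "name": "male", "picture": "female"},
    {"id": 3, "tweets": "female"},
    {"id": 4},
]

results = {u["id"]: assign_gender(u, AGENTS) for u in units}
```

Returning the name of the deciding agent alongside the label makes it easy to tabulate how many units remain unassigned after each step.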
Conclusions and future studies
– Social media is one of the most challenging data sources
for official statistics
– Using it requires that we:
‐ Focus on the information available
‐ Think outside the box (i.e. sentiment study)
– Good source to study potential ways to correct for the
selectivity of Big Data sources
– In future studies we will be looking at:
‐ Sentiment, unsafety and more. Population
composition, population dynamics and other
background characteristics
The Future
The future of statistics looks BIG
Thank you for your attention!
@pietdaas

More Related Content

Similar to Extracting information from ' messy' social media data

Useful by Piet Daas
Useful by Piet DaasUseful by Piet Daas
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
Piet J.H. Daas
 
Surveillance of social media: Big data analytics
Surveillance of social media: Big data analyticsSurveillance of social media: Big data analytics
Surveillance of social media: Big data analytics
Health Informatics New Zealand
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social network
Chanon Hongsirikulkit
 
s00146-014-0549-4.pdf
s00146-014-0549-4.pdfs00146-014-0549-4.pdf
s00146-014-0549-4.pdf
EngrAliSarfrazSiddiq
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
Piet J.H. Daas
 
IRJET- Sentiment Analysis using Machine Learning
IRJET- Sentiment Analysis using Machine LearningIRJET- Sentiment Analysis using Machine Learning
IRJET- Sentiment Analysis using Machine Learning
IRJET Journal
 
Social Media Analytics Research at the QUT Digital Media Research Centre
Social Media Analytics Research at the QUT Digital Media Research CentreSocial Media Analytics Research at the QUT Digital Media Research Centre
Social Media Analytics Research at the QUT Digital Media Research Centre
Axel Bruns
 
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET -  	  Twitter Sentiment Analysis using Machine LearningIRJET -  	  Twitter Sentiment Analysis using Machine Learning
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET Journal
 
Press Kit -LiMoSINe Project
Press Kit -LiMoSINe ProjectPress Kit -LiMoSINe Project
Press Kit -LiMoSINe Project
LiMoSINe Project
 
Social media gaucher
Social media gaucherSocial media gaucher
Social media gaucher
Rob Camp
 
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCEEPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
ijcsit
 
Information Contagion through Social Media: Towards a Realistic Model of the ...
Information Contagion through Social Media: Towards a Realistic Model of the ...Information Contagion through Social Media: Towards a Realistic Model of the ...
Information Contagion through Social Media: Towards a Realistic Model of the ...
Axel Bruns
 
Document(2)
Document(2)Document(2)
Document(2)
Sutha Guru
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoennieHuman Centered ICT
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...
Piet J.H. Daas
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
csandit
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
cscpconf
 
Mediawave, social media monitoring & data analytics
Mediawave, social media monitoring & data analyticsMediawave, social media monitoring & data analytics
Mediawave, social media monitoring & data analytics
Dwi Wahyono
 
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCEEPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
AIRCC Publishing Corporation
 

Similar to Extracting information from ' messy' social media data (20)

Useful by Piet Daas
Useful by Piet DaasUseful by Piet Daas
Useful by Piet Daas
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
Surveillance of social media: Big data analytics
Surveillance of social media: Big data analyticsSurveillance of social media: Big data analytics
Surveillance of social media: Big data analytics
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social network
 
s00146-014-0549-4.pdf
s00146-014-0549-4.pdfs00146-014-0549-4.pdf
s00146-014-0549-4.pdf
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
 
IRJET- Sentiment Analysis using Machine Learning
IRJET- Sentiment Analysis using Machine LearningIRJET- Sentiment Analysis using Machine Learning
IRJET- Sentiment Analysis using Machine Learning
 
Social Media Analytics Research at the QUT Digital Media Research Centre
Social Media Analytics Research at the QUT Digital Media Research CentreSocial Media Analytics Research at the QUT Digital Media Research Centre
Social Media Analytics Research at the QUT Digital Media Research Centre
 
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET -  	  Twitter Sentiment Analysis using Machine LearningIRJET -  	  Twitter Sentiment Analysis using Machine Learning
IRJET - Twitter Sentiment Analysis using Machine Learning
 
Press Kit -LiMoSINe Project
Press Kit -LiMoSINe ProjectPress Kit -LiMoSINe Project
Press Kit -LiMoSINe Project
 
Social media gaucher
Social media gaucherSocial media gaucher
Social media gaucher
 
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCEEPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
 
Information Contagion through Social Media: Towards a Realistic Model of the ...
Information Contagion through Social Media: Towards a Realistic Model of the ...Information Contagion through Social Media: Towards a Realistic Model of the ...
Information Contagion through Social Media: Towards a Realistic Model of the ...
 
Document(2)
Document(2)Document(2)
Document(2)
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || Choennie
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
 
Mediawave, social media monitoring & data analytics
Mediawave, social media monitoring & data analyticsMediawave, social media monitoring & data analytics
Mediawave, social media monitoring & data analytics
 
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCEEPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
EPIDEMIC OUTBREAK PREDICTION USING ARTIFICIAL INTELLIGENCE
 

More from Piet J.H. Daas

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
Piet J.H. Daas
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
Piet J.H. Daas
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
Piet J.H. Daas
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
Piet J.H. Daas
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
Piet J.H. Daas
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
Piet J.H. Daas
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
Piet J.H. Daas
 
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyUsing Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Piet J.H. Daas
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
Piet J.H. Daas
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big Data
Piet J.H. Daas
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
Piet J.H. Daas
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentation
Piet J.H. Daas
 
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics NetherlandsBig Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
Piet J.H. Daas
 
Big data Big impact?
Big data Big impact?Big data Big impact?
Big data Big impact?
Piet J.H. Daas
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
Piet J.H. Daas
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
Piet J.H. Daas
 
Big data en officiële statistiek
Big data en officiële statistiekBig data en officiële statistiek
Big data en officiële statistiek
Piet J.H. Daas
 
Data science and the future of statistics
Data science and the future of statisticsData science and the future of statistics
Data science and the future of statistics
Piet J.H. Daas
 
New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.
Piet J.H. Daas
 

More from Piet J.H. Daas (19)

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyUsing Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big Data
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentation
 
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics NetherlandsBig Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
 
Big data Big impact?
Big data Big impact?Big data Big impact?
Big data Big impact?
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
 
Big data en officiële statistiek
Big data en officiële statistiekBig data en officiële statistiek
Big data en officiële statistiek
 
Data science and the future of statistics
Data science and the future of statisticsData science and the future of statistics
Data science and the future of statistics
 
New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.
 

Recently uploaded

Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
frank0071
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 

Recently uploaded (20)

Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Deep Software Variability and Frictionless Reproducibility

Extracting information from 'messy' social media data

  • 1. Piet Daas, Marco Puts, Ali Hürriyetoglu Extracting information from ‘messy’ social media data
  • 2. Using Big Data for official statistics – Can we, and how can we, use Big Data for the production of official statistics? – Statistics Netherlands produces reliable and consistent statistical information ‐ The official statistics of the country – These figures are based on target populations ‐ E.g. the country, its inhabitants and its companies – We want to use as much data as is (freely) available ‐ Fewer questionnaires, more administrative data and Big Data ‐ Combining these is challenging!
  • 3. It is important to know that – Statistics Netherlands is the first organization that has produced a Big Data based official statistic ‐ Road sensor data based traffic intensity statistics – Statistics Netherlands is the leading organisation in the official statistical world regarding the use of Big Data – We have recently created a ‘Center for Big Data Statistics’ ‐ With many partners involved (> 30)
  • 4. Pros and cons of using Big Data – Positive (2 of the 3 V’s) ‐ A lot of data ‐ Readily available – Negative ‐ Variety (not that stable) ‐ Potentially biased (selective part of population) ‐ Most are event based (e.g. message oriented, not user) ‐ Little information is available on the users ‐ It’s a challenging data source for producing statistics with high quality!
  • 5. Big Data studies on Social media – Statistics oriented ‐ Social media sentiment and Consumer Confidence ‐ Social media based (un)safety monitor – Population oriented ‐ Users (People, Companies and Others) ‐ Determining background characteristics ‐ We use twiqs.nl, Coosto and Twitter API
  • 6. Social media in the Netherlands Map by Eric Fischer (via Fast Company)
  • 7. Social media sentiment – Studied public Dutch social media collected by Coosto ‐ Not only Twitter, but also Facebook, etc. ‐ Looked at the sentiment (+/-/n) in these messages ‐ Studied the change in overall sentiment over time ‐ Around 3-4 million messages per day ‐ Overall sentiment = (pos. messages – neg. messages) / total messages (%) ‐ Aggregated per day/week/month
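The overall sentiment definition on slide 7 can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `overall_sentiment` and the label encoding `"+"`, `"-"`, `"n"` (taken from the slide's "+/-/n" shorthand) are assumptions, and the example labels are made up.

```python
from collections import Counter


def overall_sentiment(labels):
    """Return (positive - negative) / total as a percentage.

    `labels` is an iterable of per-message sentiment labels for one
    period (day, week or month): "+" positive, "-" negative, "n" neutral.
    The label encoding is an assumption made for this sketch.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no messages in this period")
    # Neutral messages only contribute to the denominator.
    return 100.0 * (counts["+"] - counts["-"]) / total


# Hypothetical labels for one day: 2 positive, 1 negative, 2 neutral.
day = ["+", "+", "-", "n", "n"]
print(overall_sentiment(day))
```

Because neutral messages count towards the total, a period dominated by neutral messages yields an index close to zero even when the few opinionated messages all point the same way.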
  • 10. Platform specific results. Table 1. Social media message properties for various platforms and their correlation with consumer confidence. The last two columns are the two correlation coefficients (r) of monthly sentiment index and consumer confidence [2] listed per platform on the slide.
      Social media platform     Number of social media messages [1]   Percentage of total (%)   r       r
      All platforms combined    3,153,002,327                         100                       0.75    0.78
      Facebook                  334,854,088                           10.6                      0.81*   0.85*
      Twitter                   2,526,481,479                         80.1                      0.68    0.70
      Hyves                     45,182,025                            1.4                       0.50    0.58
      News sites                56,027,686                            1.8                       0.37    0.26
      Blogs                     48,600,987                            1.5                       0.25    0.22
      Google+                   644,039                               0.02                      -0.04   -0.09
      LinkedIn                  565,811                               0.02                      -0.23   -0.25
      YouTube                   5,661,274                             0.2                       -0.37   -0.41
      Forums                    134,984,938                           4.3                       -0.45   -0.49
      [1] Period covered: June 2010 until November 2013
      [2] Confirmed by visually inspecting scatterplots and additional checks (see text)
      * cointegrated
      Combination of Facebook and Twitter is best (r > 0.9) (association continues after that period)
  • 11. Overall findings – Correlation and cointegration ‐ Consumer confidence survey is conducted during the first 2 weeks of a month ‐ Comparing various periods revealed that the best correlation and cointegration is with the last 2 weeks of the previous month and the first 2 weeks of the current month • Highest correlation 0.93* (all Facebook * filtered Twitter) – Granger causality ‐ Changes in Consumer confidence precede changes in Social media sentiment ‐ For all combinations shown! • However: social media data is available to us sooner! – Prediction ‐ Slightly better than random chance ‐ Best for the 4th ‘week’ of the month
  • 12. (Un)safety feeling in social media – Interviewed people and created a list of words associated with feelings of (un)safety (347) – Checked if these words are used in social media (81) – Only included the most frequently used words (24) – First version of indicator ‐ Need to: Check context of messages included ‐ Need to: Compare height of peaks with ‘severity’ of event
  • 13. Unsafety monitor (first version). Figure: indicator over time, with labelled peaks: Bomb airport Brussels 22-03-2016; Truck attack Nice 14-07-2016; Terrorist attacks Paris 14-11-2015; Intruder NOS 29-01-2015; Charlie Hebdo Paris 09-01-2015; MH17 day of national mourning 23-07-2014; Spain-Netherlands football (1-5) 13-06-2014
  • 14. (Un)safety feeling in social media (2) – Interviewed people and created a list of words associated with feelings of (un)safety (347) – Check if these words are used in social media (81) – Only include the frequently used words (24) – First version of indicator ‐ Need to: Check context of messages included ‐ Want to: Compare height of peaks with other data
  • 15. Population studies – Looked at composition of the units active on Twitter – Type of units ‐ People, companies/organizations, and others – Tried to determine background characteristics ‐ Not many units provide such information directly ‐ E.g. gender, age, income, level of education etc.
  • 16. Starting point – Drew a sample of 1,000 user IDs from Twitter ‐ Had a list of 330,000 from a previous study – It was found that: ‐ 844 still existed • 691 are persons (82%) • 119 are companies/organizations (14%) • 34 are ‘others’ (4%) • Tried to determine gender
  • 17. 1) Name 2) Short bio 3) Message content 4) Picture
  • 18. Gender findings: 1) First name • Used Dutch ‘Voornamenbank’ website (First name database) • Score between 0 and 1 (female – male); 676 of 844 (80%) names were registered • Unknown names scored -1 (usually companies/organizations)
  • 19. Gender findings: 2) Short bio – If a short bio is provided – Quite a number of people mention their ‘position’ in the family ‐ Mother, father, papa, mama, ‘son of’, etc. – Need to check both English and Dutch texts – 155 of 583 (27%) indicated their gender in the short bio ‐ Very precise for women!!
  • 20. Gender findings: 3) Tweets content – In cooperation with University of Twente (Dong Nguyen) – Machine learning approach that checks gender specific writing style ‐ Language specific: Messages need to be Dutch! ‐ 437 of 473 (92%) persons that created tweets could be classified
  • 21. Gender findings: 4) Profile picture – Use OpenCV to process pictures – 1) Face recognition – 2) Standardisation of faces (resize & rotate) – 3) Classify faces according to gender ‐ 603 of 804 (75%) profile pictures had 1 or more faces on them
  • 22. Gender findings: overall results (1) – Diagnostic Odds Ratio (DOR) = (TP/FN) / (FP/TN); random guessing gives log(DOR) = 0 – log(DOR) per source: First name 4.33; Short bio 2.70; Tweet content 1.96; Picture (faces) 0.57 ‐ Multi-agent findings • Need ‘clever’ ways to combine these • Take processing efficiency of the ‘agent’ into consideration
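The Diagnostic Odds Ratio on slide 22 can be computed directly from a confusion matrix. A minimal sketch follows; the function name `log_dor` is hypothetical, the example counts are invented for illustration, and the use of the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def log_dor(tp, fn, fp, tn):
    """Log of the diagnostic odds ratio, DOR = (TP/FN) / (FP/TN).

    Uses the natural logarithm; the slide does not say which base was
    used, so this is an assumption of the sketch.
    """
    if min(tp, fn, fp, tn) <= 0:
        # The ratio is undefined (or infinite) when a cell is empty.
        raise ValueError("all four cells must be positive")
    return math.log((tp / fn) / (fp / tn))


# Random guessing: all four cells equal, so DOR = 1 and log(DOR) = 0.
print(log_dor(50, 50, 50, 50))

# Invented counts for a reasonably good classifier: DOR = 9 / 0.25 = 36.
print(round(log_dor(90, 10, 20, 80), 2))
```

A classifier with an empty cell (for example no false negatives) has an undefined DOR; a common workaround, not shown on the slides, is to add 0.5 to every cell before computing the ratio.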
  • 23. Gender findings: overall results (2) Combine results in the best possible way
      Unassigned (%)   Approach used
      844 (100%)       1. Use short bio scores (very precise for females)
      689 (82%)        2. Use first name scores
      153 (18%)        3. Use Tweet content
      29 (3.4%)        4. Use picture
      20 (2.4%)        5. Assign male gender
      Final log(DOR) is 7.02, an accuracy of 96.5%!
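The stepwise combination on slide 23 amounts to a cascade: each source is tried in a fixed order, and a user is passed to the next source only if the current one cannot assign a gender. The sketch below illustrates that control flow only. The two classifiers are hypothetical stand-ins (the keyword list and the 0.1/0.9 thresholds on the first-name score are assumptions, not the values used in the study), and the tweet-content and picture steps are omitted.

```python
from typing import Callable, Optional, Sequence

# A classifier returns "male", "female", or None when it cannot decide.
Classifier = Callable[[dict], Optional[str]]


def assign_gender(user: dict, steps: Sequence[Classifier],
                  default: str = "male") -> str:
    """Try each classifier in order; fall back to `default` (step 5)."""
    for step in steps:
        label = step(user)
        if label is not None:
            return label
    return default


def from_bio(user: dict) -> Optional[str]:
    """Hypothetical stand-in: look for family-role words in the short bio."""
    words = set(user.get("bio", "").lower().split())
    if words & {"mother", "mama", "moeder"}:
        return "female"
    if words & {"father", "papa", "vader"}:
        return "male"
    return None


def from_first_name(user: dict) -> Optional[str]:
    """Hypothetical stand-in: score runs from 0 (female) to 1 (male), -1 = unknown."""
    score = user.get("name_score", -1)
    if score < 0:
        return None
    if score >= 0.9:  # threshold is an assumption
        return "male"
    if score <= 0.1:  # threshold is an assumption
        return "female"
    return None


steps = [from_bio, from_first_name]
print(assign_gender({"bio": "Proud mama of two", "name_score": 0.5}, steps))
print(assign_gender({"name_score": 0.95}, steps))
print(assign_gender({}, steps))  # nothing known: falls through to the default
```

Ordering the steps by precision and by processing cost, as the slides suggest, means that expensive sources such as pictures only run for the small remainder of users.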
  • 24. Conclusions and future studies – Social media is one of the most challenging data sources for official statistics – Using it requires that we: ‐ Focus on the information available ‐ Think outside the box (e.g. sentiment study) – Good source to study potential ways to correct for the selectivity of Big Data sources – In future studies we will be looking at: ‐ Sentiment, unsafety and more ‐ Population composition, population dynamics and other background characteristics
  • 26. Thank you for your attention! @pietdaas