SlideShare a Scribd company logo
1 of 12
Piet Daas and Joep Burger
With special thanks to Marco Puts & Dong Nguyen1
Profiling Big Data sources to
assess their selectivity
Big Data
– More and more organizations want to use Big Data as a
new/additional source of information
– However, there are some major challenges :
– Selectivity of Big Data
– Source does not have to completely cover the target
population
– What part of the population is included?
2
Profiling: extracting ‘features’
3
– Extract background characteristics (‘features’) from the ‘units’
in Big Data in an attempt to determine its selectivity
‐ The need for this depends on the ‘type’ of Big data source and its
foreseen use
– Important background characteristics for statistics are:
‐ Persons: gender, age, income, education, origin,
urbanicity, household composition, ..
‐ Companies: number of employees, turnover, type of
economic activity, legal form, ..
Social Media: Twitter as an example
4
– On Social media persons, companies and ‘others’ can
create an account and create messages
‐ In the Netherlands 70% of the population is active on social media
– What kind of information is available on Twitter of a user
‐ Focus on gender!
– Let’s look at a profile: @pietdaas
5
1)Name
2) Short bio
3) Messages
content
4) Picture
Studied a Twitter sample
– From a list of Dutch Twitter users (~330.000)
– A random sample of 1000 unique ids was drawn
– Of the sample:
‐ 844 profiles still existed
• 844 had a name
• 583 provided a short bio
• 473 created ‘tweets’
• 804 had a ‘non-default’ picture
• 409 Men (49%)
• 282 Women (33%)
• 153 ‘Others’ (18%)
• companies, organizations, dogs, cats, ‘bots’..
6
DefaultTwitter picture
Gender findings: 1) First name
7
– Used Dutch ‘Voornamenbank’ website (First name database)
– Score between 0 and 1 (female – male); 676 of 844 (80%) names were registered
– Unknown names scored -1 (usually companies/organizations)
8
Gender findings: 2) Short bio
– If a short bio is provided
‐ Quite a number of people mention there ‘position’ in the
family
• Mother, father, papa, mama, ‘son of’, etc.
‐ Sometimes also occupations are mentioned that reflect the
gender (‘studente’)
‐ 155 of 583 (27%) indicated there gender in short bio
‐ Need to check both English and Dutch texts
Gender findings: 3) Tweets content
– In cooperation with University of Twente (Dong Nguyen)
– Machine learning approach that determines gender specific writing style
‐ Language specific: Messages need to be Dutch!
‐ 437 of 473 (92%) persons that created tweets could be classified
Gender findings: 4) Profile picture
10
– Use OpenCV to process pictures
1) Face recognition
2) Standardisation of faces (resize & rotate)
3) Classify faces according to gender
- 603 of 804 (75%) profile pictures had 1 or more faces on it
1
2
3
Gender findings: overall results
11
Diagnostic Odds Ratio =
(TP/FN) / (FP/TN)
random guessing
log(DOR) = 0
‐ Multi-agent findings
• Need clever ways to combine these
• Take processing efficiency of the ‘agent’ into consideration
Diagnostic Odds
Ratio (log)
First name 6.41
Short bio 3.50
Tweet content 2.36
Picture (faces) 0.72
Thank you for your attention !
12

More Related Content

What's hot

Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...
Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...
Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...Ferdin Joe John Joseph PhD
 
Sentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and WhySentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and WhyDavide Feltoni Gurini
 
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...Miriam Fernandez
 
Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...
Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...
Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...Monica Powell
 
Data augmented ethnography: 
using big data and ethnography to explore candi...
Data augmented ethnography: 
using big data and ethnography  to explore candi...Data augmented ethnography: 
using big data and ethnography  to explore candi...
Data augmented ethnography: 
using big data and ethnography to explore candi...Salla-Maaria Laaksonen
 
Social Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on TwitterSocial Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on TwitterAxel Bruns
 
Fake News Detector
Fake News DetectorFake News Detector
Fake News DetectorIrisYoon5
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Farida Vis
 
Who’s in the Gang? Revealing Coordinating Communities in Social Media
Who’s in the Gang? Revealing Coordinating Communities in Social MediaWho’s in the Gang? Revealing Coordinating Communities in Social Media
Who’s in the Gang? Revealing Coordinating Communities in Social MediaDerek Weber
 
Chung-Jui LAI - Polarization of Political Opinion by News Media
Chung-Jui LAI - Polarization of Political Opinion by News MediaChung-Jui LAI - Polarization of Political Opinion by News Media
Chung-Jui LAI - Polarization of Political Opinion by News MediaREVULN
 
Twitter Data Analytics
Twitter Data AnalyticsTwitter Data Analytics
Twitter Data Analyticsrupika08
 
On user generated content, teleology and predictability in social systems
On user generated content, teleology and predictability in social systemsOn user generated content, teleology and predictability in social systems
On user generated content, teleology and predictability in social systemsUniversità of Urbino Carlo Bo
 
2018 02-13 pathways-data enquiry_martina_emke
2018 02-13 pathways-data enquiry_martina_emke2018 02-13 pathways-data enquiry_martina_emke
2018 02-13 pathways-data enquiry_martina_emkeDr Martina Emke
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisIRJET Journal
 
Analysis Tweets Korea Politicians(25 Sep2009)Sj
Analysis Tweets Korea Politicians(25 Sep2009)SjAnalysis Tweets Korea Politicians(25 Sep2009)Sj
Analysis Tweets Korea Politicians(25 Sep2009)SjWCU Webometrics Institute
 
1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai
1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai
1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai1crore projects
 
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATAPREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATAkevig
 

What's hot (20)

Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...
Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...
Twitter Based Outcome Predictions of 2019 Indian General Elections Using Deci...
 
Sentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and WhySentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and Why
 
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...
ECSM2014: Using Social Media To Inform Policy Making: To whom are we listenin...
 
Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...
Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...
Using Tweets for Understanding Public Opinion During U.S. Primaries and Predi...
 
Data augmented ethnography: 
using big data and ethnography to explore candi...
Data augmented ethnography: 
using big data and ethnography  to explore candi...Data augmented ethnography: 
using big data and ethnography  to explore candi...
Data augmented ethnography: 
using big data and ethnography to explore candi...
 
Social Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on TwitterSocial Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on Twitter
 
Fake News Detector
Fake News DetectorFake News Detector
Fake News Detector
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
 
Who’s in the Gang? Revealing Coordinating Communities in Social Media
Who’s in the Gang? Revealing Coordinating Communities in Social MediaWho’s in the Gang? Revealing Coordinating Communities in Social Media
Who’s in the Gang? Revealing Coordinating Communities in Social Media
 
Chung-Jui LAI - Polarization of Political Opinion by News Media
Chung-Jui LAI - Polarization of Political Opinion by News MediaChung-Jui LAI - Polarization of Political Opinion by News Media
Chung-Jui LAI - Polarization of Political Opinion by News Media
 
Political Poster Edit
Political Poster EditPolitical Poster Edit
Political Poster Edit
 
Chung-Hong Chan and King-Wa Fu: Differential opinions among Hong Kong online ...
Chung-Hong Chan and King-Wa Fu: Differential opinions among Hong Kong online ...Chung-Hong Chan and King-Wa Fu: Differential opinions among Hong Kong online ...
Chung-Hong Chan and King-Wa Fu: Differential opinions among Hong Kong online ...
 
Twitter Data Analytics
Twitter Data AnalyticsTwitter Data Analytics
Twitter Data Analytics
 
On user generated content, teleology and predictability in social systems
On user generated content, teleology and predictability in social systemsOn user generated content, teleology and predictability in social systems
On user generated content, teleology and predictability in social systems
 
2018 02-13 pathways-data enquiry_martina_emke
2018 02-13 pathways-data enquiry_martina_emke2018 02-13 pathways-data enquiry_martina_emke
2018 02-13 pathways-data enquiry_martina_emke
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and Analysis
 
Secondary source qual
Secondary source qualSecondary source qual
Secondary source qual
 
Analysis Tweets Korea Politicians(25 Sep2009)Sj
Analysis Tweets Korea Politicians(25 Sep2009)SjAnalysis Tweets Korea Politicians(25 Sep2009)Sj
Analysis Tweets Korea Politicians(25 Sep2009)Sj
 
1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai
1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai
1 Crore Projects | ieee 2016 Projects | 2016 ieee Projects in chennai
 
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATAPREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
 

Similar to Profiling Big Data sources to assess their selectivity

Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media dataPiet J.H. Daas
 
Inclusive networks (2014 Forum on Workplace Inclusion)
Inclusive networks (2014 Forum on Workplace Inclusion)Inclusive networks (2014 Forum on Workplace Inclusion)
Inclusive networks (2014 Forum on Workplace Inclusion)Joe Gerstandt
 
HumanityRoad training - Basic Crisis Information Management
HumanityRoad training - Basic Crisis Information ManagementHumanityRoad training - Basic Crisis Information Management
HumanityRoad training - Basic Crisis Information ManagementGisli Olafsson
 
Ai, social media and political polarization
Ai, social media and political polarizationAi, social media and political polarization
Ai, social media and political polarizationJim Isaak
 
Commercialization Forum
Commercialization ForumCommercialization Forum
Commercialization ForumCafe Commerce
 
Shortcut To Career Preparation 2009 2010
Shortcut To Career Preparation 2009 2010Shortcut To Career Preparation 2009 2010
Shortcut To Career Preparation 2009 2010Charessa Dela Roma
 
Honeypot Projects are Everywhere
Honeypot Projects are EverywhereHoneypot Projects are Everywhere
Honeypot Projects are EverywhereChristos Beretas
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)Thinkful
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)Thinkful
 
CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...
CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...
CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...Jeffrey Nichols
 
Big Data Ethics Cjbe july 2021
Big Data Ethics Cjbe july 2021Big Data Ethics Cjbe july 2021
Big Data Ethics Cjbe july 2021andygustafson
 
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...Lewis Shepherd
 
Data science and ethics in fundraising
Data science and ethics in fundraisingData science and ethics in fundraising
Data science and ethics in fundraisingJames Orton
 
Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...
Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...
Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...Marcus Leaning
 
DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...
DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...
DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...IDATE DigiWorld
 
Evidence Live Social Media Workshop
Evidence Live Social Media Workshop Evidence Live Social Media Workshop
Evidence Live Social Media Workshop Douglas Badenoch
 

Similar to Profiling Big Data sources to assess their selectivity (20)

Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
Ethics and Data
Ethics and DataEthics and Data
Ethics and Data
 
Inclusive networks (2014 Forum on Workplace Inclusion)
Inclusive networks (2014 Forum on Workplace Inclusion)Inclusive networks (2014 Forum on Workplace Inclusion)
Inclusive networks (2014 Forum on Workplace Inclusion)
 
Bigdatahuman
BigdatahumanBigdatahuman
Bigdatahuman
 
Maura Tuohy
Maura TuohyMaura Tuohy
Maura Tuohy
 
HumanityRoad training - Basic Crisis Information Management
HumanityRoad training - Basic Crisis Information ManagementHumanityRoad training - Basic Crisis Information Management
HumanityRoad training - Basic Crisis Information Management
 
Ai, social media and political polarization
Ai, social media and political polarizationAi, social media and political polarization
Ai, social media and political polarization
 
Commercialization Forum
Commercialization ForumCommercialization Forum
Commercialization Forum
 
Shortcut To Career Preparation 2009 2010
Shortcut To Career Preparation 2009 2010Shortcut To Career Preparation 2009 2010
Shortcut To Career Preparation 2009 2010
 
Social Media in the Workplace
Social Media in the WorkplaceSocial Media in the Workplace
Social Media in the Workplace
 
Honeypot Projects are Everywhere
Honeypot Projects are EverywhereHoneypot Projects are Everywhere
Honeypot Projects are Everywhere
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
 
CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...
CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...
CHI 2014 Panel: Opportunities and Risks of Discovering Personality Traits fro...
 
Big Data Ethics Cjbe july 2021
Big Data Ethics Cjbe july 2021Big Data Ethics Cjbe july 2021
Big Data Ethics Cjbe july 2021
 
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
 
Data science and ethics in fundraising
Data science and ethics in fundraisingData science and ethics in fundraising
Data science and ethics in fundraising
 
Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...
Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...
Lecture 10 Inferential Data Analysis, Personality Quizes and Fake News...
 
DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...
DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...
DWS16 - Plenary - Earning digital trust - Vesselin Popov, University of Cambr...
 
Evidence Live Social Media Workshop
Evidence Live Social Media Workshop Evidence Live Social Media Workshop
Evidence Live Social Media Workshop
 

More from Piet J.H. Daas

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their usePiet J.H. Daas
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsPiet J.H. Daas
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)Piet J.H. Daas
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesPiet J.H. Daas
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statisticsPiet J.H. Daas
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasPiet J.H. Daas
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsPiet J.H. Daas
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSPiet J.H. Daas
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45Piet J.H. Daas
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation MannheimPiet J.H. Daas
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daasPiet J.H. Daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekPiet J.H. Daas
 
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyUsing Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyPiet J.H. Daas
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenPiet J.H. Daas
 
Big Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaBig Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaPiet J.H. Daas
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statisticsPiet J.H. Daas
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big DataPiet J.H. Daas
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...Piet J.H. Daas
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentationPiet J.H. Daas
 

More from Piet J.H. Daas (20)

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiek
 
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyUsing Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in Eindhoven
 
Big Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaBig Data presentation for Statistics Canada
Big Data presentation for Statistics Canada
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big Data
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentation
 

Recently uploaded

WORLD CREATIVITY AND INNOVATION DAY 2024.
WORLD CREATIVITY AND INNOVATION DAY 2024.WORLD CREATIVITY AND INNOVATION DAY 2024.
WORLD CREATIVITY AND INNOVATION DAY 2024.Christina Parmionova
 
How the Congressional Budget Office Assists Lawmakers
How the Congressional Budget Office Assists LawmakersHow the Congressional Budget Office Assists Lawmakers
How the Congressional Budget Office Assists LawmakersCongressional Budget Office
 
Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012
Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012
Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012rehmti665
 
(多少钱)Dal毕业证国外本科学位证
(多少钱)Dal毕业证国外本科学位证(多少钱)Dal毕业证国外本科学位证
(多少钱)Dal毕业证国外本科学位证mbetknu
 
Earth Day 2024 - AMC "COMMON GROUND'' movie night.
Earth Day 2024 - AMC "COMMON GROUND'' movie night.Earth Day 2024 - AMC "COMMON GROUND'' movie night.
Earth Day 2024 - AMC "COMMON GROUND'' movie night.Christina Parmionova
 
Enhancing Indigenous Peoples' right to self-determination in the context of t...
Enhancing Indigenous Peoples' right to self-determination in the context of t...Enhancing Indigenous Peoples' right to self-determination in the context of t...
Enhancing Indigenous Peoples' right to self-determination in the context of t...Christina Parmionova
 
No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...
No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...
No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...narwatsonia7
 
Precarious profits? Why firms use insecure contracts, and what would change t...
Precarious profits? Why firms use insecure contracts, and what would change t...Precarious profits? Why firms use insecure contracts, and what would change t...
Precarious profits? Why firms use insecure contracts, and what would change t...ResolutionFoundation
 
Take action for a healthier planet and brighter future.
Take action for a healthier planet and brighter future.Take action for a healthier planet and brighter future.
Take action for a healthier planet and brighter future.Christina Parmionova
 
(官方原版办理)BU毕业证国外大学毕业证样本
(官方原版办理)BU毕业证国外大学毕业证样本(官方原版办理)BU毕业证国外大学毕业证样本
(官方原版办理)BU毕业证国外大学毕业证样本mbetknu
 
Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...
Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...
Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...narwatsonia7
 
Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…
Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…
Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…nishakur201
 
Call Girls Rohini Delhi reach out to us at ☎ 9711199012
Call Girls Rohini Delhi reach out to us at ☎ 9711199012Call Girls Rohini Delhi reach out to us at ☎ 9711199012
Call Girls Rohini Delhi reach out to us at ☎ 9711199012rehmti665
 
productionpost-productiondiary-240320114322-5004daf6.pptx
productionpost-productiondiary-240320114322-5004daf6.pptxproductionpost-productiondiary-240320114322-5004daf6.pptx
productionpost-productiondiary-240320114322-5004daf6.pptxHenryBriggs2
 
Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...
Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...
Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...narwatsonia7
 
VIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,Ms
VIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,MsVIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,Ms
VIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,Msankitnayak356677
 
VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...
VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...
VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...Suhani Kapoor
 

Recently uploaded (20)

WORLD CREATIVITY AND INNOVATION DAY 2024.
WORLD CREATIVITY AND INNOVATION DAY 2024.WORLD CREATIVITY AND INNOVATION DAY 2024.
WORLD CREATIVITY AND INNOVATION DAY 2024.
 
How the Congressional Budget Office Assists Lawmakers
How the Congressional Budget Office Assists LawmakersHow the Congressional Budget Office Assists Lawmakers
How the Congressional Budget Office Assists Lawmakers
 
Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012
Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012
Call Girls Connaught Place Delhi reach out to us at ☎ 9711199012
 
(多少钱)Dal毕业证国外本科学位证
(多少钱)Dal毕业证国外本科学位证(多少钱)Dal毕业证国外本科学位证
(多少钱)Dal毕业证国外本科学位证
 
Earth Day 2024 - AMC "COMMON GROUND'' movie night.
Earth Day 2024 - AMC "COMMON GROUND'' movie night.Earth Day 2024 - AMC "COMMON GROUND'' movie night.
Earth Day 2024 - AMC "COMMON GROUND'' movie night.
 
Enhancing Indigenous Peoples' right to self-determination in the context of t...
Enhancing Indigenous Peoples' right to self-determination in the context of t...Enhancing Indigenous Peoples' right to self-determination in the context of t...
Enhancing Indigenous Peoples' right to self-determination in the context of t...
 
No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...
No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...
No.1 Call Girls in Basavanagudi ! 7001305949 ₹2999 Only and Free Hotel Delive...
 
Model Town (Delhi) 9953330565 Escorts, Call Girls Services
Model Town (Delhi)  9953330565 Escorts, Call Girls ServicesModel Town (Delhi)  9953330565 Escorts, Call Girls Services
Model Town (Delhi) 9953330565 Escorts, Call Girls Services
 
Precarious profits? Why firms use insecure contracts, and what would change t...
Precarious profits? Why firms use insecure contracts, and what would change t...Precarious profits? Why firms use insecure contracts, and what would change t...
Precarious profits? Why firms use insecure contracts, and what would change t...
 
Take action for a healthier planet and brighter future.
Take action for a healthier planet and brighter future.Take action for a healthier planet and brighter future.
Take action for a healthier planet and brighter future.
 
(官方原版办理)BU毕业证国外大学毕业证样本
(官方原版办理)BU毕业证国外大学毕业证样本(官方原版办理)BU毕业证国外大学毕业证样本
(官方原版办理)BU毕业证国外大学毕业证样本
 
Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...
Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...
Call Girls Service Race Course Road Just Call 7001305949 Enjoy College Girls ...
 
9953330565 Low Rate Call Girls In Adarsh Nagar Delhi NCR
9953330565 Low Rate Call Girls In Adarsh Nagar Delhi NCR9953330565 Low Rate Call Girls In Adarsh Nagar Delhi NCR
9953330565 Low Rate Call Girls In Adarsh Nagar Delhi NCR
 
Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…
Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…
Goa Escorts WhatsApp Number South Goa Call Girl … 8588052666…
 
The Federal Budget and Health Care Policy
The Federal Budget and Health Care PolicyThe Federal Budget and Health Care Policy
The Federal Budget and Health Care Policy
 
Call Girls Rohini Delhi reach out to us at ☎ 9711199012
Call Girls Rohini Delhi reach out to us at ☎ 9711199012Call Girls Rohini Delhi reach out to us at ☎ 9711199012
Call Girls Rohini Delhi reach out to us at ☎ 9711199012
 
productionpost-productiondiary-240320114322-5004daf6.pptx
productionpost-productiondiary-240320114322-5004daf6.pptxproductionpost-productiondiary-240320114322-5004daf6.pptx
productionpost-productiondiary-240320114322-5004daf6.pptx
 
Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...
Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...
Russian Call Girl Hebbagodi ! 7001305949 ₹2999 Only and Free Hotel Delivery 2...
 
VIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,Ms
VIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,MsVIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,Ms
VIP Greater Noida Call Girls 9711199012 Escorts Service Noida Extension,Ms
 
VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...
VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...
VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...
 

Profiling Big Data sources to assess their selectivity

  • 1. Piet Daas and Joep Burger With special thanks to Marco Puts & Dong Nguyen1 Profiling Big Data sources to assess their selectivity
  • 2. Big Data – More and more organizations want to use Big Data as a new/additional source of information – However, there are some major challenges : – Selectivity of Big Data – Source does not have to completely cover the target population – What part of the population is included? 2
  • 3. Profiling: extracting ‘features’ 3 – Extract background characteristics (‘features’) from the ‘units’ in Big Data in an attempt to determine its selectivity ‐ The need for this depends on the ‘type’ of Big data source and its foreseen use – Important background characteristics for statistics are: ‐ Persons: gender, age, income, education, origin, urbanicity, household composition, .. ‐ Companies: number of employees, turnover, type of economic activity, legal form, ..
  • 4. Social Media: Twitter as an example 4 – On Social media persons, companies and ‘others’ can create an account and create messages ‐ In the Netherlands 70% of the population is active on social media – What kind of information is available on Twitter of a user ‐ Focus on gender! – Let’s look at a profile: @pietdaas
  • 5. 5 1)Name 2) Short bio 3) Messages content 4) Picture
  • 6. Studied a Twitter sample – From a list of Dutch Twitter users (~330.000) – A random sample of 1000 unique ids was drawn – Of the sample: ‐ 844 profiles still existed • 844 had a name • 583 provided a short bio • 473 created ‘tweets’ • 804 had a ‘non-default’ picture • 409 Men (49%) • 282 Women (33%) • 153 ‘Others’ (18%) • companies, organizations, dogs, cats, ‘bots’.. 6 DefaultTwitter picture
  • 7. Gender findings: 1) First name 7 – Used Dutch ‘Voornamenbank’ website (First name database) – Score between 0 and 1 (female – male); 676 of 844 (80%) names were registered – Unknown names scored -1 (usually companies/organizations)
  • 8. 8 Gender findings: 2) Short bio – If a short bio is provided ‐ Quite a number of people mention there ‘position’ in the family • Mother, father, papa, mama, ‘son of’, etc. ‐ Sometimes also occupations are mentioned that reflect the gender (‘studente’) ‐ 155 of 583 (27%) indicated there gender in short bio ‐ Need to check both English and Dutch texts
  • 9. Gender findings: 3) Tweets content – In cooperation with University of Twente (Dong Nguyen) – Machine learning approach that determines gender specific writing style ‐ Language specific: Messages need to be Dutch! ‐ 437 of 473 (92%) persons that created tweets could be classified
  • 10. Gender findings: 4) Profile picture 10 – Use OpenCV to process pictures 1) Face recognition 2) Standardisation of faces (resize & rotate) 3) Classify faces according to gender - 603 of 804 (75%) profile pictures had 1 or more faces on it 1 2 3
  • 11. Gender findings: overall results 11 Diagnostic Odds Ratio = (TP/FN) / (FP/TN) random guessing log(DOR) = 0 ‐ Multi-agent findings • Need clever ways to combine these • Take processing efficiency of the ‘agent’ into consideration Diagnostic Odds Ratio (log) First name 6.41 Short bio 3.50 Tweet content 2.36 Picture (faces) 0.72
  • 12. Thank you for your attention ! 12