VIP Call Girls Service Bikaner Aishwarya 8250192130 Independent Escort Servic...
Profiling Big Data sources to assess their selectivity
1. Piet Daas and Joep Burger
With special thanks to Marco Puts & Dong Nguyen1
Profiling Big Data sources to
assess their selectivity
2. Big Data
– More and more organizations want to use Big Data as a
new/additional source of information
– However, there are some major challenges :
– Selectivity of Big Data
– Source does not have to completely cover the target
population
– What part of the population is included?
2
3. Profiling: extracting ‘features’
3
– Extract background characteristics (‘features’) from the ‘units’
in Big Data in an attempt to determine its selectivity
‐ The need for this depends on the ‘type’ of Big data source and its
foreseen use
– Important background characteristics for statistics are:
‐ Persons: gender, age, income, education, origin,
urbanicity, household composition, ..
‐ Companies: number of employees, turnover, type of
economic activity, legal form, ..
4. Social Media: Twitter as an example
4
– On Social media persons, companies and ‘others’ can
create an account and create messages
‐ In the Netherlands 70% of the population is active on social media
– What kind of information is available on Twitter of a user
‐ Focus on gender!
– Let’s look at a profile: @pietdaas
6. Studied a Twitter sample
– From a list of Dutch Twitter users (~330.000)
– A random sample of 1000 unique ids was drawn
– Of the sample:
‐ 844 profiles still existed
• 844 had a name
• 583 provided a short bio
• 473 created ‘tweets’
• 804 had a ‘non-default’ picture
• 409 Men (49%)
• 282 Women (33%)
• 153 ‘Others’ (18%)
• companies, organizations, dogs, cats, ‘bots’..
6
DefaultTwitter picture
7. Gender findings: 1) First name
7
– Used Dutch ‘Voornamenbank’ website (First name database)
– Score between 0 and 1 (female – male); 676 of 844 (80%) names were registered
– Unknown names scored -1 (usually companies/organizations)
8. 8
Gender findings: 2) Short bio
– If a short bio is provided
‐ Quite a number of people mention there ‘position’ in the
family
• Mother, father, papa, mama, ‘son of’, etc.
‐ Sometimes also occupations are mentioned that reflect the
gender (‘studente’)
‐ 155 of 583 (27%) indicated there gender in short bio
‐ Need to check both English and Dutch texts
9. Gender findings: 3) Tweets content
– In cooperation with University of Twente (Dong Nguyen)
– Machine learning approach that determines gender specific writing style
‐ Language specific: Messages need to be Dutch!
‐ 437 of 473 (92%) persons that created tweets could be classified
10. Gender findings: 4) Profile picture
10
– Use OpenCV to process pictures
1) Face recognition
2) Standardisation of faces (resize & rotate)
3) Classify faces according to gender
- 603 of 804 (75%) profile pictures had 1 or more faces on it
1
2
3
11. Gender findings: overall results
11
Diagnostic Odds Ratio =
(TP/FN) / (FP/TN)
random guessing
log(DOR) = 0
‐ Multi-agent findings
• Need clever ways to combine these
• Take processing efficiency of the ‘agent’ into consideration
Diagnostic Odds
Ratio (log)
First name 6.41
Short bio 3.50
Tweet content 2.36
Picture (faces) 0.72