SlideShare a Scribd company logo
1 of 34
Download to read offline
An analysis of the user occupational class
through Twitter content
Daniel Preot¸iuc-Pietro1
Vasileios Lampos2
Nikolaos Aletras2
1Computer and Information Science 2Department of Computer Science
University of Pennsylvania University College London
29 July 2015
Motivation
User attribute prediction from text is successful:
Age (Rao et al. 2010 ACL)
Gender (Burger et al. 2011 EMNLP)
Location (Eisenstein et al. 2011 EMNLP)
Personality (Schwartz et al. 2013 PLoS One)
Impact (Lampos et al. 2014 EACL)
Political orientation (Volkova et al. 2014 ACL)
Mental illness (Coppersmith et al. 2014 ACL)
Downstream applications are benefiting from this:
Sentiment analysis (Volkova et al. 2013 EMNLP)
Text classification (Hovy 2015 ACL)
However...
Socio-economic factors (occupation, social class, education,
income) play a vital role in language use
(Bernstein 1960, Labov 1972/2006)
No large scale user level dataset to date
Applications:
sociological analysis of language use
embedding to downstream tasks (e.g. controlling for
socio-economic status)
At a Glance
Our contributions:
Predicting new user attribute: occupation
New dataset: user ←→ occupation
Gaussian Process classification for NLP tasks
Feature ranking and analysis using non-linear methods
Standard Occupational Classification
Standardised job classification taxonomy
Developed and used by the UK Office for National Statistics
(ONS)
Hierarchical:
1-digit (major) groups: 9
2-digit (sub-major) groups: 25
3-digit (minor) groups: 90
4-digit (unit) groups: 369
Jobs grouped by skill requirements
Standard Occupational Classification
C1 Managers, Directors and Senior Officials
11 Corporate Managers and Directors
111 Chief Executives and Senior Officials
1115 Chief Executives and Senior Officials
Job: chief executive, bank manager
1116 Elected Officers and Representatives
112 Production Managers and Directors
113 Functional Managers and Directors
115 Financial Institution Managers and Directors
116 Managers and Directors in Transport and Logistics
117 Senior Officers in Protective Services
118 Health and Social Services Managers and Directors
119 Managers and Directors in Retail and Wholesale
12 Other Managers and Proprietors
Standard Occupational Classification
C2 Professional Occupations
Job: mechanical engineer, pediatrist, postdoctoral researcher
C3 Associate Professional and Technical Occupations
Job: system administrator, dispensing optician
C4 Administrative and Secretarial Occupations
Job: legal clerk, company secretary
C5 Skilled Trades Occupations
Job: electrical fitter, tailor
C6 Caring, Leisure, Other Service Occupations
Job: school assistant, hairdresser
C7 Sales and Customer Service Occupations
Job: sales assistant, telephonist
C8 Process, Plant and Machine Operatives
Job: factory worker, van driver
C9 Elementary Occupations
Job: shelf stacker, bartender
Data
5,191 users ←→ 3-digit job group
Users collected by self-disclosure of job title in profile
Manually filtered by the authors
10M tweets, average 94.4 users per 3-digit group
Data
Here we classify only at the 1-digit top level group (9 classes)
Feature representation and labels available online
Raw data available for research purposes on request (per
Twitter TOS)
Features
User Level features (18), such as:
number of:
followers
friends
listings
tweets
proportion of:
retweets
hashtags
@-replies
links
average:
tweets/day
retweets/tweet
Features
Focus on interpretable features for analysis
Compute over reference corpus of 400M tweets:
SVD embeddings and clusters
Word2Vec (W2V) embeddings and clusters
SVD Features
Compute word × word similarity matrix
Similarity metric is Normalized PMI (Bouma 2009) using the
entire tweet as context
SVD with different number of dimensions (30, 50, 100, 200)
User is represented by summing its word representations
The low-dimensional features offer no interpretability
SVD Features
Spectral clustering to get hard clusters of words (30, 50, 100, 200
clusters)
Each cluster consists of distributionally similar words ←→ topic
User is represented by the number of times he uses a word
from each cluster.
Word2Vec Features
Trained Word2Vec (layer size 50) on our Twitter reference
corpus
Spectral clustering on the word × word similiarity matrix (30,
50, 100, 200 clusters)
Similarity is cosine similarity of words in the embedding space
Gaussian Processes
Brings together several key ideas in one framework:
Bayesian
kernelised
non-parametric
non-linear
modelling uncertainty
Elegant and powerful framework, with growing popularity in
machine learning and application domains
Gaussian Process Graphical Model View
f ∼ GP(m, k)
y ∼ N(f(x), σ2)
f : RD− > R is a latent
function
y is a noisy realisation
of f(x)
k is the covariance
function or kernel
m and σ2 are learnt
from data
k
f
yx
σ
N
Gaussian Process Classification
Pass latent function through logistic function to squash the
input from (−∞, ∞) to obtain probability, π(x) = p(yi = 1|fi)
(similar to logistic regression)
The likelihood is non-Gaussian and solution is not analytical
Inference using Expectation propagation (EP)
FITC approximation for large data
Gaussian Process Classification
ARD kernel learns feature importance → features most
discriminative between classes
We learn 9 one-vs-all binary classifiers
This way, we find the most predictive features consistent for all
classes
Gaussian Process Resources
Free book:
http://www.gaussianprocess.org/gpml/chapters/
Gaussian Process Resources
GPs for Natural Language Processing tutorial (ACL 2014)
http://www.preotiuc.ro
GP Schools in Sheffield and roadshows in Kampala,
Pereira, Nyeri, Melbourne
http://ml.dcs.shef.ac.uk/gpss/
Annotated bibliography and other materials
http://www.gaussianprocess.org
GPy Toolkit (Python)
https://github.com/SheffieldML/GPy
Prediction
34
31.5
34.2
25
30
35
40
45
50
55
User Level
LR SVM-RBF GP Baseline
Stratified 10 fold cross-validation
Prediction
34
40
31.5
43.1
34.2
43.8
25
30
35
40
45
50
55
User Level SVD-E (200)
LR SVM-RBF GP Baseline
Stratified 10 fold cross-validation
Prediction
34
40
44.2
31.5
43.1
47.9
34.2
43.8
48.2
25
30
35
40
45
50
55
User Level SVD-E (200) SVD-C (200)
LR SVM-RBF GP Baseline
Stratified 10 fold cross-validation
Prediction
34
40
44.2
42.5
31.5
43.1
47.9
49
34.2
43.8
48.2 48.4
25
30
35
40
45
50
55
User Level SVD-E (200) SVD-C (200) W2V-E (50)
LR SVM-RBF GP Baseline
Stratified 10 fold cross-validation
Prediction
34
40
44.2
42.5
46.9
31.5
43.1
47.9
49
51.7
34.2
43.8
48.2 48.4
52.7
25
30
35
40
45
50
55
User Level SVD-E (200) SVD-C (200) W2V-E (50) W2V-C (200)
LR SVM-RBF GP Baseline
Stratified 10 fold cross-validation
Prediction Analysis
User level features have no predictive value
Clusters outperform embeddings
Word2Vec features are better than SVD/NPMI for prediction
Non-linear methods (SVM-RBF and GP) significantly
outperform linear methods
52.7% accuracy for 9-class classification is decent
Class Comparison
Jensen-Shannon Divergence between topic distributions across
occupational classes
Some clusters of occupations are observable
1 2 3 4 5 6 7 8 9
1
2
3
4
5
6
7
8
9
0.00
0.01
0.02
0.03
Feature Analysis
Rank Manual Label Topic (most frequent words)
1 Arts art, design, print, collection,
poster, painting, custom, logo,
printing, drawing
2 Health risk, cancer, mental, stress, pa-
tients, treatment, surgery, dis-
ease, drugs, doctor
3 Beauty Care beauty, natural, dry, skin, mas-
sage, plastic, spray, facial, treat-
ments, soap
4 Higher Education students, research, board, stu-
dent, college, education, library,
schools, teaching, teachers
5 Software Engineering service, data, system, services,
access, security, development,
software, testing, standard
Most predictive Word2Vec 200 clusters as given by Gaussian
Process ARD ranking
Feature Analysis
Rank Manual Label Topic (most frequent words)
7 Football van, foster, cole, winger, terry,
reckons, youngster, rooney,
fielding, kenny
8 Corporate patent, industry, reports, global,
survey, leading, firm, 2015, in-
novation, financial
9 Cooking recipe, meat, salad, egg, soup,
sauce, beef, served, pork, rice
12 Elongated Words wait, till, til, yay, ahhh, hoo,
woo, woot, whoop, woohoo
16 Politics human, culture, justice, religion,
democracy, religious, humanity,
tradition, ancient, racism
Most predictive Word2Vec 200 clusters as given by Gaussian
Process ARD ranking
Feature Analysis - Cumulative density functions
0.001 0.01 0.05
0
0.2
0.4
0.6
0.8
1
Topic proportion
Userprobability
Higher Education (#21)
C1
C2
C3
C4
C5
C6
C7
C8
C9
Topic more prevalent → CDF line closer to bottom-right corner
Feature Analysis - Cumulative density functions
0.001 0.01 0.05
0
0.2
0.4
0.6
0.8
1
Topic proportion
Userprobability
Arts (#116)
C1
C2
C3
C4
C5
C6
C7
C8
C9
Topic more prevalent → CDF line closer to bottom-right corner
Feature Analysis - Cumulative density functions
0.001 0.01 0.05
0
0.2
0.4
0.6
0.8
1
Topic proportion
Userprobability
Elongated Words (#164)
C1
C2
C3
C4
C5
C6
C7
C8
C9
Topic more prevalent → CDF line closer to bottom-right corner
Feature Analysis
Comparison of mean topic usage between supersets of
occupational classes (1-2 vs. 6-9)
Take Aways
User occupation influences language use in social media
Non-linear methods (Gaussian Processes) obtain significant
gains over linear methods
Topic (clusters) features are both predictive and interpretable
New dataset available for research

More Related Content

Similar to Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class through Twitter content

Medinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling WorshopMedinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling Worshop
Koray Atalag
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
butest
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
Marcel Kurovski
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
inovex GmbH
 
Assessment Worksheet Aligning Risks, Threats, and Vuln.docx
Assessment Worksheet Aligning Risks, Threats, and Vuln.docxAssessment Worksheet Aligning Risks, Threats, and Vuln.docx
Assessment Worksheet Aligning Risks, Threats, and Vuln.docx
festockton
 
THIC MedIX Summer 2015 Poster
THIC MedIX Summer 2015 PosterTHIC MedIX Summer 2015 Poster
THIC MedIX Summer 2015 Poster
Diana Zajac
 
Innovation at the Edge_Final
Innovation at the Edge_FinalInnovation at the Edge_Final
Innovation at the Edge_Final
Chris Waller
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
butest
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
juliennehar
 

Similar to Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class through Twitter content (20)

Medinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling WorshopMedinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling Worshop
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Assessment Worksheet Aligning Risks, Threats, and Vuln.docx
Assessment Worksheet Aligning Risks, Threats, and Vuln.docxAssessment Worksheet Aligning Risks, Threats, and Vuln.docx
Assessment Worksheet Aligning Risks, Threats, and Vuln.docx
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Understanding Knowledge Services
Understanding Knowledge ServicesUnderstanding Knowledge Services
Understanding Knowledge Services
 
THIC MedIX Summer 2015 Poster
THIC MedIX Summer 2015 PosterTHIC MedIX Summer 2015 Poster
THIC MedIX Summer 2015 Poster
 
Innovation at the Edge_Final
Innovation at the Edge_FinalInnovation at the Edge_Final
Innovation at the Edge_Final
 
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris WallerPistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
 
Curation-Friendly Tools for the Scientific Researcher
Curation-Friendly Tools for the Scientific ResearcherCuration-Friendly Tools for the Scientific Researcher
Curation-Friendly Tools for the Scientific Researcher
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
 
Service science intro 20110606 v1
Service science intro 20110606 v1Service science intro 20110606 v1
Service science intro 20110606 v1
 
Fuschi current Research and Developments
Fuschi current Research and DevelopmentsFuschi current Research and Developments
Fuschi current Research and Developments
 
Code.org TCEA 2015
Code.org TCEA 2015Code.org TCEA 2015
Code.org TCEA 2015
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 
The Analytics and Data Science Landscape
The Analytics and Data Science LandscapeThe Analytics and Data Science Landscape
The Analytics and Data Science Landscape
 
Big Data in Education
Big Data in EducationBig Data in Education
Big Data in Education
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 

More from Association for Computational Linguistics

More from Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Recently uploaded (20)

How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 

Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class through Twitter content

  • 1. An analysis of the user occupational class through Twitter content Daniel Preot¸iuc-Pietro1 Vasileios Lampos2 Nikolaos Aletras2 1Computer and Information Science 2Department of Computer Science University of Pennsylvania University College London 29 July 2015
  • 2. Motivation User attribute prediction from text is successful: Age (Rao et al. 2010 ACL) Gender (Burger et al. 2011 EMNLP) Location (Eisenstein et al. 2011 EMNLP) Personality (Schwartz et al. 2013 PLoS One) Impact (Lampos et al. 2014 EACL) Political orientation (Volkova et al. 2014 ACL) Mental illness (Coppersmith et al. 2014 ACL) Downstream applications are benefiting from this: Sentiment analysis (Volkova et al. 2013 EMNLP) Text classification (Hovy 2015 ACL)
  • 3. However... Socio-economic factors (occupation, social class, education, income) play a vital role in language use (Bernstein 1960, Labov 1972/2006) No large scale user level dataset to date Applications: sociological analysis of language use embedding to downstream tasks (e.g. controlling for socio-economic status)
  • 4. At a Glance Our contributions: Predicting new user attribute: occupation New dataset: user ←→ occupation Gaussian Process classification for NLP tasks Feature ranking and analysis using non-linear methods
  • 5. Standard Occupational Classification Standardised job classification taxonomy Developed and used by the UK Office for National Statistics (ONS) Hierarchical: 1-digit (major) groups: 9 2-digit (sub-major) groups: 25 3-digit (minor) groups: 90 4-digit (unit) groups: 369 Jobs grouped by skill requirements
  • 6. Standard Occupational Classification C1 Managers, Directors and Senior Officials 11 Corporate Managers and Directors 111 Chief Executives and Senior Officials 1115 Chief Executives and Senior Officials Job: chief executive, bank manager 1116 Elected Officers and Representatives 112 Production Managers and Directors 113 Functional Managers and Directors 115 Financial Institution Managers and Directors 116 Managers and Directors in Transport and Logistics 117 Senior Officers in Protective Services 118 Health and Social Services Managers and Directors 119 Managers and Directors in Retail and Wholesale 12 Other Managers and Proprietors
  • 7. Standard Occupational Classification C2 Professional Occupations Job: mechanical engineer, pediatrist, postdoctoral researcher C3 Associate Professional and Technical Occupations Job: system administrator, dispensing optician C4 Administrative and Secretarial Occupations Job: legal clerk, company secretary C5 Skilled Trades Occupations Job: electrical fitter, tailor C6 Caring, Leisure, Other Service Occupations Job: school assistant, hairdresser C7 Sales and Customer Service Occupations Job: sales assistant, telephonist C8 Process, Plant and Machine Operatives Job: factory worker, van driver C9 Elementary Occupations Job: shelf stacker, bartender
  • 8. Data 5,191 users ←→ 3-digit job group Users collected by self-disclosure of job title in profile Manually filtered by the authors 10M tweets, average 94.4 users per 3-digit group
  • 9. Data Here we classify only at the 1-digit top level group (9 classes) Feature representation and labels available online Raw data available for research purposes on request (per Twitter TOS)
  • 10. Features User Level features (18), such as: number of: followers friends listings tweets proportion of: retweets hashtags @-replies links average: tweets/day retweets/tweet
  • 11. Features Focus on interpretable features for analysis Compute over reference corpus of 400M tweets: SVD embeddings and clusters Word2Vec (W2V) embeddings and clusters
  • 12. SVD Features Compute word × word similarity matrix Similarity metric is Normalized PMI (Bouma 2009) using the entire tweet as context SVD with different number of dimensions (30, 50, 100, 200) User is represented by summing its word representations The low-dimensional features offer no interpretability
  • 13. SVD Features Spectral clustering to get hard clusters of words (30, 50, 100, 200 clusters) Each cluster consists of distributionally similar words ←→ topic User is represented by the number of times he uses a word from each cluster.
  • 14. Word2Vec Features Trained Word2Vec (layer size 50) on our Twitter reference corpus Spectral clustering on the word × word similiarity matrix (30, 50, 100, 200 clusters) Similarity is cosine similarity of words in the embedding space
  • 15. Gaussian Processes Brings together several key ideas in one framework: Bayesian kernelised non-parametric non-linear modelling uncertainty Elegant and powerful framework, with growing popularity in machine learning and application domains
  • 16. Gaussian Process Graphical Model View f ∼ GP(m, k) y ∼ N(f(x), σ2) f : RD− > R is a latent function y is a noisy realisation of f(x) k is the covariance function or kernel m and σ2 are learnt from data k f yx σ N
  • 17. Gaussian Process Classification Pass latent function through logistic function to squash the input from (−∞, ∞) to obtain probability, π(x) = p(yi = 1|fi) (similar to logistic regression) The likelihood is non-Gaussian and solution is not analytical Inference using Expectation propagation (EP) FITC approximation for large data
  • 18. Gaussian Process Classification ARD kernel learns feature importance → features most discriminative between classes We learn 9 one-vs-all binary classifiers This way, we find the most predictive features consistent for all classes
  • 19. Gaussian Process Resources Free book: http://www.gaussianprocess.org/gpml/chapters/
  • 20. Gaussian Process Resources GPs for Natural Language Processing tutorial (ACL 2014) http://www.preotiuc.ro GP Schools in Sheffield and roadshows in Kampala, Pereira, Nyeri, Melbourne http://ml.dcs.shef.ac.uk/gpss/ Annotated bibliography and other materials http://www.gaussianprocess.org GPy Toolkit (Python) https://github.com/SheffieldML/GPy
  • 21. Prediction 34 31.5 34.2 25 30 35 40 45 50 55 User Level LR SVM-RBF GP Baseline Stratified 10 fold cross-validation
  • 22. Prediction 34 40 31.5 43.1 34.2 43.8 25 30 35 40 45 50 55 User Level SVD-E (200) LR SVM-RBF GP Baseline Stratified 10 fold cross-validation
  • 23. Prediction 34 40 44.2 31.5 43.1 47.9 34.2 43.8 48.2 25 30 35 40 45 50 55 User Level SVD-E (200) SVD-C (200) LR SVM-RBF GP Baseline Stratified 10 fold cross-validation
  • 24. Prediction 34 40 44.2 42.5 31.5 43.1 47.9 49 34.2 43.8 48.2 48.4 25 30 35 40 45 50 55 User Level SVD-E (200) SVD-C (200) W2V-E (50) LR SVM-RBF GP Baseline Stratified 10 fold cross-validation
  • 25. Prediction 34 40 44.2 42.5 46.9 31.5 43.1 47.9 49 51.7 34.2 43.8 48.2 48.4 52.7 25 30 35 40 45 50 55 User Level SVD-E (200) SVD-C (200) W2V-E (50) W2V-C (200) LR SVM-RBF GP Baseline Stratified 10 fold cross-validation
  • 26. Prediction Analysis User level features have no predictive value Clusters outperform embeddings Word2Vec features are better than SVD/NPMI for prediction Non-linear methods (SVM-RBF and GP) significantly outperform linear methods 52.7% accuracy for 9-class classification is decent
  • 27. Class Comparison Jensen-Shannon Divergence between topic distributions across occupational classes Some clusters of occupations are observable 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 0.00 0.01 0.02 0.03
  • 28. Feature Analysis Rank Manual Label Topic (most frequent words) 1 Arts art, design, print, collection, poster, painting, custom, logo, printing, drawing 2 Health risk, cancer, mental, stress, pa- tients, treatment, surgery, dis- ease, drugs, doctor 3 Beauty Care beauty, natural, dry, skin, mas- sage, plastic, spray, facial, treat- ments, soap 4 Higher Education students, research, board, stu- dent, college, education, library, schools, teaching, teachers 5 Software Engineering service, data, system, services, access, security, development, software, testing, standard Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking
  • 29. Feature Analysis Rank Manual Label Topic (most frequent words) 7 Football van, foster, cole, winger, terry, reckons, youngster, rooney, fielding, kenny 8 Corporate patent, industry, reports, global, survey, leading, firm, 2015, in- novation, financial 9 Cooking recipe, meat, salad, egg, soup, sauce, beef, served, pork, rice 12 Elongated Words wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo 16 Politics human, culture, justice, religion, democracy, religious, humanity, tradition, ancient, racism Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking
  • 30. Feature Analysis - Cumulative density functions 0.001 0.01 0.05 0 0.2 0.4 0.6 0.8 1 Topic proportion Userprobability Higher Education (#21) C1 C2 C3 C4 C5 C6 C7 C8 C9 Topic more prevalent → CDF line closer to bottom-right corner
  • 31. Feature Analysis - Cumulative density functions 0.001 0.01 0.05 0 0.2 0.4 0.6 0.8 1 Topic proportion Userprobability Arts (#116) C1 C2 C3 C4 C5 C6 C7 C8 C9 Topic more prevalent → CDF line closer to bottom-right corner
  • 32. Feature Analysis - Cumulative density functions 0.001 0.01 0.05 0 0.2 0.4 0.6 0.8 1 Topic proportion Userprobability Elongated Words (#164) C1 C2 C3 C4 C5 C6 C7 C8 C9 Topic more prevalent → CDF line closer to bottom-right corner
  • 33. Feature Analysis Comparison of mean topic usage between supersets of occupational classes (1-2 vs. 6-9)
  • 34. Take Aways User occupation influences language use in social media Non-linear methods (Gaussian Processes) obtain significant gains over linear methods Topic (clusters) features are both predictive and interpretable New dataset available for research