This talk introduces computational social science as a new research discipline, gives a brief introduction to natural language processing and explains how word vector representations are computed and how to use them in Python.
Word vector representations such as word2vec encode semantic relationships such as gender and "is the capital city of". This makes it easy to find similar words and compare them visually. In this talk, I compare Wikipedia article revisions using word2vec.
2. – Hacker Ethics
“Access to computers
—
and anything which might
teach you something about
the way the world works
—
should be unlimited and total.
Always yield to
the Hands-On Imperative!”
Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.).
New York: Penguin Books. ISBN 0141000511. OCLC 47216793.
3. Agenda
• Computational Social Science
• Natural Language Processing
• Word Vector Representations
• Comparing different Wikipedia revisions
• Random Indexing
• word2vec patent
6. Computational Social Science
Digital Humanities
• combines computer science & social sciences
• makes new research possible, e.g. the analysis of
massive social networks and content of millions
of books
immersion.media.mit.edu
7. D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo
Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI:
10.1145/2184319.2184336
8. Massive-scale automated
analysis of news-content
• 2.5 million articles from 498 different
English-language news outlets
(Reuters & New York Times Corpus)
• automatically annotated into 15 topic areas
• the topics were compared with regard to
readability, linguistic subjectivity and
gender imbalances
I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N.
Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale
automated analysis of news-content: topics, style and gender’, Digital
Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928
10. “Low level of political interest and engagement
could be connected to the
lack of subjectivity (adjectival excess)”
Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet
12. Male-to-Female Ratio
Named Entity Recognition
“Gender bias in sports coverage (...)
females only account for between
only 7 and 25 per cent of coverage”
14. Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…
>>> import nltk
>>> # needs the NLTK tokenizer and tagger data (see nltk.download())
>>> text = ("In the middle ages Sweden had the "
...         "same king as Denmark and Norway.")
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'),
('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'),
('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as',
'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway',
'NNP'), ('.', '.')]
NN* Noun
VB* Verb
JJ* Adjective
RB* Adverb
DT Determiner
IN Preposition
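The tag prefixes in the table above are enough for a rough look at the "adjectival excess" idea from the linguistic subjectivity slide. The following is only a minimal sketch: it reuses the tagged tokens printed above, and the helper names `coarse_class` and `adjective_share` are made up for this illustration. The cited study also used SentiWordNet, which is not reproduced here.

```python
# Tagged tokens copied from the pos_tag output shown on the slide.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'),
          ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'),
          ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'),
          ('Denmark', 'NNP'), ('and', 'CC'), ('Norway', 'NNP'), ('.', '.')]

# Prefix -> coarse class, following the NN*/VB*/JJ*/RB* rows of the table.
PREFIXES = (('NN', 'noun'), ('VB', 'verb'), ('JJ', 'adjective'), ('RB', 'adverb'))


def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse word class (hypothetical helper)."""
    for prefix, name in PREFIXES:
        if tag.startswith(prefix):
            return name
    return 'other'


def adjective_share(tagged_tokens):
    """Fraction of word tokens (punctuation excluded) tagged as adjectives."""
    words = [(w, t) for w, t in tagged_tokens if any(c.isalnum() for c in w)]
    if not words:
        return 0.0
    adjectives = [w for w, t in words if coarse_class(t) == 'adjective']
    return len(adjectives) / len(words)


share = adjective_share(tagged)
print(round(share, 3))
```

For the example sentence only "same" is an adjective, so the share is 1 out of 13 word tokens. A real analysis would aggregate this over whole articles and topics.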
15. Named Entity Recognition
Identifying people, organizations, locations…
>>> import nltk
>>> # needs the NLTK tokenizer, tagger and NE chunker data (see nltk.download())
>>> text = ("New York City is the largest city "
...         "in the United States.")
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'),
('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'),
('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the',
'DT'), Tree('GPE', [('United', 'NNP'), ('States',
'NNPS')]), ('.', '.')])
ORGANIZATION Georgia-Pacific Corp., WHO
PERSON Eddy Bonte, President Obama
LOCATION Murray River, Mount Everest
DATE June, 2008-06-29
TIME two fifty a m, 1:30 p.m.
MONEY GBP 10.40
PERCENT twenty pct, 18.75 %
FACILITY Washington Monument, Stonehenge
GPE (geo-political entity) South East Asia, Midlothian
19. –J. R. Firth 1957
“You shall know a word
by the company it keeps”
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in
linguistic analysis, 1–32. Oxford: Blackwell.
23. Vectors are directions in space
Quoted after Socher
word2vec
Representing a word with a vector
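One simple way to turn "the company a word keeps" into a vector is to count the words that appear near it. The sketch below is not word2vec itself, only a count-based illustration of the same idea. The three-sentence corpus, the window size and the function names are assumptions made up for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    "the king rules the country",
    "the queen rules the country",
    "the dog chases the ball",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs['king'], vecs['queen']))
print(cosine(vecs['king'], vecs['dog']))
```

In this toy corpus "king" and "queen" keep exactly the same company, so their vectors point in the same direction, while "dog" shares only part of its context with "king" and ends up less similar.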
24. T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word
Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online].
Available: http://arxiv.org/abs/1301.3781
MAN
WOMAN
AUNT
UNCLE
QUEEN
KING
word2vec
Vectors can encode relationships
KINGS
KING
QUEEN
QUEENS
26. T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word
Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online].
Available: http://arxiv.org/abs/1301.3781
word2vec
Vectors can encode relationships
27. England is to Cameron as
Germany is to ?
556.8ms [["Merkel",0.5016422867774963],["Schroeder",
0.49941977858543396],["Klaus",0.4981233477592468],["Schröder",
0.4947296977043152],["Peer_Nils",0.492642343044281]]
England is to London as
Germany is to ?
598.7ms [["Berlin",0.563393235206604],["Dusseldorf",0.5625754594802856],
["Munich",0.5460122227668762],["Budapest",0.5285829901695251],
["Düsseldorf",0.5266501903533936]]
Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
word2vec
Analogy puzzles
28. wake is to woken as
be is to ?
929.9ms [["been",0.41698968410491943],["tobe",0.40402814745903015],
["are",0.3866569399833679],["being",0.3746173679828644],["notbe",
0.36837878823280334]]
fast is to fastest as
slow is to ?
806.2ms [["slowest",0.7025301456451416],["slower",0.6236234307289124],
["slowed",0.5842559337615967],["slowing",0.5462259650230408],["quickest",
0.5290436744689941]]
Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
word2vec
Analogy puzzles
29. Scotland is to haggis as
Germany is to ?
Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
793.5ms [["Currywurst",0.5284685492515564],["schnitzel",
0.5208959579467773],["wursts",0.5166285037994385],["sauerkraut",
0.512742817401886],["stollen",0.5095855593681335]]
word2vec
Analogy puzzles
30. communism is to Karl_Marx as
capitalism is to ?
Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
544.7ms [["Capitalism",0.5884973406791687],["capitalist",
0.5700926184654236],["Friedrich_Hayek",0.5352163314819336],
["Milton_Friedman",0.5348755121231079],["John_Maynard_Keynes",
0.5335651636123657]]
word2vec
Analogy puzzles
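An analogy puzzle "a is to b as c is to ?" can be answered with vector arithmetic: compute b - a + c and look for the word whose vector points in the most similar direction. The sketch below uses hand-made two-dimensional toy vectors, not real word2vec output, and the `analogy` helper is a name invented for this illustration. With a trained gensim model, the corresponding call would typically be `most_similar(positive=[b, c], negative=[a])`.

```python
from math import sqrt

# Hand-made toy vectors (axis 0 ~ "royalty", axis 1 ~ "gender").
# These numbers are invented for illustration, not learned from text.
toy = {
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'uncle': [0.2, 1.0],
    'aunt':  [0.2, -1.0],
}


def cosine(u, v):
    """Cosine similarity of two dense vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy('man', 'woman', 'king', toy))
print(analogy('man', 'woman', 'uncle', toy))
```

In the toy space the "gender" offset between man and woman is the same as between king and queen and between uncle and aunt, which is why the arithmetic works here. Real embeddings only approximate this, as the similarity scores below 1.0 in the puzzle outputs above suggest.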
48. Machine Translation
T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages
for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available:
http://arxiv.org/abs/1309.4168
69. • “Pretty much every time
Google has engaged in
patent infringement
litigation, it has been against
someone who has brought
an infringement suit against
them first. (…) it keeps
inventions they are using out
of the hands of patent trolls”
• “Idiotic. I'm surprised they
didn't just patent matrix
algebra”
• “fuck software patents”
https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/
Google’s word2vec patent
Reactions from the community
70. • Omer Levy:
• The novelty claim in this
patent is somewhat bogus
• word2vec is doing more or
less what the NLP research
community has been doing
for the past 25 years
• much of the improvement in
performance stems from
preprocessing "hacks"
and hyperparameter
settings
• word2vec is a brilliantly
efficient implementation of
decade-old ideas
https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/
Google’s word2vec patent
Reactions from the community
71. Google’s word2vec patent
What does it change for NLP practitioners?
• “Likely nothing. It's probably one of the thousands of overly
broad "defensive" patents held by companies”
• “Didn't it have an Apache open source license before-hand?”
75. • all information in vectors
• each word has a hash key:
• an n-dimensional vector
• most dimensions are 0
• a small number k of dimensions hold randomly
distributed -1 or +1 values
• the dimension of the vectors is
much smaller than the number
of contexts
Random Indexing
Incremental word space model
76. • Every time you see a word w_i,
add the hash keys of the words in the
context window w_{i-3}, …, w_{i+3} to the
word’s context vector v_i
• After a number of occurrences, the
context vector holds information
about a word’s distribution
• dimensionality reduction,
computationally less costly than
methods like PCA
Random Indexing
Incremental word space model
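The two Random Indexing slides can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the tiny dimension (20), k = 4 and the fixed seed are choices made for this example, and the window of 3 words on each side follows the slide.

```python
import random


def index_vector(dim, k, rng):
    """Sparse random hash key: k entries are -1 or +1, all others are 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), k):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(tokens, dim=20, k=4, window=3, seed=0):
    """Return (hash keys, context vectors) for a token sequence."""
    rng = random.Random(seed)
    keys = {}
    for w in tokens:
        if w not in keys:  # one fixed hash key per distinct word
            keys[w] = index_vector(dim, k, rng)
    context = {w: [0] * dim for w in keys}
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Add the neighbour's hash key to this word's context vector.
            key = keys[tokens[j]]
            ctx = context[w]
            for d in range(dim):
                ctx[d] += key[d]
    return keys, context


tokens = "the king rules the country".split()
keys, context = random_indexing(tokens)
print(context['king'])
```

Because the context vectors have a fixed dimension chosen up front, new text can be folded in incrementally by further additions, with no separate dimensionality-reduction pass over a large co-occurrence matrix.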
85. Image Captioning
Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n,
http://cs231n.stanford.edu/syllabus.html
89. H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, ‘Are You Talking to a
Machine? Dataset and Methods for Multilingual Image Question Answering’,
CoRR, vol. abs/1505.05612, 2015 [Online]. Available: http://arxiv.org/abs/1505.05612
Image Question Answering
92. predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i
predict the surrounding words
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
word2vec
How it is trained
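The two training setups differ only in which side of the window is the input. In the cited Mikolov et al. paper, predicting the current word from its context is called CBOW and predicting the surrounding words from the current word is called skip-gram. The sketch below only generates the training examples for a window of two words on each side. It does not train a network, and the function names are invented for this illustration.

```python
def context_window(tokens, i, window=2):
    """Words up to `window` positions left and right of position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def cbow_examples(tokens, window=2):
    """(surrounding words, current word): predict the current word."""
    return [(context_window(tokens, i, window), w)
            for i, w in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """(current word, one surrounding word): predict the surrounding words."""
    return [(w, c)
            for i, w in enumerate(tokens)
            for c in context_window(tokens, i, window)]


tokens = "you shall know a word".split()
print(cbow_examples(tokens)[2])
print([pair for pair in skipgram_examples(tokens) if pair[0] == 'know'])
```

For the middle word "know", the first setup yields one example with four input words, while the second yields four separate (input, output) pairs. Words at the edges of the sentence simply get a shorter window.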