Hacking Human Language (PyData London)

Hacking Human Language
Hendrik Heuer
London

– Hacker Ethics
“Access to computers, and anything which might teach you something about the way the world works, should be unlimited and total. Always yield to the Hands-On Imperative!”
Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.).
New York: Penguin Books. ISBN 0141000511. OCLC 47216793.

Agenda
• Computational Social Science
• Natural Language Processing
• Word Vector Representations
• Comparing different Wikipedia revisions
• Random Indexing
• word2vec patent

Computational Social Science

Computational Social Science 
Digital Humanities
• combines computer science & social sciences
• makes new research possible, e.g. the analysis of massive social networks and content of millions of books
immersion.media.mit.edu

D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo
Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI:
10.1145/2184319.2184336

Massive-scale automated analysis of news-content
• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)
• automatically annotated into 15 topic areas
• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances
I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N.
Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale
automated analysis of news-content: topics, style and gender’, Digital
Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet
“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”

Male-to-Female Ratio
Named Entity Recognition
“Gender bias in sports coverage (...) females only account for between only 7 and 25 per cent of coverage”

scikit-learn
gensim
Natural Language Toolkit
spaCy
word2vec
Machine Learning
Text Processing
Topic Modeling
Visualization
d3.js
Google Chart API
Highcharts

Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…
>>> import nltk
>>> # assumes the NLTK tokenizer and tagger data have been fetched with nltk.download()
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'),
('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'),
('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as',
'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway',
'NNP'), ('.', '.')]
NN* Noun
VB* Verb
JJ* Adjective
RB* Adverb
DT Determiner
IN Preposition
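
The news study earlier in the deck uses adjectives as a proxy for linguistic subjectivity. As a rough sketch of how tagger output could be summarised that way, the snippet below computes the share of adjectives among the (word, tag) pairs shown above. The function name `adjective_share` is hypothetical, and this is not claimed to be the exact measure used in the paper.

```python
def adjective_share(tagged_tokens):
    """Return the fraction of non-punctuation tokens tagged as adjectives (JJ*)."""
    # Punctuation tags such as '.' or ',' do not start with a letter; skip them.
    words = [(word, tag) for word, tag in tagged_tokens if tag[0].isalpha()]
    if not words:
        return 0.0
    adjectives = [word for word, tag in words if tag.startswith("JJ")]
    return len(adjectives) / len(words)


# Tagger output copied from the slide above.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]

print(adjective_share(tagged))  # one adjective ("same") among 13 words
```

Comparing this share across topic areas would give a crude subjectivity ranking; a lexicon such as SentiWordNet would be needed to weigh how subjective each adjective is.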

Named Entity Recognition
Identifying people, organizations, locations…
>>> import nltk
>>> # assumes the NLTK tokenizer, tagger and chunker data have been fetched with nltk.download()
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'),
('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'),
('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the',
'DT'), Tree('GPE', [('United', 'NNP'), ('States',
'NNPS')]), ('.', '.')])
ORGANIZATION Georgia-Pacific Corp., WHO
PERSON Eddy Bonte, President Obama
LOCATION Murray River, Mount Everest
DATE June, 2008-06-29
TIME two fifty a m, 1:30 p.m.
MONEY GBP 10.40
PERCENT twenty pct, 18.75 %
FACILITY Washington Monument, Stonehenge
GPE (geo-political entity) South East Asia, Midlothian

Sentiment Analysis
Tell if a sentence is positive or negative
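
To make the idea concrete, here is a deliberately naive lexicon-based sketch. The word list `TOY_LEXICON` and the function `sentence_polarity` are made up for illustration; a real system would draw scores from a resource such as SentiWordNet or use a trained classifier.

```python
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
TOY_LEXICON = {
    "good": 1.0, "great": 1.0, "happy": 1.0,
    "bad": -1.0, "terrible": -1.0, "sad": -1.0,
}


def sentence_polarity(sentence, lexicon=TOY_LEXICON):
    """Label a sentence by summing the lexicon scores of its words."""
    # Very rough tokenisation: split on whitespace, strip common punctuation.
    tokens = [tok.strip(".,!?;:").lower() for tok in sentence.split()]
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentence_polarity("What a great talk!"))
print(sentence_polarity("The weather was terrible."))
```

This ignores negation ("not good") and context entirely, which is one reason distributional word representations are attractive.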

Word Vector Representations

–J. R. Firth 1957
“You shall know a word by the company it keeps”
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.
Quoted after Socher

Vectors are directions in space
Quoted after Socher

word2vec
Representing a word with a vector

word2vec
Vectors can encode relationships
[Figure: word vectors for MAN, WOMAN, UNCLE, AUNT, KING, QUEEN]
T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

word2vec
[Figure: word vectors for KING, KINGS, QUEEN, QUEENS]

word2vec
Analogy puzzles
Link: http://radimrehurek.com/2014/02/word2vec-tutorial/

England is to London as Germany is to ?
598.7ms [["Berlin",0.563393235206604],["Dusseldorf",0.5625754594802856],
["Munich",0.5460122227668762],["Budapest",0.5285829901695251],
["Düsseldorf",0.5266501903533936]]

England is to Cameron as Germany is to ?
556.8ms [["Merkel",0.5016422867774963],["Schroeder",
0.49941977858543396],["Klaus",0.4981233477592468],["Schröder",
0.4947296977043152],["Peer_Nils",0.492642343044281]]

fast is to fastest as slow is to ?
806.2ms [["slowest",0.7025301456451416],["slower",0.6236234307289124],
["slowed",0.5842559337615967],["slowing",0.5462259650230408],["quickest",
0.5290436744689941]]

wake is to woken as be is to ?
929.9ms [["been",0.41698968410491943],["tobe",0.40402814745903015],
["are",0.3866569399833679],["being",0.3746173679828644],["notbe",
0.36837878823280334]]

Scotland is to haggis as Germany is to ?
793.5ms [["Currywurst",0.5284685492515564],["schnitzel",
0.5208959579467773],["wursts",0.5166285037994385],["sauerkraut",
0.512742817401886],["stollen",0.5095855593681335]]

communism is to Karl_Marx as capitalism is to ?
544.7ms [["Capitalism",0.5884973406791687],["capitalist",
0.5700926184654236],["Friedrich_Hayek",0.5352163314819336],
["Milton_Friedman",0.5348755121231079],["John_Maynard_Keynes",
0.5335651636123657]]
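
The analogy puzzles above boil down to vector arithmetic followed by a nearest-neighbour search. The sketch below shows the idea on six hand-made 4-dimensional vectors; the vectors and the helper names `cosine` and `analogy` are invented for illustration and are not output of a trained model.

```python
import math

# Hand-made toy vectors with dimensions (royal, male, female, relative).
VECTORS = {
    "man":   (0, 1, 0, 0),
    "woman": (0, 0, 1, 0),
    "king":  (1, 1, 0, 0),
    "queen": (1, 0, 1, 0),
    "uncle": (0, 1, 0, 1),
    "aunt":  (0, 0, 1, 1),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors=VECTORS, topn=3):
    """Answer 'a is to b as c is to ?' by ranking words near b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [(word, cosine(target, vec))
                  for word, vec in vectors.items() if word not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


print(analogy("man", "king", "woman"))   # best match: queen
print(analogy("man", "uncle", "woman"))  # best match: aunt
```

With a trained model, gensim exposes the same operation through `most_similar(positive=[...], negative=[...])`, which is the kind of query behind the timed results on the slides.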

Word vector representations in Python

Link: https://radimrehurek.com/gensim/models/word2vec.html

Link: https://honnibal.github.io/spaCy/

spaCy
Dependency-Based Word representations by Levy and Goldberg
Gensim
word2vec by Mikolov et al
2 words context window

word2vec
Training word vectors with generator
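
A minimal sketch of what such a generator might look like, assuming a plain-text corpus with one sentence per line. The class name `SentenceCorpus` and the demo file are hypothetical. The class is written as a restartable iterable rather than a one-shot generator, because training typically passes over the corpus more than once.

```python
import os
import tempfile


class SentenceCorpus:
    """Restartable iterable that yields one tokenised sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus never has to
        # fit into memory and can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip empty lines
                    yield tokens


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("You shall know a word\n\nby the company it keeps\n")
    sentences = SentenceCorpus(corpus_path)
    first_pass = list(sentences)
    second_pass = list(sentences)

print(first_pass)
# An object like `sentences` could then be handed to gensim, e.g.
# gensim.models.Word2Vec(sentences); parameter names for vector size and
# similar options differ between gensim versions, so check the linked docs.
```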

Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Machine Translation
T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages
for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available:
http://arxiv.org/abs/1309.4168

1. Find Word Representations: word2vec
Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/
[Slide label: linguistics]
2. Dimensionality Reduction: t-SNE
Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
3. Visualisation: JSON
Link: https://github.com/mbostock/d3/wiki/Gallery
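
The last step of this pipeline can be sketched as follows, assuming steps 1 and 2 have already produced a list of words and one 2-D coordinate per word (for example from `sklearn.manifold.TSNE(n_components=2).fit_transform(...)`). The function name `to_scatter_json`, the record layout and the coordinates are made up for illustration.

```python
import json


def to_scatter_json(words, coords):
    """Serialise words and 2-D coordinates as a JSON list of records."""
    if len(words) != len(coords):
        raise ValueError("words and coords must have the same length")
    records = [{"word": word, "x": float(x), "y": float(y)}
               for word, (x, y) in zip(words, coords)]
    return json.dumps(records, ensure_ascii=False)


words = ["linguistics", "grammar", "syntax"]
coords = [(0.0, 1.5), (0.4, 1.1), (-0.2, 1.8)]  # made-up positions

payload = to_scatter_json(words, coords)
print(payload)
```

A flat list of `{word, x, y}` records is easy to load into a d3.js scatter plot, but any layout the front end understands would do.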

Comparing Wikipedia revisions
Game of Thrones

Bird’s eye view
Game of Thrones

Bird’s eye view, intersection set
Game of Thrones

Characters in 2013 and 2015
Game of Thrones

Bird’s eye view
United States

Bird’s eye view, intersection set
United States

Bird’s eye view, 2015
United States

Bird’s eye view, 2013
United States

Google’s word2vec patent
Reactions from the community
• “Pretty much every time Google has engaged in patent infringement litigation, it has been against someone who has brought an infringement suit against them first. (…) it keeps inventions they are using out of the hands of patent trolls”
• “Idiotic. I'm surprised they didn't just patent matrix algebra”
• “fuck software patents”
https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/

Reactions from the community
• Omer Levy:
• The novelty claim in this patent is somewhat bogus
• word2vec is doing more or less what the NLP research community has been doing for the past 25 years
• much of the improvement in performance stems from preprocessing "hacks" and hyperparameter settings
• word2vec is a brilliantly efficient implementation of decade-old ideas
https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/

What does it change for NLP practitioners?
• “Likely nothing. It's probably one of the thousands of overly broad "defensive" patents held by companies.”
• “Didn't it have an Apache open source license before-hand?”

Random Indexing
Incremental word space model
• all information in vectors
• each word has a hash key
• n-dimensional vector
• most dimensions are 0
• a small number k of dimensions hold randomly distributed -1 or +1 values
• the dimension of the vectors is much smaller than the number of contexts
[Figure: hash key]

Random Indexing
Incremental word space model
• Every time you see a word w_i, add the hash key of the words in the context window v_{i-3}, …, v_{i+3} to the word’s context vector v_i
• After a number of occurrences, the context vector holds information about a word’s distribution
• dimensionality reduction, computationally less costly than methods like PCA
[Figure: hash key, context vector]
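
The two Random Indexing slides can be condensed into a short sketch. It assumes a tokenised text, a window of three words on either side, and sparse ternary hash keys with k non-zero entries. The function names `index_vector` and `random_indexing` and the default sizes are illustrative choices, not taken from a specific implementation.

```python
import random
from collections import defaultdict


def index_vector(word, dim=100, k=4):
    """Sparse random 'hash key': k entries are +1 or -1, the rest are 0."""
    rng = random.Random(word)  # seeded by the word, so the key is reproducible
    vec = [0] * dim
    for position in rng.sample(range(dim), k):
        vec[position] = rng.choice((-1, 1))
    return vec


def random_indexing(tokens, dim=100, k=4, window=3):
    """Build context vectors by summing the hash keys of neighbouring words."""
    context = defaultdict(lambda: [0] * dim)
    for i, word in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not its own context
            key = index_vector(tokens[j], dim, k)
            vec = context[word]
            for d, value in enumerate(key):
                vec[d] += value
    return dict(context)


tokens = "athens is the capital of greece and athens is a city".split()
vectors = random_indexing(tokens)
print(len(vectors["athens"]))  # fixed dimensionality, regardless of corpus size
```

Because each occurrence only adds a few numbers to an existing vector, the model can be updated incrementally as new text arrives, with no separate dimensionality-reduction pass.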

https://en.wikipedia.org/wiki/Athens
Gavagai Living Lexicon

Image Captioning
Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n,  
http://cs231n.stanford.edu/syllabus.html

Image Question Answering
H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, ‘Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering’, CoRR, vol. abs/1505.05612, 2015 [Online]. Available: http://arxiv.org/abs/1505.05612

Hacking Human Language
Hendrik Heuer
London
hendrikheuer@gmail.com
http://hen-drik.de
@hen_drik
Thanks to Andrii, Jussi & Roelof
Slides: https://tinyurl.com/pydata-language

word2vec
How it is trained
predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i
predict the surrounding words
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
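
The two training setups correspond to the CBOW and skip-gram architectures described by Mikolov et al. The sketch below only builds the input/output examples for a window of two words on either side; it does not train a network. The helper names `cbow_examples` and `skipgram_examples` are hypothetical.

```python
def cbow_examples(tokens, window=2):
    """(context words, current word) pairs: predict the current word."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to `window` words before and after position i.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """(current word, context word) pairs: predict the surrounding words."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, neighbour) for neighbour in context)
    return pairs


tokens = "you shall know a word".split()
print(cbow_examples(tokens)[2])       # context around "know"
print(skipgram_examples(tokens)[:2])  # pairs generated from "you"
```

Words near the start or end of a sentence simply get a shorter context here; padding would be another option.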
