Hacking Human Language (PyCon Sweden 2015)

Hacking Human Language
Hendrik Heuer
PyCon Sweden, Stockholm

– Hacker Ethics
“Access to computers, and anything which might teach you something about
the way the world works, should be unlimited and total.
Always yield to the Hands-On Imperative!”
Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.).
New York: Penguin Books. ISBN 0141000511. OCLC 47216793.

Agenda
• Computational Social Science
• Natural Language Processing
• Word Vector Representations
• Visualising and comparing my Google searches

D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo
Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI:
10.1145/2184319.2184336

Computational Social Science 
Digital Humanities
• combines computer science & social sciences
• makes new research possible, e.g. the analysis of massive social networks
  and the content of millions of books
immersion.media.mit.edu

Massive-scale automated analysis of news-content
• 2.5 million articles from 498 different English-language news outlets
  (Reuters & New York Times Corpus)
• automatically annotated into 15 topic areas
• the topics were compared with regard to readability, linguistic
  subjectivity and gender imbalances
I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N.
Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale
automated analysis of news-content: topics, style and gender’, Digital
Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

“Low level of political interest and engagement could be connected to the
lack of subjectivity (adjectival excess)”

Male-to-Female Ratio
Named Entity Recognition

“Gender bias in sports coverage (...) females only account for between
only 7 and 25 per cent of coverage”

scikit-learn
gensim
Natural Language Toolkit
spaCy
word2vec
Machine Learning
Text Processing
Topic Modeling
Visualization
d3.js
Google Chart API
Highcharts

Introduction to Natural Language Processing

Word Tokenization
Splitting a sentence into single words
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("All your base are belong to us")
['All', 'your', 'base', 'are', 'belong', 'to', 'us']

Sentence Tokenization
Splitting a text into sentences
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']

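Sentence Tokenization
Splitting a text into sentences
>>> import nltk
>>> # load the pre-trained Swedish Punkt model
>>> swedish_tokenizer = nltk.data.load("tokenizers/punkt/swedish.pickle")
>>> # the loaded object is a tokenizer; its tokenize method splits text
>>> sent_tokenize = swedish_tokenizer.tokenize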

Stemming
Finding the word stem or root form
>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
['investigation', 'woman']
>>> [porter.stem(w) for w in ['investigation', 'women']]
['investig', 'women']
>>> [lancaster.stem(w) for w in ['investigation', 'women']]
['investig', 'wom']

Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…
>>> import nltk
>>> text = ("In the middle ages Sweden had the "
...         "same king as Denmark and Norway.")
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'),
('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'),
('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as',
'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway',
'NNP'), ('.', '.')]
NN* Noun
VB* Verb
JJ* Adjective
RB* Adverb
DT Determiner
IN Preposition
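
The tag prefixes above are enough for a rough version of the adjective-based subjectivity measure mentioned earlier. The following is a minimal sketch that reuses the tagged output from the slide; the helper name `adjective_share` is hypothetical, and this is not the method of Flaounas et al., who also relied on SentiWordNet.

```python
# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'),
          ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'),
          ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'),
          ('Denmark', 'NNP'), ('and', 'CC'), ('Norway', 'NNP'), ('.', '.')]


def adjective_share(tagged_tokens):
    """Return the fraction of word tokens whose tag starts with JJ."""
    # Skip punctuation: in this example its tag does not start with a letter.
    words = [(word, tag) for word, tag in tagged_tokens if tag[0].isalpha()]
    if not words:
        return 0.0
    adjectives = [word for word, tag in words if tag.startswith('JJ')]
    return len(adjectives) / len(words)


print(adjective_share(tagged))
```

A higher share would hint at more adjective-heavy, and in that sense more subjective, writing; comparing topics would mean computing this per article and averaging.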

Named Entity Recognition
Identifying people, organizations, locations…
>>> import nltk
>>> text = ("New York City is the largest city in the "
...         "United States.")
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'),
('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'),
('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the',
'DT'), Tree('GPE', [('United', 'NNP'), ('States',
'NNPS')]), ('.', '.')])
ORGANIZATION                 Georgia-Pacific Corp., WHO
PERSON                       Eddy Bonte, President Obama
LOCATION                     Murray River, Mount Everest
DATE                         June, 2008-06-29
TIME                         two fifty a m, 1:30 p.m.
MONEY                        GBP 10.40
PERCENT                      twenty pct, 18.75 %
FACILITY                     Washington Monument, Stonehenge
GPE (geo-political entity)   South East Asia, Midlothian
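
To count or compare entities, the chunk tree has to be flattened into plain (type, text) pairs. The following is a minimal sketch, assuming NLTK is installed; the tree is rebuilt by hand from the output above so that no model download is needed, and the helper name `named_entities` is hypothetical.

```python
from nltk import Tree

# Chunk tree copied from the nltk.ne_chunk output above.
tree = Tree('S', [
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]),
    ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'),
    ('in', 'IN'), ('the', 'DT'),
    Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
    ('.', '.'),
])


def named_entities(chunk_tree):
    """Return (entity_type, text) pairs for each labelled chunk under the root."""
    entities = []
    for child in chunk_tree:
        # Plain tokens are (word, tag) tuples; named entities are subtrees.
        if isinstance(child, Tree):
            text = ' '.join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities


print(named_entities(tree))
```

Filtering the pairs for `PERSON` would be a first step towards something like the male-to-female ratio, although deciding the gender of a name needs an additional resource that is not shown here.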

Sentiment Analysis
Tell if a sentence is positive or negative
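
As a rough illustration of the idea, a sentence can be scored by counting words from a polarity lexicon. This is only a sketch under strong assumptions: the two word sets are a hypothetical toy lexicon, and the function name `sentiment` is made up for this example. Real systems use large resources such as SentiWordNet or a trained classifier.

```python
# Hypothetical toy lexicon; real resources such as SentiWordNet are far larger.
POSITIVE = {'good', 'great', 'love', 'happy'}
NEGATIVE = {'bad', 'awful', 'hate', 'sad'}


def sentiment(sentence):
    """Classify a sentence as 'positive', 'negative' or 'neutral' by word counts."""
    # Lower-case each word and strip common punctuation from its edges.
    words = [w.strip('.,!?').lower() for w in sentence.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'


print(sentiment("I love this great talk!"))
print(sentiment("What an awful, sad day."))
```

Word counting ignores negation and context ("not good" counts as positive), which is one reason why the word representations in the following slides are attractive.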

–J. R. Firth 1957
“You shall know a word by the company it keeps”
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in
linguistic analysis, 1–32. Oxford: Blackwell.
Quoted after Socher


Vectors are directions in space
Quoted after Socher

word2vec
Representing a word with a vector

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word
Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online].
Available: http://arxiv.org/abs/1301.3781
Vectors can encode relationships
[Figure: word2vec vector offsets between MAN and WOMAN, UNCLE and AUNT, KING and QUEEN]

man is to woman as king is to ?
[Figure: word2vec vector offsets between KING and KINGS, QUEEN and QUEENS]
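
The analogy question can be answered with vector arithmetic: compute woman - man + king and look for the nearest remaining word by cosine similarity. The following is a minimal sketch with hypothetical, hand-made 4-dimensional toy vectors; real word2vec vectors are learned from text and typically have hundreds of dimensions. The function names are made up for this example.

```python
import math

# Hypothetical toy vectors, chosen by hand so that the offsets line up.
vectors = {
    'man':   [1.0, 0.0, 0.0, 0.0],
    'woman': [0.0, 1.0, 0.0, 0.0],
    'king':  [1.0, 0.0, 1.0, 0.0],
    'queen': [0.0, 1.0, 1.0, 0.0],
    'uncle': [1.0, 0.0, 0.0, 1.0],
    'aunt':  [0.0, 1.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the nearest word to b - a + c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy('man', 'woman', 'king'))
```

With trained vectors the match is approximate rather than exact, which is why the nearest neighbour is used instead of testing for equality.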

word2vec

Link: https://radimrehurek.com/gensim/models/word2vec.html

Link: https://honnibal.github.io/spaCy/

spaCy
Dependency-Based Word Representations by Levy and Goldberg
Gensim
word2vec by Mikolov et al.
2 words context window

Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Machine Translation
T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages
for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available:
http://arxiv.org/abs/1309.4168

Link: https://support.google.com/websearch/answer/6068625?hl=en

{ "event":[
{"query":
{"id":[
{"timestamp_usec":"1317002730153183"}
],
"query_text":"google hangout"
}
},
{"query":
{"id":[
{"timestamp_usec":"1316577601549660"}
],
"query_text":"eurokrise"
}
},
{"query":
{"id":[
{"timestamp_usec":"1315592145720230"}
],
"query_text":"hoverboard"
}
}
parsed_json['event'][42]['query']['query_text']
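
Reading every query from such an export takes only a few lines. The following is a minimal sketch, assuming the structure shown above; the excerpt is closed with `]}` so that it parses, and the helper name `read_queries` is hypothetical. `timestamp_usec` is taken to be microseconds since the Unix epoch.

```python
import json
from datetime import datetime, timezone

# Excerpt in the same shape as the export shown above, closed so that it parses.
raw = '''
{"event": [
  {"query": {"id": [{"timestamp_usec": "1317002730153183"}],
             "query_text": "google hangout"}},
  {"query": {"id": [{"timestamp_usec": "1316577601549660"}],
             "query_text": "eurokrise"}},
  {"query": {"id": [{"timestamp_usec": "1315592145720230"}],
             "query_text": "hoverboard"}}
]}
'''


def read_queries(json_text):
    """Return (UTC datetime, query text) pairs, oldest first."""
    parsed_json = json.loads(json_text)
    queries = []
    for event in parsed_json['event']:
        query = event['query']
        # The timestamp is a string of microseconds; convert it to seconds.
        usec = int(query['id'][0]['timestamp_usec'])
        when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
        queries.append((when, query['query_text']))
    return sorted(queries)


for when, text in read_queries(raw):
    print(when.date(), text)
```

Grouping the pairs by date range is then enough to compare the searches of two periods.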

1. Find Word Representations: word2vec
2. Dimensionality Reduction: t-SNE
3. Output: JSON
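
Steps 2 and 3 can be sketched as follows, assuming NumPy and scikit-learn are installed. Step 1 is replaced by a hypothetical stand-in: random 50-dimensional vectors and placeholder words instead of vectors looked up in a trained word2vec model. The field names in the JSON output are also assumptions, not the format used in the talk.

```python
import json

import numpy as np
from sklearn.manifold import TSNE

# Step 1 (stand-in): placeholder words with random vectors. With a trained
# word2vec model, each row would be the learned vector of the word instead.
words = ['word%d' % i for i in range(30)]
rng = np.random.RandomState(0)
vectors = rng.normal(size=(len(words), 50))

# Step 2: reduce the vectors to two dimensions with t-SNE.
# The perplexity has to be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(vectors)

# Step 3: write one JSON record per word for a plotting library to read.
points = [{'word': word, 'x': float(x), 'y': float(y)}
          for word, (x, y) in zip(words, coords)]
output = json.dumps(points)
print(output[:80])
```

The resulting list can be loaded by a JavaScript charting library such as d3.js to draw each word at its (x, y) position.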

Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/

linguistics

Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Link: https://github.com/mbostock/d3/wiki/Gallery

My Google Searches
Oct – Dec 2014
Jul – Sep 2011
Both, 2011 & 2014

Hendrik Heuer
hendrikheuer@gmail.com
http://hen-drik.de
@hen_drik
Thanks to Andrii, Jussi & Roelof
Slides: https://tinyurl.com/pycon-word2vec

CBOW (continuous bag-of-words): predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i

Skip-gram: predict the surrounding words
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
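
The two setups differ only in which side of each training example is the input. The following is a minimal sketch that builds the examples for a window of two words on each side, using the names CBOW and skip-gram from Mikolov et al.; the function names are hypothetical and no model is trained here.

```python
def cbow_examples(tokens, window=2):
    """Yield (context words, current word) pairs: predict the current word."""
    for i, target in enumerate(tokens):
        # Up to `window` words before and after position i.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target


def skipgram_examples(tokens, window=2):
    """Yield (current word, surrounding word) pairs: predict the surroundings."""
    for context, target in cbow_examples(tokens, window):
        for neighbour in context:
            yield target, neighbour


tokens = ['All', 'your', 'base', 'are', 'belong', 'to', 'us']

for context, target in cbow_examples(tokens):
    print(context, '->', target)
```

At the edges of the sentence the context is shorter, because there are fewer than two words on one side.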
