My talk @ Smart Data Meetup in Munich: https://www.meetup.com/SmartData/events/237731342/
Learn how to build a modern NLP + deep learning pipeline with spaCy and Keras. Code samples here: https://github.com/trustyou/meetups/tree/master/smart-data
5. ✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe » (French: “Superb view”)
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
9. steffen@trustyou.com
● Studied CS here in Munich
● Joined TrustYou in 2008 as working student …
● First product manager, then CTO since 2012
● Manages a very diverse tech stack and a team of 30 engineers:
○ Data engineers
○ Data scientists
○ Web developers
10. TrustYou Architecture
TrustYou ♥ Spark + Python
[Architecture diagram: Crawling, NLP, Machine Learning, Text Generation, Aggregation, API; 3M new reviews per week!]
12. Typical NLP Pipeline
Raw text → Tokenization → Part-of-speech tagging → Parsing → Sentence splitting → Structured data!
13. ● spaCy, an NLP library
● Implements NLP pipelines for English, German + others
● Focus on performance and production use
○ Largely implemented in Cython … heard of it? :)
● Plays well with machine learning libraries
● Unlike NLTK, which is aimed more at educational use and sees few updates these days …
14. import spacy

nlp = spacy.load("en")
doc = nlp("This hotel is truly huge and "
          "beautiful. I'll be back for sure")
for word in doc:
    print(word)  # one token per line: This, hotel, is, …
15. doc = nlp("I'll code code")
for word in doc:
    print(word.text, word.lemma_, word.pos_)
# I -PRON- PRON
# 'll will VERB
# code code VERB
# code code NOUN
17. Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جيدة” (Arabic: “Good service”)
● Custom NLP framework, extension of NLTK
● Supports 20 languages natively!
● Custom, domain-specific tagging and parsing
18. Let’s do some ML!
Hm, how to model text as input for ML?
● Enter Word vectors!
● Goal: Find a mapping word → high-dimensional vector
where similar words have vectors close together
● “Woman” is close to “lady” is close to “womna” (yes, even the misspelling)
● Word2vec is an algorithm to produce such embeddings
19. woman, lady, dude = nlp("woman lady dude")
woman.similarity(lady) # 0.78
woman.similarity(dude) # 0.40
● Word2vec considers words to be similar if they occur in
similar contexts, i.e. typically have the same words
before/after them
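spaCy’s `similarity` boils down to the cosine of the angle between the two word vectors; a minimal NumPy sketch (the 3-dim vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 for same direction, ~0 for unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dim "word vectors", invented for illustration:
woman = np.array([0.9, 0.8, 0.1])
lady  = np.array([0.8, 0.9, 0.2])
dude  = np.array([0.1, 0.5, 0.9])

print(cosine_similarity(woman, lady))  # high: similar contexts
print(cosine_similarity(woman, dude))  # noticeably lower
```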
21. (Somewhat Pointless) Application
Goal: Predict review overall score just from title!
[Diagram: an input layer (here, word vectors) feeding a single output node (here, the review score)]
Training = rejiggering the weights of these arrows, trying to closely match the training data.
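That “rejiggering” can be sketched in plain NumPy: a single output node, squared error, and gradient descent on the weights. Everything below (sizes, learning rate, target) is illustrative, not the model from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins:
x = rng.normal(size=300)           # input: a title's averaged word vector
y_true = 0.9                       # target: the normalized review score
w = rng.normal(size=300) * 0.01    # the "arrows": one weight per input dimension
b = 0.0                            # bias of the single output node

for _ in range(100):
    y_pred = x @ w + b             # forward pass through the one node
    grad = 2 * (y_pred - y_true)   # gradient of the squared error w.r.t. y_pred
    w -= 0.001 * grad * x          # rejigger the weights ...
    b -= 0.001 * grad              # ... to better match the training example

print(round(float(x @ w + b), 3))  # close to 0.9 after training
```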
22. ML 10 years ago
● Work goes into feature engineering
● Bigram models, POS tags, parse trees … whatever helps
Deep learning now
● Big NNs capture lots of complexity … can work directly on raw data
● Bad news for domain experts :’(
23. Keras
● High-level machine learning library
● API for defining neural network architecture
● Training & prediction is done in a backend:
○ TensorFlow
○ Theano
○ …
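A network like the one sketched two slides back (word vectors in, a single score node out) takes only a few lines of Keras; the hidden-layer size here is a guess, not the configuration from the talk:

```python
from keras.layers import Dense, Input
from keras.models import Sequential

model = Sequential([
    Input(shape=(300,)),           # one 300-dim word vector per review title
    Dense(64, activation="relu"),  # hidden layer (size is a guess)
    Dense(1),                      # single output node: the predicted score
])
model.compile(optimizer="adam", loss="mse")

# model.fit(X_train, y_train, epochs=10)  # X_train: (n_titles, 300), y_train: scores
```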
26. Let’s try our model:
“Perfect” → 97
“Beautiful hotel” → 95
“Good hotel” → 84
“Could have been better” → 65
“Hotel was not beautiful …” → 51
“Right in the middle of Munich” → 89
“Right in the middle of Bagdad” → 89
Trained on 1M review titles.
Mean squared error: 12/100
33. Spark
● Distributed computing framework
● The user writes a driver program which transparently
schedules execution on a cluster
● Faster and more expressive than MapReduce
34. Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *
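One plausible Spark driver for this demo buckets those blame lines by commit year. The `blame_year` helper below is hypothetical, with the Spark plumbing sketched in comments (assuming pyspark and the `blame` file produced above):

```python
import re

def blame_year(line):
    """Pull the commit year out of one line of `git blame` output."""
    m = re.search(r"\b(\d{4})-\d{2}-\d{2}\b", line)
    return int(m.group(1)) if m else None

# With pyspark available, the driver could then count lines per year:
#
# from pyspark import SparkContext
# sc = SparkContext("local[*]", "cpython-blame-age")
# per_year = (sc.textFile("blame")
#               .map(blame_year)
#               .filter(lambda year: year is not None)
#               .map(lambda year: (year, 1))
#               .reduceByKey(lambda a, b: a + b)
#               .sortByKey())
# print(per_year.collect())
```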
38. Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
39. import luigi

class MyTask(luigi.Task):

    def output(self):
        return luigi.LocalTarget("/to/make/this/file")

    def requires(self):
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg"),
        ]

    def run(self):
        # ... then ...
        # I do this to make it!
        ...