Smart Data Meetup - NLP At Scale

NLP at Scale
TrustYou Review Summaries
Steffen Wenz, CTO
@tyengineering
Smart Data Meetup Sep 2017

What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.

✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe »
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag,
KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)

steffen@trustyou.com
● Studied CS here in Munich
● Joined TrustYou in 2008 as working student …
● First product manager, then CTO since 2012
● Manages very diverse tech stack and team of
30 engineers:
○ Data engineers
○ Data scientists
○ Web developers

TrustYou Architecture
TrustYou ♥ Spark + Python
[Diagram components: Crawling, NLP, Machine Learning, Aggregation, Text Generation, API]
3M new reviews per week!

Typical NLP Pipeline
Raw text → Tokenization → Part-of-speech tagging → Parsing → Sentence splitting → Structured data!
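
As a rough illustration of the first pipeline stages, here is a minimal sketch in plain Python. The regular expressions and the function names `split_sentences` and `tokenize` are assumptions made for this example, not part of any library, and the sketch splits sentences before tokenizing purely for simplicity. Real NLP libraries handle abbreviations, contractions and many other cases that these rules ignore.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # A token is a word (optionally with an internal apostrophe)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "This hotel is truly huge and beautiful. I'll be back for sure!"
# "Structured data" here is simply a list of token lists, one per sentence.
structured = [tokenize(s) for s in split_sentences(text)]
for tokens in structured:
    print(tokens)
```

Tagging and parsing would then attach labels and tree structure to these tokens.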

spaCy
● NLP library
● Implements NLP pipelines for English, German + others
● Focus on performance and production use
○ Largely implemented in Cython … heard of it? :)
● Plays well with machine learning libraries
● Unlike NLTK, which is more for educational use and
sees few updates these days …

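import spacy
# "en" was the model shortcut when this talk was given; spaCy 3+ needs a
# full package name such as "en_core_web_sm".
nlp = spacy.load("en")
doc = nlp("This hotel is truly huge and beautiful. I'll be back for sure")
# Iterating over a Doc yields one token per step.
for word in doc:
    print(word)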

doc = nlp("I'll code code")
for word in doc:
    # Surface form, lemma, and coarse part-of-speech tag of each token
    print(word.text, word.lemma_, word.pos_)
# I -PRON- PRON
# 'll will VERB
# code code VERB
# code code NOUN

Dependency parsing
Try “displaCy” yourself

Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “อาหารรสชาติดี”
● “خدمة جيدة”
● Custom NLP framework, extension of NLTK
● Supports 20 languages natively!
● Custom, domain-specific tagging and parsing

Let’s do some ML!
Hm, how to model text as input for ML?
● Enter word vectors!
● Goal: Find a mapping word → high-dimensional vector
where similar words have vectors close together
● “Woman” is close to “lady” is close to “womna”
● Word2vec is an algorithm to produce such embeddings

woman, lady, dude = nlp("woman lady dude")
woman.similarity(lady) # 0.78
woman.similarity(dude) # 0.40
● Word2vec considers words to be similar if they occur in
similar contexts, i.e. typically have the same words
before/after them
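
To make "vectors close together" concrete: spaCy's `similarity` is, by default, a cosine similarity between vectors. The sketch below computes it by hand. The three-dimensional `toy_vectors` are invented for illustration only; real embeddings have hundreds of dimensions and are learned from text.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real word2vec output.
toy_vectors = {
    "woman": [0.9, 0.8, 0.1],
    "lady": [0.8, 0.9, 0.2],
    "dude": [0.1, 0.3, 0.9],
}

close = cosine_similarity(toy_vectors["woman"], toy_vectors["lady"])
far = cosine_similarity(toy_vectors["woman"], toy_vectors["dude"])
print(round(close, 2), round(far, 2))
```

Vectors pointing in nearly the same direction score close to 1, and unrelated ones score lower.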

(Somewhat Pointless) Application
Goal: Predict review overall score just from title!
Input (here, word vectors)
Output (here, review score, so just one node)
Training = rejiggering the weights of these arrows,
trying to closely match training data
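
What "rejiggering the weights" means can be sketched with a single linear neuron trained by gradient descent. This is a deliberately tiny stand-in, not the network from the talk: the `samples`, the learning rate and the `train` function are all assumptions made for this example.

```python
def train(samples, epochs=200, learning_rate=0.1):
    # One weight per input feature, plus a bias term.
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for features, target in samples:
            prediction = sum(w * x for w, x in zip(weights, features)) + bias
            error = prediction - target
            total += error * error
            # Step against the gradient of the squared error
            # (the constant factor 2 is folded into the learning rate).
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
        losses.append(total / len(samples))  # mean squared error this epoch
    return weights, bias, losses

# Hypothetical training data: (input features, target score in [0, 1]).
samples = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.3), ([1.0, 1.0], 0.7)]
weights, bias, losses = train(samples)
print(losses[0], losses[-1])
```

The mean squared error shrinks from epoch to epoch as the weights move toward values that reproduce the training targets.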

ML 10 years ago
● Work goes into feature engineering
● Bigram models, POS tags, parse trees … whatever helps
Deep learning now
● Big NNs capture lots of complexity … can work directly on raw data
● Bad news for domain experts :’(
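
For contrast, this is the kind of hand-crafted feature the "ML 10 years ago" column refers to: a minimal sketch of bigram counting. The function name `bigram_features` is an assumption for this example.

```python
from collections import Counter

def bigram_features(tokens):
    # Count each pair of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

features = bigram_features("hotel was not beautiful".split())
print(features)
```

A bigram such as ("not", "beautiful") captures a negation that single-word counts would miss.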

Keras
● High-level machine learning library
● API for defining neural network architecture
● Training & prediction are done in a backend:
○ TensorFlow
○ Theano
○ …

Neural network topology, in Keras
Disclaimer:

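import keras

# embeddings (a vocabulary_size x vector_size matrix), max_length, lstm_units
# and dropout_rate are defined elsewhere; this uses the Keras API of the time.
model = keras.models.Sequential()
model.add(
    keras.layers.Embedding(
        embeddings.shape[0],
        embeddings.shape[1],
        input_length=max_length,
        trainable=False,  # keep the pre-trained word vectors fixed
        weights=[embeddings],
    )
)
model.add(keras.layers.Bidirectional(keras.layers.LSTM(lstm_units)))
model.add(keras.layers.Dropout(dropout_rate))
# A single output node: the predicted review score
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])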
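
An Embedding layer takes fixed-length sequences of integer word indices as input, so each title has to be encoded first. The sketch below shows one way this might look. The `vocabulary` mapping and the `encode_title` helper are hypothetical and not taken from the talk's code; index 0 is assumed to stand for both padding and unknown words.

```python
def encode_title(title, vocabulary, max_length, unknown_index=0):
    # Map each lowercased word to its row index in the embedding matrix.
    indices = [vocabulary.get(word, unknown_index)
               for word in title.lower().split()]
    # Truncate long titles, then left-pad short ones with zeros.
    indices = indices[:max_length]
    return [0] * (max_length - len(indices)) + indices

# Hypothetical vocabulary; a real one would cover every word with a vector.
vocabulary = {"beautiful": 1, "hotel": 2, "good": 3}
encoded = encode_title("Beautiful hotel", vocabulary, max_length=4)
print(encoded)
```

Every encoded title then has the same length, matching the `input_length` given to the layer.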

Let’s try our model:
“Perfect” → 97
“Beautiful hotel” → 95
“Good hotel” → 84
“Could have been better” → 65
“Hotel was not beautiful …” → 51
“Right in the middle of Munich” → 89
“Right in the middle of Bagdad” → 89
Trained on 1M review titles.
Mean squared error: 12/100

Try for yourself:
Code on GitHub

ML @ TrustYou
● gensim doc2vec model to create hotel embedding
● Used – together with other features – for various hotel-level classifiers

Workflow Management
& Scaling Up

Python on Hadoop:
… possible, but not natural

Spark
● Distributed computing framework
● User writes driver program which transparently
schedules execution in a cluster
● Faster and more expressive than MapReduce

Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *

Let’s try Spark!
import operator as op, re
# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
years_hist = (
    sc.textFile("blame")
    # one element per date found on a line (the captured year)
    .flatMap(lambda line: re.findall(year_re, line))
    .map(lambda year: (year, 1))
    # sum the 1s per year
    .reduceByKey(op.add)
)
output = years_hist.collect()
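
To see what the Spark job computes without a cluster, the same histogram can be sketched in plain Python. This is a local, single-process equivalent for understanding only; the function name `years_histogram` is an assumption, and the sample lines are taken from the blame output shown earlier.

```python
import re
from collections import Counter

YEAR_RE = r"(\d{4})-\d{2}-\d{2}"

def years_histogram(lines):
    counts = Counter()
    for line in lines:
        # Same idea as flatMap + map + reduceByKey: find every year
        # on the line and add one to that year's count.
        counts.update(re.findall(YEAR_RE, line))
    return dict(counts)

sample = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no",
    "badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include \"pg",
]
histogram = years_histogram(sample)
print(histogram)
```

Spark's version expresses the same logic as transformations that can run in parallel across the lines of the file.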

Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs

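import luigi

class MyTask(luigi.Task):
    def output(self):
        # LocalTarget is a concrete target; luigi.Target itself is abstract
        return luigi.LocalTarget("/to/make/this/file")

    def requires(self):
        # INeedThisTask and AndAlsoThisTask are placeholder task classes
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg"),
        ]

    def run(self):
        # ... then ...
        # I do this to make it!
        with self.output().open("w") as out_file:
            out_file.write("done\n")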
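
The dependency resolution and resume behavior listed above can be illustrated with a toy scheduler. This is a sketch of the idea only, not how Luigi is implemented: the task names, the `requires` mapping and the `run_with_dependencies` function are all invented for this example.

```python
def run_with_dependencies(task, requires, done, log):
    """Run `task` after everything it requires; skip tasks already done."""
    if task in done:
        return  # output already exists, nothing to redo
    for dependency in requires.get(task, []):
        run_with_dependencies(dependency, requires, done, log)
    log.append(task)  # stands in for calling the task's run() method
    done.add(task)

# Hypothetical pipeline: which task needs which other tasks.
requires = {
    "report": ["aggregate", "train"],
    "aggregate": ["crawl"],
    "train": ["crawl"],
}
# Pretend "crawl" finished in an earlier run that failed later on.
done = {"crawl"}
log = []
run_with_dependencies("report", requires, done, log)
print(log)
```

Because finished tasks are skipped, rerunning the pipeline after a failure only repeats the work that is still missing.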

https://github.com/trustyou/tyluigiutils
Utilities for getting Luigi, Spark and virtualenv to work
together

We’re hiring data scientists and software engineers!
http://www.trustyou.com/careers/
steffen@trustyou.com
