NLP at Scale
TrustYou Review Summaries
Steffen Wenz, CTO
@tyengineering
Smart Data Meetup Sep 2017
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.
✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe »
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
steffen@trustyou.com
● Studied CS here in Munich
● Joined TrustYou in 2008 as working student …
● First product manager, then CTO since 2012
● Manages a very diverse tech stack and a team of 30 engineers:
○ Data engineers
○ Data scientists
○ Web developers
TrustYou Architecture
TrustYou ♥ Spark + Python
NLP
Text Generation
Machine Learning
Aggregation
Crawling API
3M new reviews per week!
Extracting Meaning from Text
Typical NLP Pipeline
Raw text
Tokenization
Part of speech tagging
Parsing
Sentence splitting
Structured data!
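To make the pipeline idea concrete, here is a minimal toy sketch in plain Python that turns raw text into structured data. It is an illustration only: the regex rules and the function names (split_sentences, tokenize, to_structured) are assumptions made for this example, it covers only sentence splitting and tokenization, and a real NLP library uses far more robust, language-aware steps.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # A token is a word (internal apostrophes kept) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

def to_structured(text):
    # Raw text in, a list of per-sentence records out.
    return [{"sentence": s, "tokens": tokenize(s)} for s in split_sentences(text)]

result = to_structured("This hotel is truly huge and beautiful. I'll be back for sure")
for record in result:
    print(record["tokens"])
```

Note that this toy tokenizer keeps "I'll" as one token, whereas a linguistically informed tokenizer would typically split off the contraction.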
spaCy
● NLP library
● Implements NLP pipelines for English, German + others
● Focus on performance and production use
○ Largely implemented in Cython … heard of it? :)
● Plays well with machine learning libraries
● Unlike NLTK, which is more for educational use and sees few updates these days …
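import spacy
# "en" is the model shortcut of older spaCy releases; newer releases load a named package such as "en_core_web_sm"
nlp = spacy.load("en")
doc = nlp("This hotel is truly huge and beautiful. I'll be back for sure")
for word in doc:
    print(word)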
doc = nlp("I'll code code")
for word in doc:
    print(word.text, word.lemma_, word.pos_)
# I -PRON- PRON
# 'll will VERB
# code code VERB
# code code NOUN
Dependency parsing
Try “displaCy” yourself
Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “อาหารรสชาติดี”
● “خدمة جيدة”
● Custom NLP framework, extension of NLTK
● Supports 20 languages natively!
● Custom, domain-specific tagging and parsing
Let’s do some ML!
Hm, how to model text as input for ML?
● Enter word vectors!
● Goal: Find a mapping word → high-dimensional vector where similar words have vectors close together
● “Woman” is close to “lady” is close to “womna”
● Word2vec is an algorithm to produce such embeddings
woman, lady, dude = nlp("woman lady dude")
woman.similarity(lady) # 0.78
woman.similarity(dude) # 0.40
● Word2vec considers words to be similar if they occur in similar contexts, i.e. typically have the same words before/after them
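The "similar contexts" idea can be sketched without any ML library. The example below is not word2vec: it is a count-based stand-in that represents each word by the words seen directly before and after it, then compares two words by cosine similarity (the usual measure for word vectors). The mini corpus and the helper names (context_vectors, cosine) are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up mini corpus, for illustration only.
corpus = [
    "the woman booked a room",
    "the lady booked a room",
    "the dude ordered a beer",
]

def context_vectors(sentences, window=1):
    # Map each word to a count of the words appearing within `window` positions of it.
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    # Cosine similarity of two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vecs = context_vectors(corpus)
woman_lady = cosine(vecs["woman"], vecs["lady"])
woman_dude = cosine(vecs["woman"], vecs["dude"])
print(woman_lady, woman_dude)
```

In this toy corpus "woman" and "lady" share both neighbours, so they score higher than "woman" and "dude", which share only one. Word2vec learns dense vectors rather than raw counts, but the underlying intuition is the same.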
(Somewhat Pointless) Application
Goal: Predict review overall score just from title!
Input (here, word vectors)
Output (here, review score, so just one node)
Training = rejiggering the weights of these arrows, trying to closely match training data
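As a rough sketch of what "rejiggering the weights" means, the example below trains a single linear output node with plain gradient descent on mean squared error. Everything here is an assumption for illustration: the three training pairs are made up, the inputs are two-dimensional instead of real word vectors, and there are no hidden layers.

```python
# Made-up training data: (input features, target score in [0, 1]).
data = [
    ([1.0, 0.0], 0.9),
    ([0.0, 1.0], 0.3),
    ([1.0, 1.0], 0.6),
]

def predict(weights, bias, x):
    # One output node: weighted sum of the inputs plus a bias.
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mse(weights, bias):
    # Mean squared error over the training data.
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def train(steps=2000, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            err = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2 * err * xi / len(data)
            grad_b += 2 * err / len(data)
        # Nudge every weight a little in the direction that lowers the error.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

initial_loss = mse([0.0, 0.0], 0.0)
weights, bias = train()
final_loss = mse(weights, bias)
print(initial_loss, final_loss)
```

A deep learning library does the same thing at scale: it computes the gradients automatically and updates millions of weights per step.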
ML 10 years ago
● Work goes into feature engineering
● Bigram models, POS tags, parse trees … whatever helps
Deep learning now
● Big NNs capture lots of complexity … can work directly on raw data
● Bad news for domain experts :’(
Keras
● High-level machine learning library
● API for defining neural network architecture
● Training & prediction are done in a backend:
○ TensorFlow
○ Theano
○ …
Neural network topology, in Keras
Disclaimer:
import keras

# embeddings, max_length, lstm_units and dropout_rate are defined elsewhere
model = keras.models.Sequential()
model.add(
    keras.layers.Embedding(
        embeddings.shape[0],
        embeddings.shape[1],
        input_length=max_length,
        trainable=False,
        weights=[embeddings],
    )
)
model.add(keras.layers.Bidirectional(keras.layers.LSTM(lstm_units)))
model.add(keras.layers.Dropout(dropout_rate))
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])
Let’s try our model:
“Perfect” → 97
“Beautiful hotel” → 95
“Good hotel” → 84
“Could have been better” → 65
“Hotel was not beautiful …” → 51
“Right in the middle of Munich” → 89
“Right in the middle of Bagdad” → 89
Trained on 1M review titles.
Mean squared error: 12/100
Try for yourself:
Code on GitHub
ML @ TrustYou
● gensim doc2vec model to create hotel embedding
● Used – together with other features – for various hotel-level classifiers
Workflow Management & Scaling Up
Hadoop:
… slow & massive
Python on Hadoop:
… possible, but not natural
Spark
● Distributed computing framework
● User writes driver program which transparently schedules execution in a cluster
● Faster and more expressive than MapReduce
Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *
Let’s try Spark!
import operator as op, re
# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
years_hist = sc.textFile("blame") \
    .flatMap(lambda line: re.findall(year_re, line)) \
    .map(lambda year: (year, 1)) \
    .reduceByKey(op.add)
output = years_hist.collect()
What happened here?
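One way to see what the Spark job computes is to run the same logic locally on a few lines. The sketch below is a single-process equivalent in plain Python, not Spark code: the function name years_histogram is made up for this example, and the sample lines are taken from the git blame output shown earlier. In Spark, the transformations (flatMap, map, reduceByKey) are lazy and are distributed across the cluster only when collect() is called.

```python
import re
from collections import Counter

YEAR_RE = re.compile(r"(\d{4})-\d{2}-\d{2}")

def years_histogram(lines):
    counts = Counter()
    for line in lines:
        # flatMap step: each line yields zero or more year strings.
        for year in YEAR_RE.findall(line):
            # map + reduceByKey steps: emit (year, 1) and sum the ones per year.
            counts[year] += 1
    return sorted(counts.items())

sample = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)",
]
histogram = years_histogram(sample)
print(histogram)  # [('1990', 2), ('1991', 1)]
```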
Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
import luigi

class MyTask(luigi.Task):
    def output(self):
        # LocalTarget is a concrete target; the abstract luigi.Target cannot be instantiated
        return luigi.LocalTarget("/to/make/this/file")
    def requires(self):
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg")
        ]
    def run(self):
        # ... then ...
        # I do this to make it!
        pass
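The dependency resolution and "resume failed jobs" behaviour can be sketched in a few lines of plain Python. This is a toy model, not Luigi's API: ToyTask and build are hypothetical names, and the task names (crawl, nlp, summary) are made up. It relies on the same idea Luigi uses by default, namely that a task counts as complete when its output already exists.

```python
import os
import tempfile

class ToyTask:
    """Hypothetical stand-in for a workflow task; not Luigi's API."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.path = os.path.join(workdir, name + ".done")
        self.requires = list(requires)

    def complete(self):
        # A task is complete when its output file exists.
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as f:
            f.write("ok\n")

def build(task, log):
    # Skip finished work, otherwise build dependencies first, then the task itself.
    if task.complete():
        return
    for dep in task.requires:
        build(dep, log)
    task.run()
    log.append(task.name)

with tempfile.TemporaryDirectory() as workdir:
    crawl = ToyTask("crawl", workdir)
    nlp = ToyTask("nlp", workdir, requires=[crawl])
    summary = ToyTask("summary", workdir, requires=[nlp])

    first_run = []
    build(summary, first_run)
    second_run = []
    build(summary, second_run)

print(first_run)   # ['crawl', 'nlp', 'summary']
print(second_run)  # []
```

The second call does nothing because every output already exists. If a run had failed halfway, only the tasks without output would be executed on the next attempt.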
https://github.com/trustyou/tyluigiutils
Utilities for getting Luigi, Spark and virtualenv to work together
We’re hiring data scientists and software engineers!
http://www.trustyou.com/careers/
steffen@trustyou.com
