NLP at Scale
TrustYou Review Summaries
Steffen Wenz, CTO
@tyengineering
Smart Data Meetup Sep 2017
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.
✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe »
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag,
KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
steffen@trustyou.com
● Studied CS here in Munich
● Joined TrustYou in 2008 as working student …
● First product manager, then CTO since 2012
● Manages very diverse tech stack and team of
30 engineers:
○ Data engineers
○ Data scientists
○ Web developers
TrustYou Architecture
TrustYou ♥ Spark + Python
NLP
Text Generation
Machine Learning
Aggregation
Crawling API
3M new reviews per week!
Extracting Meaning from Text
Typical NLP Pipeline
Raw text → Sentence splitting → Tokenization → Part-of-speech tagging → Parsing → Structured data!
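The first two stages of such a pipeline can be sketched with nothing but the standard library. This is a deliberately naive illustration, not how spaCy or TrustYou's own framework does it: the regular expressions and the function names `split_sentences` and `tokenize` are assumptions made up for this example.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naively split a sentence into word tokens and punctuation tokens."""
    # A word may carry one apostrophe suffix ("I'll"); anything else that is
    # neither a word character nor whitespace becomes its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


review = "This hotel is truly huge and beautiful. I'll be back for sure"
# "Structured data" here is just a list of token lists, one per sentence.
structured = [tokenize(sentence) for sentence in split_sentences(review)]
```

Real tokenizers handle abbreviations, contractions and languages without whitespace between words, which is why a dedicated library is used for the later stages.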
spaCy
● NLP library
● Implements NLP pipelines for English, German + others
● Focus on performance and production use
○ Largely implemented in Cython … heard of it? :)
● Plays well with machine learning libraries
● Unlike NLTK, which is more for educational use, and
sees few updates these days …
import spacy
nlp = spacy.load("en")
doc = nlp("This hotel is truly huge and beautiful. I'll be back for sure")
for word in doc:
    print(word)
doc = nlp("I'll code code")
for word in doc:
    print(word.text, word.lemma_, word.pos_)
# I -PRON- PRON
# 'll will VERB
# code code VERB
# code code NOUN
Dependency parsing
Try “displaCy” yourself
● “Nice room”
● “Room wasn’t so great”
● “อาหารรสชาติดี”
● “خدمة جيدة”
● Custom NLP framework,
extension of NLTK
● Supports 20 languages
natively!
● Custom,
domain-specific tagging
and parsing
Semantic Analysis at TrustYou
Let’s do some ML!
Hm, how to model text as input for ML?
● Enter Word vectors!
● Goal: Find a mapping word → high-dimensional vector
where similar words have vectors close together
● “Woman” is close to “lady” is close to “womna”
● Word2vec is an algorithm to produce such embeddings
woman, lady, dude = nlp("woman lady dude")
woman.similarity(lady) # 0.78
woman.similarity(dude) # 0.40
● Word2vec considers words to be similar if they occur in
similar contexts, i.e. typically have the same words
before/after them
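"Close together" is usually measured with cosine similarity, which is also what spaCy's `similarity` returns by default. A minimal sketch, assuming made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and are learned from text, not written by hand):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration only.
vectors = {
    "woman": [0.9, 0.8, 0.1],
    "lady": [0.8, 0.9, 0.2],
    "dude": [0.1, 0.3, 0.9],
}

close = cosine_similarity(vectors["woman"], vectors["lady"])
far = cosine_similarity(vectors["woman"], vectors["dude"])
```

With these toy values, `close` comes out much higher than `far`, mirroring the ordering of the spaCy scores on the slide.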
(Somewhat Pointless) Application
Goal: Predict review overall score just from title!
Input (here, word vectors)
Output (here, review score, so just one node)
Training = rejiggering the weights of these arrows,
trying to closely match training data
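What "rejiggering the weights" means can be shown on the smallest possible network: one output node with a linear activation, trained by gradient descent on the squared error. This is a sketch under stated assumptions, not the model from the talk: the 2-dimensional inputs, the target scores (scaled to 0..1), the learning rate and the function names are all invented for illustration.

```python
def predict(weights, bias, x):
    """Output of a single linear node: weighted sum of the inputs plus bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train(samples, targets, lr=0.1, epochs=200):
    """Nudge weights and bias after each sample to shrink the squared error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            error = predict(weights, bias, x) - y
            # Gradient step: each weight moves against error * its input.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical toy data: 2-d stand-ins for title vectors, scores in 0..1.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
targets = [0.9, 0.3, 0.7, 0.5]

weights, bias = train(samples, targets)
predictions = [predict(weights, bias, x) for x in samples]
```

After training, each entry of `predictions` sits close to its target, because this toy data can be matched exactly by a linear node. Larger networks repeat the same idea across many layers of weights.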
ML 10 years ago
● Work goes into feature
engineering
● Bigram models, POS
tags, parse trees …
whatever helps
Deep learning now
● Big NNs capture lots of
complexity … can work
directly on raw data
● Bad news for domain
experts :’(
Keras
● High-level machine learning library
● API for defining neural network architecture
● Training & prediction is done in a backend:
○ Tensorflow
○ Theano
○ …
Neural network topology, in Keras
Disclaimer:
import keras

# Assumed to be defined beforehand: embeddings (2-D array of pretrained word
# vectors), max_length, lstm_units, dropout_rate.
model = keras.models.Sequential()
model.add(
    keras.layers.Embedding(
        embeddings.shape[0],
        embeddings.shape[1],
        input_length=max_length,
        trainable=False,
        weights=[embeddings],
    )
)
model.add(keras.layers.Bidirectional(keras.layers.LSTM(lstm_units)))
model.add(keras.layers.Dropout(dropout_rate))
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])
Let’s try our model:
“Perfect” → 97
“Beautiful hotel” → 95
“Good hotel” → 84
“Could have been better” → 65
“Hotel was not beautiful …” → 51
“Right in the middle of Munich” → 89
“Right in the middle of Baghdad” → 89
Trained on 1M review titles.
Mean squared error: 12/100
Try for yourself:
Code on GitHub
ML @ TrustYou
● gensim doc2vec model
to create hotel
embedding
● Used – together with
other features – for
various hotel-level
classifiers
Workflow Management
& Scaling Up
Hadoop:
… slow & massive
Python on Hadoop:
… possible, but not natural
Spark
● Distributed computing framework
● User writes driver program which transparently
schedules execution in a cluster
● Faster and more expressive than MapReduce
Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *
Let’s try Spark!
import operator as op, re
# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
years_hist = sc.textFile("blame") \
    .flatMap(lambda line: re.findall(year_re, line)) \
    .map(lambda year: (year, 1)) \
    .reduceByKey(op.add)
output = years_hist.collect()
What happened here?
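One way to see what the job computes is to run the same three steps on a single machine. The sketch below is a plain-Python stand-in, not Spark: the function name `years_histogram` is an assumption, and the sample lines are taken from the `head blame` output above. On a cluster, Spark applies the same per-line logic to partitions of the file in parallel and merges the per-key sums.

```python
import re
from collections import Counter

year_re = r"(\d{4})-\d{2}-\d{2}"


def years_histogram(lines):
    """Count how many blame lines were last changed in each year."""
    counts = Counter()
    for line in lines:
        # flatMap: every line yields zero or more year strings.
        for year in re.findall(year_re, line):
            # map + reduceByKey: emit (year, 1) and sum the ones per year.
            counts[year] += 1
    return dict(counts)


sample = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no",
    "badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include \"pg",
]
histogram = years_histogram(sample)
```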
● Build complex pipelines of
batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
Luigi
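import luigi

class MyTask(luigi.Task):

    def output(self):
        # luigi.Target is abstract; LocalTarget is its local-file implementation
        return luigi.LocalTarget("/to/make/this/file")

    def requires(self):
        # INeedThisTask and AndAlsoThisTask are placeholders for other tasks
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg"),
        ]

    def run(self):
        # ... then ...
        # I do this to make it! (placeholder body: write the output file)
        with self.output().open("w") as out:
            out.write("done\n")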
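Dependency resolution and resuming failed jobs both come down to one rule: run a task's requirements first, and skip anything whose output already exists. The sketch below shows that rule in plain Python. It is not Luigi's scheduler; the function `run_order`, the task names and the `already_done` argument are hypothetical, and cycle detection is left out for brevity.

```python
def run_order(target, requires, already_done=()):
    """Return the tasks to run, dependencies first, skipping finished ones."""
    done = set(already_done)
    order = []

    def visit(task):
        if task in done:
            return
        # Depth-first: make sure every requirement is scheduled before the task.
        for dependency in requires.get(task, []):
            visit(dependency)
        done.add(task)
        order.append(task)

    visit(target)
    return order


# Hypothetical pipeline: each task maps to the tasks it requires.
requires = {
    "summary": ["nlp", "aggregate"],
    "aggregate": ["nlp"],
    "nlp": ["crawl"],
    "crawl": [],
}

full_run = run_order("summary", requires)
resumed_run = run_order("summary", requires, already_done=("crawl", "nlp"))
```

In the resumed case only the tasks after the failure point are scheduled again, which is the behaviour the "Resume failed jobs" bullet refers to.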
https://github.com/trustyou/tyluigiutils
Utilities for getting Luigi, Spark and virtualenv to work
together
We’re hiring data scientists and software engineers!
http://www.trustyou.com/careers/
steffen@trustyou.com
