Helping Travellers Make
Better Hotel Choices
500 Million Times a Month
Miguel Cabrera
@mfcabrera
https://www.flickr.com/photos/18694857@N00/5614701858/
ABOUT ME
•  Neuberliner
•  Systems and Informatics Engineering, Universidad Nacional - Medellín
•  M.Sc. in Informatics, TUM; Honours Technology
Management
•  Work for TrustYou as Data (Scientist|Engineer|
Juggler)™
•  Founder and former organizer of Munich DataGeeks
TODAY
•  What we do
•  Architecture
•  Technology
•  Crawling
•  Textual Processing
•  Workflow Management and Scale
•  Sample Application
AGENDA
WHAT WE DO
For every hotel on the planet, provide
a summary of traveler reviews.
•  Crawling
•  Natural Language Processing / Semantic
Analysis
•  Record Linkage / Deduplication
•  Ranking
•  Recommendation
•  Classification
•  Clustering
Tasks
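One of the tasks above, record linkage / deduplication, comes down to deciding whether two crawled hotel records refer to the same property. A minimal sketch using fuzzy name matching with the standard library; the hotel names and the similarity threshold are illustrative, not TrustYou's actual matching rules.

```python
# Hedged sketch: flag likely duplicate hotel records by fuzzy name matching.
# Real record linkage also compares address, geo-coordinates, etc.
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip punctuation so cosmetic differences don't matter."""
    return "".join(ch for ch in name.lower()
                   if ch.isalnum() or ch.isspace()).strip()

def is_duplicate(a, b, threshold=0.85):
    """Treat two records as the same property if their normalized
    names are similar enough (threshold chosen for illustration)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

same = is_duplicate("Hotel Munchen City", "hotel munchen city!")
different = is_duplicate("Hotel Munchen City", "Beachside Resort Lima")
```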
ARCHITECTURE
Data Flow
Crawling → Semantic Analysis → Database → API → Clients
•  Clients: Google, Kayak+, TY Analytics
•  Batch Layer (Hadoop Cluster): Hadoop, Python, Pig*, Java*
•  Service Layer (Application Machines): PostgreSQL, MongoDB, Redis, Cassandra
Stack
SOME NUMBERS
25 supported languages
500,000+ Properties
30,000,000+ daily crawled
reviews
Deduplicated against 250,000,000+
reviews
300,000+ daily new reviews
https://www.flickr.com/photos/22646823@N08/2694765397/
Lots of text
TECHNOLOGY
•  NumPy
•  NLTK
•  scikit-learn
•  pandas
•  IPython / Jupyter
•  Scrapy
Python
•  Hadoop Streaming
•  MRJob
•  Oozie
•  Luigi
•  …
Python + Hadoop
Crawling
Crawling
•  Build your own web crawlers
•  Extract data via CSS selectors, XPath,
regexes, etc.
•  Handles queuing, request parallelism,
cookies, throttling …
•  Comprehensive and well-designed
•  Commercial support by
http://scrapinghub.com/
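At its core, a Scrapy spider fetches pages and extracts fields via selectors. To keep this sketch dependency-free, the selection step is imitated below with the standard library's limited XPath support on a toy, well-formed page; Scrapy's real API (`scrapy.Spider`, `response.css()` / `response.xpath()`) is far richer, and the page markup and field names here are invented.

```python
# Dependency-free sketch of the extraction step a Scrapy spider performs.
from xml.etree import ElementTree

HTML = """
<html><body>
  <div class="review"><span class="title">Great hotel</span><span class="score">5</span></div>
  <div class="review"><span class="title">Rooms are terrible</span><span class="score">1</span></div>
</body></html>
"""

def parse_reviews(page):
    """Yield one dict per review block, like a spider's parse() callback."""
    root = ElementTree.fromstring(page)
    for div in root.iterfind(".//div[@class='review']"):
        yield {
            "title": div.find("span[@class='title']").text,
            "score": int(div.find("span[@class='score']").text),
        }

reviews = list(parse_reviews(HTML))
```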
•  2 - 3 million new reviews/week
•  Customers want alerts 8 - 24h after review
publication!
•  Smart crawl frequency & depth, but still high
overhead
•  Pools of constantly refreshed EC2 proxy IPs
•  Direct API connections with many sites
Crawling at TrustYou
•  Custom framework very similar to Scrapy
•  Runs on Hadoop cluster (100 nodes)
•  Not 100% suitable for MapReduce
•  Nodes mostly waiting
•  Coordination/messaging between nodes
required:
–  Distributed queue
–  Rate Limiting
Crawling at TrustYou
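The rate limiting mentioned above can be sketched as a token bucket: each domain gets a budget of requests that refills over time. This in-process version is only an illustration; the setup described in the slides coordinates limits across cluster nodes via a distributed queue.

```python
# Hedged sketch of per-domain rate limiting for a crawler (in-process only).
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, then spend one token if possible.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=2)
# First two requests pass on the initial burst, the third is throttled.
results = [bucket.allow(), bucket.allow(), bucket.allow()]
```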
Text Processing
Text Processing
Raw text → Sentence splitting → Tokenizing → Stopword removal → Stemming
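The pipeline above can be sketched in a few lines. The stopword list and suffix stripping below are toy stand-ins for NLTK's real tokenizers and stemmers (e.g. `PorterStemmer`), just to make the stages concrete.

```python
# Minimal sketch of the text processing pipeline: split, tokenize,
# remove stopwords, stem. All rules here are simplified for illustration.
import re

STOPWORDS = {"the", "is", "are", "a", "an", "and"}

def sentence_split(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Naive suffix stripping; a real stemmer handles far more cases.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def process(text):
    return [[stem(t) for t in remove_stopwords(tokenize(s))]
            for s in sentence_split(text)]

out = process("The rooms are great. Hotel is terrible!")
```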
Topic Models
Word Vectors
Classification
Text Processing
•  “great rooms”
•  “great hotel”
•  “rooms are terrible”
•  “hotel is terrible”
Text Processing
JJ NN
JJ NN
NN VB JJ
NN VB JJ

>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
•  25+ languages
•  Linguistic system (morphology, taggers,
grammars, parsers …)
•  Hadoop: Scale out CPU
•  ~1B opinions in the database
•  Python for ML & NLP libraries
Semantic Analysis
Word2Vec/Doc2Vec
Group of algorithms
An instance of shallow learning
Feature learning model
Generates real-valued vector
representations of words
“king” – “man” + “woman” = “queen”
Word2Vec
Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
Similar words/documents are nearby
vectors
Word2Vec offers a similarity metric
for words
Can be extended to paragraphs and
documents
A fast Python-based implementation is
available via Gensim
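The idea that similar words sit near each other as vectors can be shown with toy numbers. The 3-d vectors below are hand-made for illustration; a real model (e.g. Gensim's `Word2Vec`) learns hundreds of dimensions from data and exposes the same query as `model.wv.most_similar()`.

```python
# Toy illustration of "similar words are nearby vectors" via cosine similarity.
import math

VECTORS = {  # hand-made 3-d vectors, NOT learned embeddings
    "good":     [0.9, 0.1, 0.0],
    "great":    [0.8, 0.2, 0.1],
    "terrible": [-0.9, 0.1, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word):
    """Return the nearest other word by cosine similarity."""
    return max((w for w in VECTORS if w != word),
               key=lambda w: cosine(VECTORS[word], VECTORS[w]))
```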
Workflow Management and Scale
Crawl → Extract → Clean → Stats → NLP → ML
Luigi
“A Python framework for data
flow definition and execution”
Luigi
•  Build complex pipelines of
batch jobs
•  Dependency resolution
•  Parallelism
•  Resume failed jobs
•  Some support for Hadoop
Luigi
Luigi
•  Dependency definition
•  Hadoop / HDFS Integration
•  Object oriented abstraction
•  Parallelism
•  Resume failed jobs
•  Visualization of pipelines
•  Command line integration
Minimal Boilerplate Code

class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return [InputText(self.date)]

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        with self.output().open('w') as out:
            for word, n in count.items():
                out.write("%s\t%d\n" % (word, n))

The same task, highlighted slide by slide:
•  Task Parameters: date = luigi.DateParameter()
•  Programmatically Defined Dependencies: requires()
•  Each Task produces an output: output()
•  Write Logic in Python: run()
Hadoop
https://www.flickr.com/photos/12914838@N00/15015146343/
Hadoop = Java?
Hadoop
Streaming
cat input.txt | ./map.py | sort | ./reduce.py > output.txt
Hadoop
Streaming
hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
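What mapper.py and reducer.py contain for a word count can be sketched with plain generators. The streaming runtime feeds lines on stdin and sorts between the two stages; here the `cat | map | sort | reduce` pipeline is simulated in-process, with invented sample input.

```python
# Sketch of Hadoop Streaming word count: mapper emits (word, 1) pairs,
# the framework sorts them, the reducer sums per key.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Streaming guarantees pairs arrive sorted by key, so groupby works.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(v for _, v in group)

lines = ["to be or not", "to be"]
counts = dict(reducer(sorted(mapper(lines))))
```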
class WordCount(luigi.hadoop.JobTask):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.hdfs.HdfsTarget('%s' % self.date)

    def mapper(self, line):
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)
Luigi + Hadoop/HDFS
Go and learn:
Data Flow Visualization
Before
•  Bash scripts + Cron
•  Manual cleanup
•  Manual failure recovery
•  Hard(er) to debug
Now
•  Complex nested Luigi jobs graphs
•  Automatic retries
•  Still hard to debug
We use it for…
•  Standalone executables
•  Dump data from databases
•  General Hadoop Streaming
•  Bash Scripts / MRJob
•  Pig* Scripts
You can wrap anything
Sample Application
Reviews are boring…
Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html
Reviews highlight the individuality
and personality of users
Snippets from Reviews
“Hips don’t lie”
“Maid was banging”
“Beautiful bowl flowers”
“Irish dance, I love that”
“No ghost sighting”
“One ghost touching”
“Too much cardio, not enough squats in the gym”
“it is like hugging a bony super model”
Hotel Reviews + Gensim + Python +
Luigi = ?
ExtractSentences
LearnBigrams
LearnModel
ExtractClusterIds
UploadEmbeddings
Pig
from gensim.models.doc2vec import Doc2Vec

class LearnModelTask(luigi.Task):
    # Parameters.... blah blah blah

    def output(self):
        return luigi.LocalTarget(os.path.join(self.output_directory,
                                              self.model_out))

    def requires(self):
        return LearnBigramsTask()

    def run(self):
        sentences = LabeledClusterIDSentence(self.input().path)
        model = Doc2Vec(sentences=sentences,
                        size=int(self.size),
                        dm=int(self.distmem),
                        negative=int(self.negative),
                        workers=int(self.workers),
                        window=int(self.window),
                        min_count=int(self.min_count),
                        train_words=True)
        model.save(self.output().path)
Word2Vec/Doc2Vec offer a similarity
metric for words and documents
Similarities are useful for non-
personalized recommender systems
Non-personalized recommenders
recommend items based on what
other consumers have said about the
items.
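Such a non-personalized recommender can be sketched as: represent each hotel by a vector aggregated from what reviewers said about it, then rank other hotels by similarity to one the traveller liked. The hotel names and vectors below are invented for illustration; the real system would use the learned Doc2Vec embeddings.

```python
# Hedged sketch of a non-personalized, item-similarity recommender.
import math

PROFILES = {  # toy "what reviewers said" vectors per hotel (illustrative)
    "Hotel Alpha": [0.9, 0.1, 0.0],
    "Hotel Beta":  [0.8, 0.3, 0.1],
    "Hotel Gamma": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recommend(liked, k=1):
    """Rank all other hotels by similarity to the one the traveller liked."""
    others = [h for h in PROFILES if h != liked]
    return sorted(others,
                  key=lambda h: cosine(PROFILES[liked], PROFILES[h]),
                  reverse=True)[:k]
```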
http://demo.trustyou.com
Takeaways
Takeaways
•  It is possible to use Python as the primary
language for large-scale data processing on
Hadoop.
•  It is not a perfect setup but works well most of
the time.
•  Keep your ecosystem open to other
technologies.
We are hiring
miguel.cabrera@trustyou.net
Questions?

Helping Travellers Using 500 Million Hotel Reviews a Month