PyData Berlin Meetup

Helping travelers make better hotel choices - 500 million times a month

TrustYou analyzes online hotel reviews to create a summary for every hotel in the world. What do travelers think of the service? Is this hotel suitable for business travelers? TrustYou data is integrated on countless websites (Trivago, Wego, Kayak), helping travelers make better choices. Try it out yourself on http://www.trust-score.com/

TrustYou runs almost exclusively on Python. Every week, we find 3 million new hotel reviews on the web, process them, analyze the text using Natural Language Processing, and update our database of 600,000 hotels. In this talk, Steffen will give insights into how Python is used at TrustYou to collect, analyze and visualize these large amounts of data.


  1. Helping travelers make better hotel choices, 500 million times a month* (Steffen Wenz, CTO TrustYou)
  2. What does TrustYou do? For every hotel on the planet, provide a summary of traveler reviews.
  3. ✓ Excellent hotel!
  4. ✓ Excellent hotel!
     ✓ Nice building (“Clean, hip & modern, excellent facilities”)
     ✓ Great view (« Vue superbe », French for “superb view”)
  5. ✓ Excellent hotel!*
     ✓ Nice building (“Clean, hip & modern, excellent facilities”)
     ✓ Great view (« Vue superbe », French for “superb view”)
     ✓ Great for partying (“Nice weekend getaway or for partying”)
     ✗ Solo travelers complain about TVs
     ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
     *) nhow Berlin (Full summary)
  6. TrustYou architecture: Crawling → Semantic Analysis → DB → TrustYou Analytics API (200 million requests/month) → Kayak, …
  7. Crawling
  8. Basic crawling setup: seed URLs (/find?q=Berlin, /find?q=Munich) feed a frontier of
     discovered URLs (/meetup/BerlinPyData, /meetup/BerlinCyclists, /find?q=Munich&page=2,
     /meetup/BerlinPolitics, /find?q=Munich&page=3, …)
  9. … if only it were so easy: in practice the frontier fills up with endless pagination
     (/find?q=Munich&page=99999999, …) and junk links (facebok.com/meetup)
  10. Scrapy
      ● Build your own web crawlers
        ○ Extract data via CSS selectors, XPath, regexes …
        ○ Handles queuing, request parallelism, cookies, throttling …
      ● Comprehensive and well-designed
      ● Commercial support by http://scrapinghub.com/
  11. Intro to Scrapy

      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule

      class MySpider(CrawlSpider):
          name = "my_spider"

          # start with this URL
          start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]

          # follow these URLs, and call self.parse_meetup to extract data from them
          rules = [
              Rule(LinkExtractor(allow=[
                  "^http://www.meetup.com/[^/]+/$",
              ]), callback="parse_meetup"),
          ]

          def parse_meetup(self, response):
              # Extract data about the meetup from the HTML response
              m = MeetupItem()
              yield m
  12. Try it out!

      $ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
      {"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
      {"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
      {"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
      {"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
      {"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
      {"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
      ...

      Full code on GitHub, dump of all Berlin meetups (note: Meetup also has an API …)
  13. Number of registered meetups (chart)
  14. Crawling at TrustYou scale
      ● 2-3 million new reviews/week
      ● Customers want alerts 8-24h after review publication!
      ● Smart crawl frequency & depth, but still high overhead
      ● Pools of constantly refreshed EC2 proxy IPs
      ● Direct API connections with many sites
  15. Crawling at TrustYou scale
      ● Custom framework, very similar to Scrapy
      ● Runs on a Hadoop cluster (100 nodes)
      ● … though the problem is not 100% suitable for MapReduce
        ○ Nodes mostly waiting
        ○ Coordination/messaging between nodes required:
          ■ Distributed queue
          ■ Rate limiting (see the sketch below)
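
      As a rough illustration of the rate-limiting piece, here is a minimal single-process
      token bucket in Python; the class and the rate/capacity values are illustrative, not
      TrustYou's actual distributed implementation:

      import time

      class TokenBucket:
          """Allow bursts of up to `capacity` requests, refilling `rate` tokens/second."""

          def __init__(self, rate, capacity):
              self.rate = rate
              self.capacity = capacity
              self.tokens = capacity
              self.last = time.monotonic()

          def acquire(self):
              now = time.monotonic()
              # Refill proportionally to the time elapsed, capped at capacity
              self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
              self.last = now
              if self.tokens >= 1:
                  self.tokens -= 1
                  return True
              return False

      # One bucket per host keeps the crawler polite: ~2 requests/second here
      bucket = TokenBucket(rate=2, capacity=5)
      while not bucket.acquire():
          time.sleep(0.1)
      # ... perform the HTTP request once acquire() succeeds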
  16. Textual Data
  17. Treating textual data: raw text → sentence splitting → tokenization → stopword filtering → stemming
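
      For illustration, the whole pipeline above fits in a few lines of NLTK; the sample
      text, the stopword list, and the choice of the Porter stemmer are assumptions for
      this sketch, not necessarily what TrustYou uses:

      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import PorterStemmer

      # One-time setup: nltk.download("punkt"); nltk.download("stopwords")
      raw = "The rooms were spotless. The staff could not have been friendlier!"
      stemmer = PorterStemmer()
      stop = set(stopwords.words("english"))

      for sentence in nltk.sent_tokenize(raw):                  # sentence splitting
          tokens = nltk.word_tokenize(sentence)                 # tokenization
          content = [t.lower() for t in tokens
                     if t.isalpha() and t.lower() not in stop]  # stopword filtering
          print([stemmer.stem(t) for t in content])             # stemming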
  18. Tokenization

      >>> import nltk
      >>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin."
      >>> nltk.sent_tokenize(raw)
      ['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
      >>> nltk.word_tokenize(raw)
      ['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
  19. Grammars and Parsing

      “great rooms” (JJ NN)            “great hotel” (JJ NN)
      “rooms are terrible” (NN VB JJ)  “hotel is terrible” (NN VB JJ)

      >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
      [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
  20. Grammars and Parsing

      >>> grammar = nltk.CFG.fromstring("""
      ... OPINION -> NN COP JJ
      ... OPINION -> JJ NN
      ... NN -> 'hotel' | 'rooms'
      ... COP -> 'is' | 'are'
      ... JJ -> 'great' | 'terrible'
      ... """)
      >>> parser = nltk.ChartParser(grammar)
      >>> sent = nltk.word_tokenize("great rooms")
      >>> for tree in parser.parse(sent):
      ...     print(tree)
      (OPINION (JJ great) (NN rooms))
  21. WordNet

      >>> from nltk.corpus import wordnet as wn
      >>> wn.morphy('coded', wn.VERB)
      'code'
      >>> wn.synsets("python")
      [Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
      >>> wn.synset('python.n.01').hypernyms()
      [Synset('boa.n.02')]
      >>> # meh :/
  22. Semantic Analysis at TrustYou
      ● “Nice room”
      ● “Room wasn‘t so great”
      ● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
      ● “อาหารรสชาติดี” (Thai: “The food tastes good”)
      ● “خدمة جيدة” (Arabic: “Good service”)
      ● 20 languages
      ● Linguistic system (morphology, taggers, grammars, parsers …)
      ● Hadoop: scale out CPU
        ○ ~1B opinions in the DB
      ● Python for ML & NLP libraries
  23. Word2Vec
      ● Map words to vectors
      ● A “step up” from the bag-of-words model
      ● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts

      >>> m["python"]
      array([-0.1351, -0.1040, -0.0823, -0.0287,  0.3709, -0.0200, -0.0325,
              0.0166,  0.3312, -0.0928, -0.0967, -0.0199, -0.2498, -0.4445,
             -0.0445,
             # ...
             -1.0090, -0.2553,  0.2686, -0.4121,  0.3116, -0.0639, -0.3688,
             -0.0273, -0.1266, -0.2606, -0.1549,  0.0023,  0.0084,  0.2169,
              0.0060], dtype=float32)
  24. Fun with Word2Vec

      >>> # trained from 100k meetup descriptions!
      >>> m = gensim.models.Word2Vec.load("data/word2vec")
      >>> m.most_similar(positive=["python"])[:3]
      [(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
      >>> m.doesnt_match(["python", "c++", "javascript"])
      'c++'
      >>> m.most_similar(positive=["berlin"])[:3]
      [(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
      >>> m.most_similar(positive=["ladies"])[:3]
      [(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
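
      Training a model like the one loaded above takes only a few lines with gensim; the
      corpus below is a stand-in, and note the dimension parameter was called `size` in
      gensim versions of that era (`vector_size` in gensim >= 4.0):

      import gensim

      # Stand-in corpus: one token list per meetup description
      sentences = [
          ["python", "data", "meetup", "berlin"],
          ["javascript", "frontend", "meetup", "berlin"],
          # ... ~100k of these for the model in the talk
      ]

      m = gensim.models.Word2Vec(sentences, vector_size=100, window=5, min_count=1)
      m.save("data/word2vec")
      print(m.wv.most_similar(positive=["python"])[:3])  # m.most_similar(...) in older gensim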
  25. ML @ TrustYou
      ● gensim doc2vec model to create hotel embeddings
      ● Used, together with other features, for various classifiers (see the sketch below)
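
      The talk doesn't show code for the hotel embedding, but the idea maps onto gensim's
      doc2vec roughly as follows; the hotel IDs, tokens, and hyperparameters are made up
      for this sketch:

      from gensim.models.doc2vec import Doc2Vec, TaggedDocument

      # One TaggedDocument per hotel: its review tokens, tagged with the hotel ID
      docs = [
          TaggedDocument(words=["clean", "modern", "great", "view"], tags=["hotel_1"]),
          TaggedDocument(words=["noisy", "rooms", "unfriendly", "staff"], tags=["hotel_2"]),
      ]

      model = Doc2Vec(docs, vector_size=64, min_count=1, epochs=40)

      # The per-hotel vector can then be fed, with other features, into classifiers
      vec = model.dv["hotel_1"]  # model.docvecs["hotel_1"] in gensim < 4.0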
  26. Workflow Management & Scaling Up
  27. Luigi
      ● Build complex pipelines of batch jobs
        ○ Dependency resolution
        ○ Parallelism
        ○ Resume failed jobs
      ● Some support for Hadoop
      ● Pythonic replacement for Oozie
      ● Can be combined with Pig, Hive
  28. Luigi tasks vs. Makefiles

      import luigi

      class MyTask(luigi.Task):
          def requires(self):
              return DependentTask()

          def output(self):
              return luigi.LocalTarget("data/my_task_output")

          def run(self):
              with self.output().open("w") as out:
                  out.write("foo")

      The equivalent Makefile rule:

      data/my_task_output: DependentTask
          run
          run
          run
          ...
  29. Example: wrap a crawl in a Luigi task

      import os
      import subprocess

      import luigi

      class CrawlTask(luigi.Task):
          city = luigi.Parameter()

          def output(self):
              output_path = os.path.join("data", "{}.jsonl".format(self.city))
              return luigi.LocalTarget(output_path)

          def run(self):
              # Write to a temp file, then rename: the output only exists if the crawl succeeded
              tmp_output_path = self.output().path + "_tmp"
              subprocess.check_output(["scrapy", "crawl", "city", "-a",
                  "city={}".format(self.city), "-o", tmp_output_path, "-t", "jsonlines"])
              os.rename(tmp_output_path, self.output().path)
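
      Assuming the task above lives in a file crawl_task.py (the file name is illustrative)
      with a standard entry point, it can be run from the shell much like the earlier
      Scrapy example:

      if __name__ == "__main__":
          luigi.run()

      $ python crawl_task.py CrawlTask --city Berlin --local-scheduler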
  30. Luigi dependency graphs
  31. Hadoop!
      ● MapReduce: programming model for distributed computation problems
      ● Express your algorithm as a sequence of operations:
        a. Map: do a linear pass over your data, emit (k, v)
        b. (Distributed sort)
        c. Reduce: linear pass over all (k, v) for the same k
      ● Python on Hadoop: Hadoop streaming, MRJob, Luigi (just go learn PySpark instead)
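
      To make the model concrete, here is the canonical word count as a pair of Hadoop
      streaming scripts; the file names and input are conventional examples, not from the
      talk. Hadoop pipes input lines into the mapper's stdin, sorts its (k, v) output by
      key, and pipes that into the reducer:

      # mapper.py: emit (word, 1) for every word in the input
      import sys

      for line in sys.stdin:
          for word in line.split():
              print("%s\t%d" % (word, 1))

      # reducer.py: sum the counts per word (keys arrive sorted)
      import sys

      current_word, count = None, 0
      for line in sys.stdin:
          word, n = line.rstrip("\n").split("\t")
          if word != current_word:
              if current_word is not None:
                  print("%s\t%d" % (current_word, count))
              current_word, count = word, 0
          count += int(n)
      if current_word is not None:
          print("%s\t%d" % (current_word, count))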
  32. Luigi Hadoop integration

      import luigi
      import luigi.hadoop

      class HadoopTask(luigi.hadoop.JobTask):
          def output(self):
              return luigi.HdfsTarget("output_in_hdfs")

          def requires(self):
              return {
                  "some_task": SomeTask(),
                  "some_other_task": SomeOtherTask()
              }

          def mapper(self, line):
              key, value = line.rstrip().split("\t")
              yield key, value

          def reducer(self, key, values):
              yield key, ", ".join(values)
  33. Luigi Hadoop integration: what happens when the task above runs
      1. Your input data is sitting in the distributed file system (HDFS)
      2. Luigi creates a .tar.gz of your code; Hadoop distributes it to the machines
      3. mapper() gets run (distributed)
      4. Data gets re-sorted by key
      5. reducer() gets run (distributed)
      6. Output gets saved in HDFS
  34. Beyond MapReduce
      ● Batch, never real time
      ● Slow even for batch (lots of disk IO)
      ● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
      ● Spark: more complete Python support
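
      For comparison, the same word count from the Hadoop streaming example is a single
      short program in PySpark; the HDFS paths are placeholders:

      from pyspark import SparkContext

      sc = SparkContext(appName="wordcount")

      counts = (sc.textFile("hdfs:///reviews/*.txt")        # placeholder input path
                  .flatMap(lambda line: line.split())       # one record per word
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))

      counts.saveAsTextFile("hdfs:///output/wordcount")     # placeholder output path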
  35. Workflows at TrustYou
  36. Workflows at TrustYou
  37. We’re hiring! steffen@trustyou.com
