
Powered by Python - PyCon Germany 2016

Slides from my talk on how TrustYou uses Python, given at PyCon Germany 2016.

Published in: Technology

Powered by Python - PyCon Germany 2016

  1. Powered by Python: Summarizing hotel reviews for 100 million travelers. Steffen Wenz, CTO, steffen@trustyou.com
  2. 10,000 hotels use TrustYou Analytics to analyze their guest reviews. 100 million travelers see our data on Google, Hotels.com, Kayak … actually it’s probably more.
  3. Architecture ;-) Hadoop Cluster (Hortonworks Distribution), Big Data, Python, Machine Learning, NLP, Scraping, API, MagicLove
  4. Hadoop: … slow & massive
  5. Python on Hadoop: … possible, but not natural
  6. Let’s try Spark!
       $ # how old is the C code in CPython?
       $ git clone https://github.com/python/cpython && cd cpython
       $ find . -name "*.c" -exec git blame {} \; > blame
       $ head blame
       dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
       daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
       daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
       badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
       daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
       daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
       daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
       badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *
  7. Let’s try Spark!
       import operator as op, re
       # sc: SparkContext, connection to cluster
       year_re = r"(\d{4})-\d{2}-\d{2}"
       # parentheses let the chained calls span several lines
       years_hist = (sc.textFile("blame")
           .flatMap(lambda line: re.findall(year_re, line))
           .map(lambda year: (year, 1))
           .reduceByKey(op.add))
       output = years_hist.collect()
  8. What happened here?
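
One way to read the Spark job on the previous slide is as a plain, single-machine loop. The sketch below is not from the talk: `years_histogram` and `sample` are hypothetical names, and the sample lines are copied from the `git blame` output shown earlier. It shows what `flatMap`, `map` and `reduceByKey` compute. It does not show how Spark distributes the work across a cluster.

```python
import re
from collections import Counter

# Matches ISO dates such as 1991-02-19 and captures only the year.
YEAR_RE = re.compile(r"(\d{4})-\d{2}-\d{2}")


def years_histogram(lines):
    """Count occurrences of each year (hypothetical helper, not part of Spark)."""
    counts = Counter()
    for line in lines:                      # textFile: one record per line
        for year in YEAR_RE.findall(line):  # flatMap: zero or more years per line
            counts[year] += 1               # map to (year, 1), then reduceByKey(add)
    return dict(counts)


# Three lines taken from the blame output on the earlier slide.
sample = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no",
    "badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include \"pg",
]

print(years_histogram(sample))  # {'1991': 1, '1990': 2}
```

In Spark the same steps run per partition on the cluster, and `collect()` returns the result to the driver as a list of `(year, count)` tuples rather than a dict.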
  9. Grammars & Parsing. Or: Why you should have paid attention in compilers class
  10. Grammars and Parsing
        $ less Grammar/Grammar
        ...
        compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcde
        async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
        if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
        while_stmt: 'while' test ':' suite ['else' ':' suite]
        for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
        ...
      Parsing: Given an input string, determine/guess grammar production rules to generate it
  11. Grammars and Parsing
        >>> import nltk
        >>> grammar = nltk.CFG.fromstring("""
        ... OPINION -> NOUN COP ADJ
        ... OPINION -> ADJ NOUN
        ... NOUN -> 'hotel' | 'rooms'
        ... COP -> 'is' | 'are'
        ... ADJ -> 'great' | 'terrible'
        ... """)
        >>> parser = nltk.ChartParser(grammar)
        >>> sent = nltk.word_tokenize("great rooms")
        >>> for tree in parser.parse(sent):
        ...     print(tree)
        (OPINION (ADJ great) (NOUN rooms))
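
To make "determine the production rules that generate the input" concrete without NLTK, here is a minimal sketch for the same toy grammar. It is not a chart parser. It only works because this grammar is non-recursive and every word belongs to exactly one category. `LEXICON`, `OPINION_RULES` and `parse_opinion` are hypothetical names introduced for illustration.

```python
# Terminal rules of the toy grammar: word -> category.
LEXICON = {
    "hotel": "NOUN", "rooms": "NOUN",
    "is": "COP", "are": "COP",
    "great": "ADJ", "terrible": "ADJ",
}

# The two OPINION productions, written as category sequences.
OPINION_RULES = [("NOUN", "COP", "ADJ"), ("ADJ", "NOUN")]


def parse_opinion(tokens):
    """Return a nested-tuple parse tree, or None if no production matches."""
    tags = tuple(LEXICON.get(token) for token in tokens)  # unknown words become None
    if tags not in OPINION_RULES:
        return None
    return ("OPINION",) + tuple(zip(tags, tokens))


print(parse_opinion(["great", "rooms"]))
# ('OPINION', ('ADJ', 'great'), ('NOUN', 'rooms'))
print(parse_opinion(["rooms", "great"]))
# None
```

A real parser such as `nltk.ChartParser` handles recursive rules and ambiguous words, which this lookup cannot.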
  12. Word2Vec
        ● Map words to vectors
        ● “Step up” from bag-of-words model
        ● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
        >>> m["python"]
        array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
               -0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
               -0.0967, -0.0199, -0.2498, -0.4445, -0.0445, # ...
  13. Fun with Word2Vec
        >>> # trained from 100k meetup descriptions!
        >>> m = gensim.models.Word2Vec.load("data/word2vec")
        >>> m.most_similar(positive=["python"])[:3]
        [(u'javascript', 0.83), (u'php', 0.82), (u'django', 0.81)]
        >>> m.doesnt_match(["python", "c++", "javascript"])
        'c++'
        >>> m.most_similar(positive=["ladies"])[:3]
        [(u'girls', 0.81), (u'mamas', 0.74), (u'gals', 0.73)]
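
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below illustrates that idea in plain Python. The three-dimensional vectors are made up for illustration and are not taken from the talk's model, and `cosine_similarity` and `most_similar` here are hypothetical stand-ins, not gensim functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors; a trained model would use hundreds of dimensions.
vectors = {
    "python": [0.9, 0.8, 0.1],
    "javascript": [0.8, 0.9, 0.2],
    "ladies": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, best first."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(most_similar("python", vectors))
```

With these toy vectors, "javascript" ranks above "ladies" for "python" because its vector points in nearly the same direction.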
  14. ML @ TrustYou
        ● gensim doc2vec model to create hotel embedding
        ● Used - together with other features - for various classifiers
  15. Luigi
        ● Build complex pipelines of batch jobs
          ○ Dependency resolution
          ○ Parallelism
          ○ Resume failed jobs
  16. class MyTask(luigi.Task):
          def output(self):
              # LocalTarget: a concrete Target backed by a local file
              return luigi.LocalTarget("/to/make/this/file")
          def requires(self):
              return [
                  INeedThisTask(),
                  AndAlsoThisTask("with_some arg")
              ]
          def run(self):
              # ... then ...
              # I do this to make it!
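
The "dependency resolution" and "resume failed jobs" points can be sketched without Luigi itself. The following is a simplified illustration, not Luigi's scheduler: `requires` is a hypothetical dict standing in for each task's `requires()` method, the task names are invented, and `already_done` stands in for tasks whose output target already exists. There is no cycle detection.

```python
def resolve(task, requires, done, order):
    """Depth-first walk: schedule requirements before the task itself."""
    if task in done:
        return  # output already exists, so skip the task and its requirements
    for dependency in requires.get(task, []):
        resolve(dependency, requires, done, order)
    done.add(task)
    order.append(task)


def schedule(target, requires, already_done=()):
    """Return the tasks that still need to run, in a valid execution order."""
    done = set(already_done)
    order = []
    resolve(target, requires, done, order)
    return order


# Hypothetical pipeline: task name -> names of the tasks it requires.
requires = {
    "report": ["clean_reviews", "hotel_list"],
    "clean_reviews": ["raw_reviews"],
    "raw_reviews": [],
    "hotel_list": [],
}

print(schedule("report", requires))
# ['raw_reviews', 'clean_reviews', 'hotel_list', 'report']

# Resuming after a failure: finished tasks are not run again.
print(schedule("report", requires, already_done={"raw_reviews", "clean_reviews"}))
# ['hotel_list', 'report']
```

Luigi decides whether a task is done by checking that its `output()` target exists, which is why a rerun picks up where a failed pipeline stopped.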
  17. We’re hiring web developers & data engineers! steffen@trustyou.com or www.trustyou.com/careers
