
Smart Data Meetup - NLP At Scale


My talk @ Smart Data Meetup in Munich: https://www.meetup.com/SmartData/events/237731342/

Learn how to build a modern NLP + deep learning pipeline with spaCy and Keras. Code samples here: https://github.com/trustyou/meetups/tree/master/smart-data


  1. NLP at Scale: TrustYou Review Summaries. Steffen Wenz, CTO @tyengineering. Smart Data Meetup, Sep 2017
  2. What does TrustYou do? For every hotel on the planet, provide a summary of traveler reviews.
  3. ✓ Excellent hotel!
  4. ✓ Excellent hotel! ✓ Nice building “Clean, hip & modern, excellent facilities” ✓ Great view « Vue superbe » (French: “Superb view”)
  5. ✓ Excellent hotel!* ✓ Nice building “Clean, hip & modern, excellent facilities” ✓ Great view « Vue superbe » (French: “Superb view”) ✓ Great for partying “Nice weekend getaway or for partying” ✗ Solo travelers complain about TVs ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt. *) nhow Berlin (Full summary)
  6. steffen@trustyou.com ● Studied CS here in Munich ● Joined TrustYou in 2008 as a working student … ● First product manager, then CTO since 2012 ● Manages a very diverse tech stack and a team of 30 engineers: ○ Data engineers ○ Data scientists ○ Web developers
  7. TrustYou Architecture: Crawling, NLP, Machine Learning, Text Generation, Aggregation, API. TrustYou ♥ Spark + Python. 3M new reviews per week!
  8. Extracting Meaning from Text
  9. Typical NLP pipeline: Raw text → sentence splitting → tokenization → part-of-speech tagging → parsing → structured data!
  10. spaCy ● NLP library ● Implements NLP pipelines for English, German + others ● Focus on performance and production use ○ Largely implemented in Cython … heard of it? :) ● Plays well with machine learning libraries ● Unlike NLTK, which is more for educational use, and sees few updates these days …
  11. import spacy

      nlp = spacy.load("en")
      doc = nlp("This hotel is truly huge and beautiful. I'll be back for sure")
      for word in doc:
          print(word)
  12. doc = nlp("I'll code code")
      for word in doc:
          print(word.text, word.lemma_, word.pos_)
      # I -PRON- PRON
      # 'll will VERB
      # code code VERB
      # code code NOUN
  13. Dependency parsing. Try “displaCy” yourself!
  14. Semantic Analysis at TrustYou ● Custom NLP framework, extension of NLTK ● Supports 20 languages natively! ● Custom, domain-specific tagging and parsing ● Example inputs: “Nice room” · “Room wasn’t so great” · “อาหารรสชาติดี” (Thai: “The food tastes good”) · “خدمة جيدة” (Arabic: “Good service”)
  15. Let’s do some ML! Hm, how to model text as input for ML? ● Enter word vectors! ● Goal: find a mapping word → high-dimensional vector, where similar words have vectors close together ● “Woman” is close to “lady” is close to “womna” (yes, even the typo) ● Word2vec is an algorithm to produce such embeddings
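“Close together” here is usually measured with cosine similarity. A minimal sketch in plain Python, with made-up 3-dimensional toy vectors (real embeddings are learned and have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between u and v: near 1.0 = very similar direction
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-dim "embeddings", invented for illustration only
vectors = {
    "woman": [0.9, 0.1, 0.3],
    "lady":  [0.8, 0.2, 0.3],
    "dude":  [0.1, 0.9, 0.5],
}

print(cosine_similarity(vectors["woman"], vectors["lady"]))  # ~0.99, similar
print(cosine_similarity(vectors["woman"], vectors["dude"]))  # ~0.33, less so
```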
  16. woman, lady, dude = nlp("woman lady dude")
      woman.similarity(lady)  # 0.78
      woman.similarity(dude)  # 0.40
      ● Word2vec considers words to be similar if they occur in similar contexts, i.e. typically have the same words before/after them
  17. (Somewhat Pointless) Application. Goal: Predict review overall score just from title!
  18. (Somewhat Pointless) Application. Goal: Predict review overall score just from title! Input: word vectors. Output: the review score, so just one node. Training = rejiggering the weights of these arrows, trying to closely match the training data.
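That “rejiggering” is gradient descent. A toy sketch with NumPy: plain gradient descent fits the weights of a single linear output node to random data standing in for word vectors (a deliberate simplification of the actual neural network used in the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy training set: 4-dim inputs (stand-ins for word vectors),
# targets generated from a known weight vector so we can check convergence
X = rng.normal(size=(100, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.1])
y = X @ true_w

w = np.zeros(4)   # the "arrows" start untrained
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)  # gradient of mean squared error
    w -= lr * grad                         # rejigger toward the training data

print(w)  # ≈ [0.5, -1.0, 2.0, 0.1] after training
```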
  19. ML 10 years ago: ● Work goes into feature engineering ● Bigram models, POS tags, parse trees … whatever helps. Deep learning now: ● Big NNs capture lots of complexity … can work directly on raw data ● Bad news for domain experts :’(
  20. Keras ● High-level machine learning library ● API for defining neural network architecture ● Training & prediction is done in a backend: ○ TensorFlow ○ Theano ○ …
  21. Neural network topology, in Keras
  22. import keras

      model = keras.models.Sequential()
      model.add(
          keras.layers.Embedding(
              embeddings.shape[0],
              embeddings.shape[1],
              input_length=max_length,
              trainable=False,
              weights=[embeddings],
          )
      )
      model.add(keras.layers.Bidirectional(keras.layers.LSTM(lstm_units)))
      model.add(keras.layers.Dropout(dropout_rate))
      model.add(keras.layers.Dense(1, activation="sigmoid"))
      model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])
  23. Let’s try our model (trained on 1M review titles, mean squared error: 12/100):
      “Perfect” → 97
      “Beautiful hotel” → 95
      “Good hotel” → 84
      “Could have been better” → 65
      “Hotel was not beautiful …” → 51
      “Right in the middle of Munich” → 89
      “Right in the middle of Bagdad” → 89
  24. Try for yourself: code on GitHub (https://github.com/trustyou/meetups/tree/master/smart-data)
  25. ML @ TrustYou ● gensim doc2vec model to create hotel embeddings ● Used, together with other features, for various hotel-level classifiers
  26. Workflow Management & Scaling Up
  27. Hadoop: … slow & massive
  28. Python on Hadoop: … possible, but not natural
  29. Spark ● Distributed computing framework ● User writes a driver program, which transparently schedules execution in a cluster ● Faster and more expressive than MapReduce
  30. Let’s try Spark!
      $ # how old is the C code in CPython?
      $ git clone https://github.com/python/cpython && cd cpython
      $ find . -name "*.c" -exec git blame {} \; > blame
      $ head blame
      dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
      daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
      daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
      badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
      daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
      daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
      daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
      badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *
  31. Let’s try Spark!
      import operator as op, re

      # sc: SparkContext, connection to cluster
      year_re = r"(\d{4})-\d{2}-\d{2}"
      years_hist = (
          sc.textFile("blame")
          .flatMap(lambda line: re.findall(year_re, line))
          .map(lambda year: (year, 1))
          .reduceByKey(op.add)
      )
      output = years_hist.collect()
  32. What happened here?
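Conceptually, the Spark job extracts a year from every blame line and counts occurrences per year; Spark merely distributes that over the cluster. A single-machine Python equivalent, with two made-up sample blame lines, where flatMap corresponds to the nested generator and map + reduceByKey to Counter:

```python
import re
from collections import Counter

year_re = r"(\d{4})-\d{2}-\d{2}"

# stand-ins for lines of the real "blame" file
blame_lines = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2)",
]

# flatMap -> findall over each line; map + reduceByKey -> Counter
years_hist = Counter(
    year for line in blame_lines for year in re.findall(year_re, line)
)

print(sorted(years_hist.items()))  # [('1990', 1), ('1991', 1)]
```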
  33. Luigi ● Build complex pipelines of batch jobs ○ Dependency resolution ○ Parallelism ○ Resume failed jobs
  34. import luigi

      class MyTask(luigi.Task):

          def output(self):
              return luigi.LocalTarget("/to/make/this/file")

          def requires(self):
              return [
                  INeedThisTask(),
                  AndAlsoThisTask("with_some arg"),
              ]

          def run(self):
              # ... then ...
              # I do this to make it!
  35. https://github.com/trustyou/tyluigiutils: utilities for getting Luigi, Spark and virtualenv to work together
  36. We’re hiring data scientists and software engineers! http://www.trustyou.com/careers/ steffen@trustyou.com
