
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark

Talk from PyData London 2015

  1. Who am I: user of Spark since 2012; organiser of the London Spark Meetup; run the Data Science team at Skimlinks.
  2. Apache Spark
  3. The RDD
  4. RDD.map
      >>> thisrdd = sc.parallelize(range(12), 4)
      >>> thisrdd.collect()
      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
      >>> otherrdd = thisrdd.map(lambda x: x % 3)
      >>> otherrdd.collect()
      [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
  5. RDD.map
  6. RDD.map
      >>> otherrdd.zip(thisrdd).collect()
      [(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5), (0, 6), (1, 7), (2, 8), (0, 9), (1, 10), (2, 11)]
      >>> otherrdd.zip(thisrdd).reduceByKey(lambda x, y: x + y).collect()
      [(0, 18), (1, 22), (2, 26)]
  7. RDD.reduceByKey
  8. How to not crash your Spark job: set the number of reducers sensibly; configure your PySpark cluster properly; don't shuffle (unless you have to); don't groupBy; repartition your data if necessary (a couple of these are sketched below).
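      Two of these tips in code form, as a minimal sketch: pairs and rdd are assumed RDDs (not from the slides), and the 200-partition figure is just an illustrative choice.
        # prefer reduceByKey (which combines map-side) over groupByKey,
        # and set the number of reducers explicitly
        counts = pairs.reduceByKey(lambda x, y: x + y, numPartitions=200)

        # repartition skewed or too-coarse data before an expensive stage
        balanced = rdd.repartition(200)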
  9. Lots of people will say 'use Scala'.
  10. Lots of people will say 'use Scala'. Don't listen to those people.
  11. Naive Bayes - recap
  12. Naive Bayes in Spark
      from operator import add

      # get (class label, word) tuples
      label_token = gettokens(docs)
      # [(False, u'https'), (True, u'fashionblog'), (True, u'dress'), (False, u'com'), ...]

      tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
      # [(u'https', [0, 1]), (u'fashionblog', [1, 0]), (u'dress', [1, 0]), (u'com', [0, 1]), ...]

      # get the word count for each class
      termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
      # [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
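      gettokens is not shown in the slides. A minimal sketch of such a helper, assuming docs is an RDD of (label, text) pairs and a simple regex tokeniser:
        import re

        def gettokens(docs):
            # flatMap each (label, text) pair into (label, token) pairs
            return docs.flatMap(lambda (label, text):
                                [(label, token) for token in re.findall(r'\w+', text.lower())])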
  13. Naive Bayes in Spark
      from operator import add, truediv

      # add a pseudocount of one per class
      termcounts_plus_pseudo = termcounts.map(lambda (term, counts): (term, map(add, counts, (1, 1))))
      # [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]
      # => [(u'https', [101, 113]), (u'fashionblog', [1, 101]), (u'dress', [6, 16]), ...]

      # get the total number of words in each class
      values = termcounts_plus_pseudo.map(lambda (term, (truecounts, falsecounts)): (truecounts, falsecounts))
      totals = values.reduce(lambda x, y: map(add, x, y))
      # [1321, 2345]

      P_t = termcounts_plus_pseudo.map(lambda (term, counts): (term, map(truediv, counts, totals)))
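      The slides stop at the per-class term probabilities P_t. A minimal sketch of the missing prediction step: predict, prior_true and prior_false are hypothetical names, and test_docs is assumed to be an RDD of token lists.
        import math

        # ship the model to the workers once: {token: [P(token|True), P(token|False)]}
        P_t_b = sc.broadcast(dict(P_t.collect()))

        def predict(tokens, prior_true=0.5, prior_false=0.5):
            # sum log-probabilities of the tokens under each class, plus the class prior
            score_true, score_false = math.log(prior_true), math.log(prior_false)
            for token in tokens:
                if token in P_t_b.value:
                    p_true, p_false = P_t_b.value[token]
                    score_true += math.log(p_true)
                    score_false += math.log(p_false)
            return score_true > score_false

        predictions = test_docs.map(predict)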
  14. reduceByKey (combineByKey): diagram showing (k1, 1)-style pairs combined locally within each partition (combineLocally builds per-partition {k1: count} maps), then the partial maps merged across partitions (_mergeCombiners) into the final {k1: 10} result.
  15. reduceByKey(numPartitions): the same combineLocally / _mergeCombiners diagram, with the number of shuffle partitions set explicitly.
  16. Naive Bayes in Spark: RDD.aggregate(zeroValue, seqOp, combOp) aggregates the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value".
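      As a toy illustration of the three arguments, a sketch using the thisrdd from slide 4 (not on the slides, just to show the shapes of seqOp and combOp):
        # compute (sum, count) in a single pass
        sum_count = thisrdd.aggregate((0, 0),
                                      lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: fold one element into a partition accumulator
                                      lambda a, b: (a[0] + b[0], a[1] + b[1]))   # combOp: merge two partition accumulators
        # with thisrdd = sc.parallelize(range(12), 4), this gives (66, 12)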
  17. Naive Bayes in Spark: Aggregation
      from operator import add

      class WordFrequencyAggregator(object):
          def __init__(self):
              self.S = {}

          def add(self, (token, count)):
              if token not in self.S:
                  self.S[token] = (0, 0)
              self.S[token] = map(add, self.S[token], count)
              return self

          def merge(self, other):
              for term, count in other.S.iteritems():
                  if term not in self.S:
                      self.S[term] = (0, 0)
                  self.S[term] = map(add, self.S[term], count)
              return self
  18. Naive Bayes in Spark: RDD.aggregate(zeroValue, seqOp, combOp)
      With reduceByKey:
          termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
          # [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), ...]
          # => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]
      With aggregate:
          aggregates = tokencounter.aggregate(WordFrequencyAggregator(),
                                              lambda x, y: x.add(y),
                                              lambda x, y: x.merge(y))
  19. Naive Bayes in Spark: Aggregation
  20. Naive Bayes in Spark: treeAggregation
  21. Naive Bayes in Spark: treeAggregate
      RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2) aggregates the elements of this RDD in a multi-level tree pattern.
      With reduceByKey:
          termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
          # [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
          # => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
      With treeAggregate:
          aggregates = tokencounter.treeAggregate(WordFrequencyAggregator(),
                                                  lambda x, y: x.add(y),
                                                  lambda x, y: x.merge(y),
                                                  depth=4)
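      Not spelled out on the slide: both aggregate and treeAggregate return a single merged aggregator object, so the per-class counts end up in its S dict.
        termcounts_dict = aggregates.S    # e.g. {u'https': [100, 112], u'fashionblog': [0, 100], ...}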
  22. treeAggregate performance: on 1B short documents, RDD.reduceByKey: 18 min; RDD.treeAggregate: 10 min. https://gist.github.com/martingoodson/aad5d06e81f23930127b
  23. Word2Vec
  24. Training Word2Vec in Spark
      from pyspark.mllib.feature import Word2Vec

      inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
      word2vec = Word2Vec()
      model = word2vec.fit(inp)
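      The fitted model can then be queried. A quick sketch, not in the slides; "london" is just an example token from the text8 corpus:
        # nearest neighbours by cosine similarity, and the raw vector for a word
        for synonym, cosine_sim in model.findSynonyms("london", 5):
            print synonym, cosine_sim
        vec = model.transform("london")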
  25. How to use word2vec vectors for classification problems: averaging, clustering, or a convolutional neural network (the averaging option is sketched below).
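      A minimal sketch of the averaging option, assuming word_vectors is a {token: 300-dimensional numpy array} dict built from the word2vec model (not defined in the slides) and docs is an RDD of token lists:
        import numpy as np

        word_vectors_b = sc.broadcast(word_vectors)

        def average_vector(tokens):
            # mean of the vectors of the tokens we have an embedding for
            vecs = [word_vectors_b.value[t] for t in tokens if t in word_vectors_b.value]
            return np.mean(vecs, axis=0) if vecs else np.zeros(300)

        doc_features = docs.map(average_vector)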
  26. K-Means in Spark
      from numpy import array
      from pyspark.mllib.clustering import KMeans, KMeansModel

      word = sc.textFile('GoogleNews-vectors-negative300.txt')
      vectors = word.map(lambda line: array([float(x) for x in line.split('\t')[1:]]))
      clusters = KMeans.train(vectors, 50000, maxIterations=10, runs=10, initializationMode="random")
      clusters_b = sc.broadcast(clusters)
      labels = vectors.map(lambda x: clusters_b.value.predict(x))
  27. Semi-Supervised Naive Bayes
      ● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only.
      ● Loop while the classifier parameters improve:
        ○ (E-step) Use the current classifier, ŵ, to estimate the component membership of each unlabeled document, i.e. the probability that each class generated each document.
        ○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document.
      Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-Supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.), Semi-Supervised Learning. MIT Press: Boston, 2006.
  28. Naive Bayes in Spark: EM
      Instead of hard labels:
          tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
          # [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
      use class probabilities (see the sketch below):
          # [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]), ...]
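      A minimal sketch of the EM loop from slides 27 and 28. train_naive_bayes (the counting and normalisation of slides 12-13, generalised to fractional counts) and classify_proba (the prediction step of the current model) are hypothetical helpers; labeled and unlabeled are assumed RDDs of (label, tokens) pairs and of token lists respectively.
        # hard 0/1 "probabilities" for the labelled data
        hard = labeled.map(lambda (label, tokens): ((label, not label), tokens))
        model = train_naive_bayes(hard)

        for iteration in range(10):
            # E-step: the current model assigns each unlabelled document a probability per class
            soft = unlabeled.map(lambda tokens: (classify_proba(model, tokens), tokens))
            # M-step: re-estimate the classifier from hard labels plus soft (fractional) labels
            model = train_naive_bayes(hard.union(soft))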
  29. Naive Bayes in Spark: EM
      500K labelled examples: precision 0.27, recall 0.15, F1 0.099.
      Add 10M unlabelled examples and run 10 EM iterations: precision 0.26, recall 0.31, F1 0.14.
  30. Naive Bayes in Spark: EM
      240M training examples: precision 0.31, recall 0.19, F1 0.12.
      Add 250M unlabelled examples and run 10 EM iterations: precision 0.26, recall 0.22, F1 0.12.
  31. PySpark Memory: worked example
  32. PySpark Configuration: Worked Example
      10 x r3.4xlarge (122GB RAM, 16 cores each)
      Use half of each machine for the executor: 60GB
      Number of cores = 120 (12 per machine)
      OS: ~12GB
      Each Python process: ~4GB, x 12 = 48GB
      Cache = 60% x 60GB x 10 = 360GB
      Each Java thread: 40% x 60GB / 12 = ~2GB
      More here: http://files.meetup.com/13722842/Spark%20Meetup.pdf
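      One way to write these figures down as Spark settings. A sketch only, assuming a Spark 1.x standalone cluster; note that spark.python.worker.memory is a spill threshold for the Python workers rather than a hard cap.
        from pyspark import SparkConf, SparkContext

        conf = (SparkConf()
                .set("spark.executor.memory", "60g")           # half of each 122GB machine for the JVM executor
                .set("spark.cores.max", "120")                 # 12 cores on each of the 10 machines
                .set("spark.storage.memoryFraction", "0.6")    # the 60% cache fraction (a pre-Spark-1.6 setting)
                .set("spark.python.worker.memory", "4g"))      # ~4GB per Python worker before it spills to disk
        sc = SparkContext(conf=conf)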
  33. We are hiring! martin@skimlinks.com @martingoodson
