
Scaling out logistic regression with Spark

Large-scale multinomial logistic regression with Spark. Contains animated GIFs. Analysis of L-BFGS. Real-world Spark configurations. The SimilarWeb categorization algorithm.

Published in: Software

Scaling out logistic regression with Spark

  1. Scaling Out Logistic Regression with Apache Spark
  2. Nir Cohen - CTO
  3. General background about the company › The company was founded 8 years ago › ~300 employees worldwide › 240 employees in Israel › Stay updated on our open positions on our website, or contact jobs@similarweb.com › Nir Cohen – nir@similarweb.com
  4. The product
  5. Data Size › 650 servers total › Several Hadoop clusters – 120 servers in the biggest › 5 HBase clusters › Couchbase clusters › Kafka clusters › MySQL Galera clusters › 5 TB of new data every day › Full data backup to S3
  6. Plan for the next hour or so › The need › Some history › Spark-related algorithmic intuitions › A dive into Spark › Our additions › Runtime issues › The current categorization algorithm
  7. The Need
  8. Need: The Customer
  9. Need: The Product
  10. Need: The Product – Direct Competitors
  11. Need: How would you classify the Web? › Crawl the web › Collect data about each website › Manually classify a few › Use machine learning to derive a model › Classify all the websites we’ve seen
  12. Some History
  13. LEARNING SET: CLASSES › Shopping – Clothing – Consumer Electronics – Jewelry – … › Sports – Baseball – Basketball – Boxing – … › … › Manually defined: 246 categories in a 2-level tree with 25 parent categories
  14. LEARNING SET: FEATURES › Tag Count source – cnn.com | news | 1 – bbc.com | culture | 50 – … › HTML Analyzer source – cnn.com | money | 14 – nba.com | nba draft | 2 – … › 11 basic sources; a feature is site | tag | score; some are reintroduced after additional processing › Eventually: 16 sources, 18 GB of data, 4M unique features
  15. Our challenge › Large-scale logistic regression – ~500K site samples – 4M unique features – ~800K features/source – 246 classes – Eventually apply the model to 400M sites
  16. FIRST LOGISTIC REGRESSION ATTEMPT › A single-machine Java logistic regression implementation – Highly optimized – Manually tuned loss function – Multi-threaded – Uses plain arrays and divides "stripes" between threads – Works on “summed features” › Only scales up › Pre-combination of features reduces coverage › Runtime: a few days › The code is complex and the algorithm is hard to tweak › Bus test
  17. SECOND LOGISTIC REGRESSION ATTEMPT › Out-of-the-box solution › Customizable › Open source › Distributable
  18. Why we chose Spark › Has an out-of-the-box distributed solution for large-scale multinomial logistic regression › Simplicity › Lower production maintenance costs compared to R › Intention to move to Spark for large, complex algorithmics
  19. Spark-related Algorithmics – an Intuitive Reminder
  20. Basic Regression Method › We want to estimate the value of $y$ based on samples $(x, y)$: $y = f(x, \beta)$, where $\beta$ are unknown function constants › Define a loss function $l(\beta)$ that corresponds with accuracy, for example $l(\beta) \equiv \frac{\sum_{i \in \mathrm{samples}} \left( f(x_i, \beta) - y_i \right)^2}{\#\mathrm{samples}}$ › Find the $\beta$ that minimizes $l(\beta)$
  21. Logistic Regression › For classification we use the logistic function $y = f(x, \beta) = P(y \mid x; \beta) = \frac{e^{\beta x}}{1 + e^{\beta x}}$ › Define a differentiable loss function (the negative log-likelihood) $l(x, \beta) = -\sum_{i \in \mathrm{samples}} \log P(y_i \mid x_i; \beta)$ › We cannot find $\beta$ analytically › However, $l(x, \beta)$ is smooth, continuous and convex! – It has one global minimum
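     For reference, a standard derivation not spelled out on the slide: in the binary case with $y_i \in \{0, 1\}$, the negative log-likelihood and its gradient – the quantity every method below repeatedly evaluates – are
     $l(\beta) = -\sum_{i \in \mathrm{samples}} \big[\, y_i \log f(x_i, \beta) + (1 - y_i) \log\big(1 - f(x_i, \beta)\big) \,\big]$, and $\nabla l(\beta) = \sum_{i \in \mathrm{samples}} \big( f(x_i, \beta) - y_i \big)\, x_i$.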
  22. GRADIENT DESCENT › Generally • The value of $-\nabla l(\beta)$ is a vector that points in the direction of steepest descent • In every step: $\beta_{k+1} = \beta_k - \alpha \nabla l(\beta_k)$ • $\alpha$ – the learning rate • Converges when $\nabla l(\beta) \to 0$ › Spark • $\mathrm{rate} = \frac{\alpha}{\sqrt{\mathrm{iteration\ number}}}$ • SGD – stochastic mini-batch GD
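     A minimal, single-machine sketch of the update above; the 1/sqrt(iteration) decay mirrors Spark's SGD step-size schedule, and `gradient` is a placeholder for computing the gradient of l over the learning set.

        // Plain gradient descent: beta_{k+1} = beta_k - rate * grad(beta_k)
        def gradientDescent(gradient: Array[Double] => Array[Double],
                            init: Array[Double],
                            alpha: Double = 1.0,
                            iterations: Int = 100): Array[Double] = {
          var beta = init.clone()
          for (k <- 1 to iterations) {
            val g = gradient(beta)
            val rate = alpha / math.sqrt(k)      // learning rate shrinks per iteration
            beta = beta.zip(g).map { case (b, gi) => b - rate * gi }
          }
          beta
        }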
  23. LINE SEARCH – DETERMINING STEP SIZE › An approximate method › At each iteration • Find a step size that sufficiently decreases $l$ • By reducing the range of possible step sizes › Spark: • StrongWolfeLineSearch • The sufficiency check is a function of $l(\beta)$ and $\nabla l(\beta)$
  24. Is there a faster way?
  25. Function Analysis for $y = l(\beta_0, \beta_1, \ldots, \beta_n)$ › At a minimum the derivative is 0, so we want the $\beta$ that satisfies $l'(\beta) = 0$ › Gradient vector: $\nabla l(\beta) \equiv \left( \frac{\partial l}{\partial \beta_1}, \frac{\partial l}{\partial \beta_2}, \ldots, \frac{\partial l}{\partial \beta_n} \right)^T$ › Hessian matrix: $H(\beta) \equiv \nabla^2 l(\beta) = \begin{pmatrix} \frac{\partial^2 l}{\partial \beta_1 \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_1 \partial \beta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 l}{\partial \beta_n \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_n \partial \beta_n} \end{pmatrix}$ › In our case that is 800K × 800K – way too much…
  26. NEWTON'S METHOD (NEWTON-RAPHSON) › Newton's iteration: $x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$ › Update function with one feature: $\beta_{k+1} = \beta_k - \frac{l'(\beta_k)}{l''(\beta_k)}$ › Our case, multiple features: $\beta_{k+1} = \beta_k - H^{-1}(\beta_k) \, \nabla l(\beta_k)$, where $H^{-1}(\beta_k)$ is the inverse Hessian matrix › Image: "NewtonIteration Ani" by Ralf Pfeifer – NewtonIteration_Ani.gif, https://en.wikipedia.org/wiki/Newton's_method
  27. Illustration for a simple parabola (1 feature): GRADIENT DESCENT vs. NEWTON'S GRADIENT DESCENT (images from here)
  28. Is there a faster and simpler way?
  29. SECANT METHOD (QUASI-NEWTON) › Approximation of the second derivative: $l''(\beta_1) \approx \frac{l'(\beta_1) - l'(\beta_0)}{\beta_1 - \beta_0}$ › Newton's iteration becomes $\beta_{k+1} = \beta_k - \frac{l'(\beta_k)}{l''(\beta_k)} \approx \beta_k - l'(\beta_k) \, \frac{\beta_k - \beta_{k-1}}{l'(\beta_k) - l'(\beta_{k-1})}$ › The Hessian is not needed! In our case, we need only $\nabla l$ › Animation from here
  30. Requirements and Convergence Rate
      Newton-Raphson                                      | Quasi-Newton
      Analytical formula for the gradient                 | Analytical formula for the gradient
      Compute the gradient at each step – O(M × N)        | Compute the gradient at each step – O(M × N)
      Analytical formula for the Hessian                  | Save the last calculations of the gradient
      Compute the inverse Hessian at each step – O(M² N)  |
      Order of convergence q = 2                          | Order of convergence q = 1.6
      Which is faster? Which is cheaper (memory, CPU) over 1000 iterations for M = 100,000 features? Which of gradient descent, Newton, or quasi-Newton should we use?
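     A back-of-the-envelope answer to the memory question above (my arithmetic, not on the slide): just storing a dense Hessian for M = 100,000 features takes $M^2 \times 8$ bytes $= 10^{10} \times 8$ bytes $\approx 80$ GB per iteration, before even inverting it, which is why quasi-Newton methods that only keep recent gradient information are the practical choice at this scale.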
  31. BFGS – Quasi-Newton with Line Search › Initially guess $\beta_0$ and set $H_0^{-1} = I$ › In each step $k$ – Calculate the descent direction $p_k = -H_k^{-1} \, \nabla f(\beta_k)$ – Find the step size $\alpha_k$ using line search (with Wolfe conditions) – Update $\beta_{k+1} = \beta_k + \alpha_k p_k$ – Update $H_{k+1}^{-1} = H_k^{-1} + \mathrm{updateFunc}(H_k^{-1}, \nabla f(\beta_k), \nabla f(\beta_{k+1}), \alpha_k, p_k)$ › Stop when the improvement is small enough › More info: BFGS
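     To make the loop concrete, a minimal, non-distributed BFGS sketch in Scala with Breeze (my own illustration, not Spark's implementation): the crude backtracking loop stands in for StrongWolfeLineSearch, and `loss`/`grad` are placeholders for the regularized log-likelihood and its gradient.

        import breeze.linalg.{DenseMatrix, DenseVector, norm}

        def bfgs(loss: DenseVector[Double] => Double,
                 grad: DenseVector[Double] => DenseVector[Double],
                 beta0: DenseVector[Double],
                 maxIter: Int = 100,
                 tol: Double = 1e-6): DenseVector[Double] = {
          val n = beta0.length
          var hInv = DenseMatrix.eye[Double](n)              // H_0^{-1} = I
          var beta = beta0
          var g = grad(beta)
          var iter = 0
          while (iter < maxIter && norm(g) > tol) {
            val p = -(hInv * g)                              // direction p_k = -H_k^{-1} * grad
            var alpha = 1.0                                  // crude backtracking line search
            var tries = 0
            while (tries < 30 && loss(beta + p * alpha) > loss(beta) + 1e-4 * alpha * (g dot p)) {
              alpha *= 0.5; tries += 1
            }
            val betaNext = beta + p * alpha                  // beta_{k+1} = beta_k + alpha_k * p_k
            val gNext = grad(betaNext)
            val s = betaNext - beta                          // step taken
            val y = gNext - g                                // change in gradient
            val rho = 1.0 / (y dot s)
            val i = DenseMatrix.eye[Double](n)
            // BFGS update of the inverse Hessian approximation
            hInv = (i - (s * y.t) * rho) * hInv * (i - (y * s.t) * rho) + (s * s.t) * rho
            beta = betaNext; g = gNext; iter += 1
          }
          beta
        }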
  32. Back To Engineering
  33. Challenges Implementing Logistic Regression › To get the values of the gradient we need to instantiate the formula with the learning set – For every iteration we need to go over the learning set › If we want to speed this up through parallelization, we need to ship the model or the learning set to each thread/process › Single machine -> the process is CPU bound › Multiple machines -> network bound › With a large number of features, memory becomes a problem as well
  34. Why we chose L-BFGS › The only out-of-the-box multinomial logistic regression › Gives good value for money – A good tradeoff between cost per iteration and number of iterations › Uses Spark's GeneralizedLinearModel API (sketched below):
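     A minimal sketch of what using that API looks like; the parameter values are illustrative (taken from numbers elsewhere in the deck), not the team's exact configuration.

        import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.rdd.RDD

        // Multinomial logistic regression through the GeneralizedLinearModel-based API.
        def trainCategoryModel(training: RDD[LabeledPoint]) = {
          val lr = new LogisticRegressionWithLBFGS()
            .setNumClasses(246)            // 246 categories
          lr.optimizer
            .setNumIterations(100)
            .setRegParam(0.1)              // L2 regularization constant (lambda)
            .setConvergenceTol(1e-4)
          lr.run(training)                 // returns a LogisticRegressionModel
        }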
  35. L-BFGS › L stands for Limited Memory – Replace the Hessian, an $M \times M$ matrix, with a few (~10) of the most recent updates of $\nabla f(\beta_k)$ and $f(\beta_k)$, which are $M$-sized vectors › spark.LBFGS – A distributed wrapper over breeze.LBFGS – Mostly, distribution of the gradient calculation › The rest is not distributed › Ships around the model and collects gradient values – Uses L2 regularization – Scales features
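     For reference, the lower-level entry point that the GLM API wraps, sketched under the assumption of 246 classes and a caller-supplied `data` RDD, `numFeatures` and `lambda`:

        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

        // Multinomial weights are a flattened (numClasses - 1) x numFeatures vector.
        val initialWeights = Vectors.zeros((246 - 1) * numFeatures)

        val (weights, lossHistory) = LBFGS.runLBFGS(
          data,                        // RDD[(label, features)] – the cached learning set
          new LogisticGradient(246),   // gradient of the multinomial logistic loss
          new SquaredL2Updater(),      // the built-in L2 regularization
          10,                          // numCorrections: the "limited memory" of ~10 recent updates
          1e-4,                        // convergence tolerance
          100,                         // max iterations
          lambda,                      // regularization constant
          initialWeights)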
  36. Spark internals (diagram): a distributed sub-loop (max 10); the learning set is distributed but cached on executors; partial aggregation on executors, final aggregation on the driver
  37. AGGREGATE & TREE AGGREGATE › Aggregate • Each executor holds a portion of the learning set • Broadcast the model (the weights $\beta$) to the executors • Collect the partial gradients to the driver › TreeAggregate • A simple heuristic to add a level • Perform partial aggregation by shipping results to other executors (by repartitioning)
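     A simplified sketch of the per-iteration gradient computation the slide describes: broadcast the current weights, fold each partition of the cached learning set into a partial (gradient, loss) pair, and let treeAggregate merge the partials through an intermediate level before the final combine on the driver. `localGradient` is a placeholder for the per-sample logistic gradient and loss, not a Spark API.

        import breeze.linalg.DenseVector
        import org.apache.spark.broadcast.Broadcast
        import org.apache.spark.rdd.RDD

        def computeGradient(
            data: RDD[(Double, Array[Double])],                 // (label, features), cached on executors
            weights: Broadcast[DenseVector[Double]],            // the model, broadcast each iteration
            numWeights: Int,
            localGradient: (DenseVector[Double], (Double, Array[Double])) => (DenseVector[Double], Double)
        ): (DenseVector[Double], Double) = {
          data.treeAggregate((DenseVector.zeros[Double](numWeights), 0.0))(
            seqOp = { case ((gradSum, lossSum), sample) =>
              val (g, l) = localGradient(weights.value, sample) // partial aggregation on executors
              (gradSum + g, lossSum + l)
            },
            combOp = { case ((g1, l1), (g2, l2)) => (g1 + g2, l1 + l2) },
            depth = 2)                                          // the extra aggregation level
        }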
  38. Job UI – big job
  39. Implementation
  40. Overfitting › We have more features than samples › Some features are poorly represented › For example: – Only one sample has the “carbon” tag – That sample is labeled “automotive” › The model would give this feature a high weight for the “automotive” class and 0 for the others – Do you think that is correct? › How would you solve this?
  41. Regularization › A solution internal to the regression mechanism › We introduce regularization into the cost function: $l_{total}(\beta, x) = l_{model}(\beta, x) + \lambda \cdot l_{reg}(\beta)$, with L2 regularization $l_{reg}(\beta) = \frac{1}{2} \|\beta\|^2$ › $\lambda$ – the regularization constant › What happens if $\lambda$ is too large? › What happens if $\lambda$ is too small? › Spark's LBFGS has L2 built in
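     For intuition, a standard identity not spelled out on the slide: the L2 term simply adds $\lambda \beta$ to the gradient the optimizer sees, $\nabla l_{total}(\beta) = \nabla l_{model}(\beta) + \lambda \beta$, constantly pulling the weights of poorly represented features back toward zero.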
  42. Finding the Best Lambda › We choose the best $\lambda$ using cross-validation – Set aside 30% of the learning set and use it for testing › Build a model for every $\lambda$ and compare precision › Let's parallelize? Is there a more efficient way to do this? – We use the fact that for a large $\lambda$ the model is underfitted and converges fast – Start from a large $\lambda$ and use its model as the starting point of the next iteration (see the sketch below)
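     A sketch of the warm-started lambda search described above; `train` is a placeholder for one full L-BFGS run that returns the fitted weights and the precision on the 30% hold-out set (not a Spark API), and the halving schedule matches the table on the next slide.

        import org.apache.spark.mllib.linalg.{Vector, Vectors}

        def searchLambda(train: (Double, Vector) => (Vector, Double),
                         numWeights: Int): (Double, Vector) = {
          var lambda = 25.0
          var weights: Vector = Vectors.zeros(numWeights)
          var best = (lambda, weights, 0.0)
          while (lambda > 0.003) {
            val (w, precision) = train(lambda, weights)   // large lambda: underfitted, converges fast
            if (precision > best._3) best = (lambda, w, precision)
            weights = w                                   // warm start for the next, smaller lambda
            lambda /= 2.0
          }
          (best._1, best._2)                              // best lambda and its model weights
        }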
  43. CHOOSING REGULARIZATION PARAMETER
      Lambda | Precision | Iterations
      25     | 35.06%    | 3
      12.5   | 35.45%    | 12
      6.25   | 36.68%    | 5
      3.125  | 38.41%    | 5
      1.563  | Failure!  | –
      0.781  | 45.87%    | 13
      0.391  | 50.64%    | 10
      0.195  | 55.04%    | 13
      0.098  | 58.33%    | 17
      0.049  | 60.93%    | 19
      0.024  | 62.33%    | 21
      0.012  | 64.30%    | 25
      0.006  | 65.95%    | 42
      0.003  | 65.46%    | 38
      › After choosing the best lambda, we can use the complete learning set to calculate the final model
      › Failures can be caused externally or internally
      › Avg iteration time: 2 sec
  44. LBFGS EXTENSIONS & BUG FIXES › The Spark layer of LBFGS swallows all failures – and returns bad weights › Feature scaling was always on – Redundant in our case – Rendered the passed weights unusable – Lowered model precision › Expose the effective number of iterations to external monitoring › Enable passing starting weights into LBFGS › More transparency
  45. SPARK ADDITIONS & BUG FIXES › PoliteLBFGS, an addition to spark.LBFGS (class PoliteLbfgs extends spark.Lbfgs) – 3-5% more precise (for our data) – 30% faster calculation › Planning to contribute back to Spark › Was it worth the trouble? › po·lite: pəˈlīt/ having or showing behavior that is respectful and considerate of others. synonyms: well mannered, civil, courteous, mannerly, respectful, deferential, well behaved
  46. Job UI – small job
  47. RUNNING
  48. Hardware › 110 machines › 5.20 TB memory › 6,600 VCores › YARN › Block size: 128 MB › The cluster is shared with other MapReduce jobs and HBase › 60 VCores per machine › 64 GB memory – ~1 GB per VCore › 12 cores – 5 VCores per physical core (tuned for MapReduce) › CentOS 6.6 › cdh-5.4.8
  49. Execution – Good Neighboring › Each source has a different number of samples and features › Execution profiles for a single learning run (a SparkConf sketch for the Large profile follows):
      Profile                | Small      | Large
      #Samples               | ~50K       | 500K
      Input size             | under 1 GB | 1 GB – 3 GB
      #Executors             | 2          | 22
      Executor memory        | 2g         | 4g
      Driver memory          | 2g         | 18g
      YARN driver overhead   | 2g
      YARN executor overhead | 1g
      #Jobs per profile      | 200        | 180
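     A sketch of the "Large" profile expressed as a SparkConf; in practice these were presumably passed as spark-submit options on YARN, and the exact property spellings below (Spark 1.x on YARN) are my assumption, not taken from the deck.

        import org.apache.spark.SparkConf

        val largeProfile = new SparkConf()
          .set("spark.executor.instances", "22")
          .set("spark.executor.memory", "4g")
          .set("spark.driver.memory", "18g")
          .set("spark.yarn.driver.memoryOverhead", "2048")   // MB – the 2g YARN driver overhead
          .set("spark.yarn.executor.memoryOverhead", "1024") // MB – the 1g YARN executor overhead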
  50. Execution Example
      Hardware: driver                                    | 2 cores, 20g memory
      Hardware: executors                                 | 22 machines × (2 cores, 5g memory)
      Number of features                                  | 100,000
      Number of samples                                   | 500,000
      Total number of iterations (trying 14 different λ)  | 152
      Avg iteration time                                  | 18.8 sec
      Total learning time                                 | 2863 sec (48 minutes)
      Max iterations for a single λ                       | 30
  51. Could you guess the reason for the difference?
      Run | Phase name        | Real time [sec] | Iteration time [sec] | Iterations
      1   | parent-glm-AVTags | 29101           | 153.2                | 190
      2   | parent-glm-AVTags | 15226           | 82.3                 | 185
      3   | parent-glm-AVTags | 2863            | 18.8                 | 152
      • OK, I admit, the cluster was very loaded during the first run
      • What about the second? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
      • Fix: increase spark.shuffle.memoryFraction=0.5
  52. AKKA IN THE REAL WORLD › spark.akka.frameSize = 100 › spark.akka.askTimeout = 200 › spark.akka.lookupTimeout = 200 › Response times are slower when the cluster is loaded › askTimeout seems to be particularly responsible for executor failures when removing broadcasts and unpersisting RDDs
  53. Kryo Stability › Kryo uses quite a lot of memory – If the buffer is not sufficient, the process will crash – spark.kryoserializer.buffer.max.mb = 512
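     The settings from the two slides above (plus the shuffle fix from slide 51) collected into one SparkConf sketch; the property names are as they existed in the Spark 1.x line that cdh-5.4.8 ships.

        import org.apache.spark.SparkConf

        val stabilityConf = new SparkConf()
          .set("spark.akka.frameSize", "100")                // MB, allows larger Akka messages
          .set("spark.akka.askTimeout", "200")               // seconds
          .set("spark.akka.lookupTimeout", "200")            // seconds
          .set("spark.shuffle.memoryFraction", "0.5")        // fix for the MetadataFetchFailedException
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryoserializer.buffer.max.mb", "512")  // avoid Kryo buffer overflow crashes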
  54. LEARNING SET: CLASSES › Shopping – Clothing – Consumer Electronics – Jewelry – … › Sports – Baseball – Basketball – Boxing – … › … › Manually defined: 246 categories in a 2-level tree with 25 parent categories
  55. LEARNING SET: FEATURES › Tag Count source – cnn.com | news | 1 – bbc.com | culture | 50 – … › HTML Analyzer source – cnn.com | money | 14 – nba.com | nba draft | 2 – … › 11 basic sources; a feature is site | tag | score; some are reintroduced after additional processing › Eventually: 16 sources, ~500K site samples, 18 GB of data, 4M unique features, ~800K features/source
  56. Need: How would you improve over time? › We collect different kinds of data: – Tags – Links – User behavior – … › How do we identify where to focus collection efforts? › How do we improve the classification algorithm?
  57. Current Approach – Training (16 sources, 25 L1 classes) › foreach source – Choose the 100K most influential features – Train a model for L1 – foreach L1 class (avg 9.2 L2 classes per L1) › Train a model for L2 › foreach source – foreach sample in the training set › Calculate the probabilities (θ) of belonging to each of the L1 classes › Train a Random Forest using the L1 probabilities set
  58. Current Approach – Application › foreach site to classify – foreach source › Calculate the probabilities (θ) of belonging to each L1 class – Aggregate the results and estimate L1 (using the RF model) – Given the estimated L1, foreach source › Calculate the estimated L2 – Choose (by voting) the final L2
  59. OTHER EXTENSIONS › Extend mllib.LogisticRegressionModel to return probabilities instead of the final decision from the “predict” method › For example – Site: nhl.com – Instead of “is L1=sports” – We produce › P(news) = 30% › P(sports) = 65% › P(art) = 5% › model.advise(p: point) – see the sketch below
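     A minimal sketch of the idea, not the team's actual extension: given one weight vector per class and a feature vector, return softmax probabilities instead of a single decision. The name `advise` and the plain-array weight layout are assumptions; mllib's multinomial LogisticRegressionModel stores its weights in a different, flattened layout.

        // Softmax over per-class margins: P(class | features).
        def advise(classWeights: Array[Array[Double]], features: Array[Double]): Array[Double] = {
          val margins = classWeights.map(w => w.zip(features).map { case (wi, xi) => wi * xi }.sum)
          val maxMargin = margins.max                       // subtract the max for numerical stability
          val exps = margins.map(m => math.exp(m - maxMargin))
          val total = exps.sum
          exps.map(_ / total)
        }

        // e.g. for nhl.com this yields something like P(news) = 0.30, P(sports) = 0.65, P(art) = 0.05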
  60. Summary: This Approach vs. Straight Logistic Regression › Increases precision by using more features › Increases coverage by using very granular features › We have feedback (from the RF) regarding the quality of each source – Using out-of-bag error › Natural parallelization by source › No need for feature scaling
