2014-08-14 Alpine Innovation to Spark

Spark is rapidly catching fire with the machine learning and data science community for a number of reasons. Predominantly, it is making it possible to extend and enhance machine learning algorithms to a level we’ve never seen before. In this talk, we’ll give examples of two areas Alpine Data Labs has contributed to the Spark project:

Bio:
DB Tsai is a Machine Learning Engineer working at Alpine Data Labs. His current focus is on Big Data, Data Mining, and Machine Learning. He uses Hadoop, Spark, Mahout, and several Machine Learning algorithms to build powerful, scalable, and robust cloud-driven applications. His favorite programming languages are Java, Scala, and Python. DB is a Ph.D. candidate in Applied Physics at Stanford University (currently taking leave of absence). He holds a Master’s degree in Electrical Engineering from Stanford University, as well as a Master's degree in Physics from National Taiwan University.

  1. Learn more about Advanced Analytics at http://www.alpinenow.com
     Innovation on Spark
     DB Tsai (dbtsai@alpinenow.com) and Sung Chung (schung@alpinenow.com)
     Machine Learning Engineering @AlpineDataLabs
     August 14, 2014
  2. The Path to Innovation
     • TRADITIONAL DESKTOP
     • IN-DATABASE METHODS
     • WEB-BASED AND COLLABORATIVE
     • SIMPLIFIED CODE-FREE
     • HADOOP & MPP DATABASE
     • ONGOING INNOVATION
  3. The Path to Innovation
     • Iterative algorithms scan through the data each time.
     • With Spark, data is cached in memory after the first iteration.
     • Quasi-Newton methods enhance in-memory benefits.
     • [Chart] 150m rows: 921s vs. 97s.
  4. Machine Learning in the Big Data Era
     • Hadoop MapReduce solutions
     • MapReduce scales well for batch processing.
     • Lots of machine learning algorithms are iterative by nature.
     • People use lots of tricks, like training on subsamples of the data and then averaging the models. Why have big data if you’re only approximating?
  5. Lightning-fast cluster computing
     • Empower users to iterate through the data by utilizing the in-memory cache.
     • Logistic regression runs up to 100x faster than Hadoop M/R in memory.
     • We’re able to train exact models without doing any approximation.
  6. Why does Alpine support MLlib?
     • MLlib is a Spark subproject providing Machine Learning primitives.
     • It’s built on Apache Spark, a fast and general engine for large-scale data processing.
     • Shipped with Apache Spark since version 0.8
     • High quality engineering design and effort
     • More than 50 contributors since July 2014
     • Alpine is 100% committed to open source to facilitate industry adoption that is driven by business needs.
  7. AutoML
     • The success of machine learning crucially relies on human machine learning experts, who select appropriate features, workflows, paradigms, algorithms, and their hyper-parameters.
     • Even though the hyper-parameters can be chosen by grid search with cross-validation, a problem with more than two parameters becomes very difficult. It’s a non-convex optimization problem.
     • There is a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. - AutoML workshop @ ICML’14
  8. Random Forest
     • An ensemble learning method for classification & regression that operates by constructing a multitude of decision trees at training time.
     • A “black box” that needs little tuning and can automatically identify the structure, interactions, and relationships in the data.
     • A technique to reduce the variance of single decision tree predictions by averaging the predictions of many de-correlated trees.
     • De-correlation is achieved through Bagging and / or randomly selecting features per tree node.
     NOTE: Most Kaggle competitions have at least one top entry that heavily uses Random Forests.
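The two de-correlation mechanisms named on this slide can be sketched in a few lines. This is only an illustration in plain Python, not Sequoia Forest or MLlib code; the function names and the fixed seed are assumptions made for the example.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Bagging: draw n_rows row indices with replacement for one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features to consider at one tree node."""
    return sorted(rng.sample(range(n_features), k))

def average_predictions(tree_predictions):
    """Regression forest output: the mean of the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
rows = bootstrap_sample(10, rng)
features = random_feature_subset(n_features=9, k=3, rng=rng)
```

Each tree sees a different bootstrap sample and each node a different feature subset, which is what keeps the averaged trees from being copies of one another.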
  9. Sequoia Forest
     Why Sequoia Forest?
     MLlib already has a decision tree implementation, but it doesn’t support random features and is not optimized to train on large clusters.
     What does Sequoia Forest do?
     • Classification and Regression.
     • Numerical and Categorical Features.
     What’s next? Gradient Boosting
     Where can you find it? https://github.com/AlpineNow/SparkML2
     We’re merging it back into MLlib, and it is licensed under the Apache License.
     More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees.
  10. SPARK-1157: L-BFGS Optimizer
      • No, it's not a blender!
  11. What is SPARK-1157: L-BFGS Optimizer
      • Merged in Spark 1.0
      • A popular algorithm for parameter estimation in Machine Learning.
      • It’s a quasi-Newton method.
      • The Hessian matrix of second derivatives doesn't need to be evaluated directly.
      • The Hessian matrix is approximated using gradient evaluations.
      • It converges much faster than the default optimizer in Spark, Gradient Descent.
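How the Hessian can be approximated from gradient evaluations alone is easiest to see in the standard L-BFGS two-loop recursion. The following is a minimal plain-Python sketch of that textbook recursion, not the Breeze or MLlib implementation; `lbfgs_direction` and its argument names are chosen for this example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate (inverse Hessian) * grad using only
    recent steps s_k = x_{k+1} - x_k and gradient changes
    y_k = g_{k+1} - g_k. Histories are ordered oldest first."""
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest pair.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        alphas.append((rho, alpha))
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest pair.
    for (rho, alpha), s, y in zip(reversed(alphas), s_hist, y_hist):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r  # the search direction is the negative of this vector
```

With an empty history the result is the plain gradient, so the first step is a gradient descent step; every later step reuses the stored pairs instead of forming a Hessian.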
  13. SPARK-2934: LogisticRegressionWithLBFGS
      • Merged in Spark 1.1
      • Uses L-BFGS to train Logistic Regression instead of the default Gradient Descent.
      • Users don't have to construct their own objective function for Logistic Regression, and don't have to implement all the details.
      • Together with SPARK-2979, which minimizes the condition number, the convergence rate is further improved.
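For context, the objective function that users would otherwise have to construct by hand looks roughly like the following. It is a plain-Python sketch of the mean negative log-likelihood and its gradient for binary logistic regression, assuming labels in {0, 1} and no regularization; it is not the MLlib code.

```python
import math

def logistic_loss_and_gradient(weights, points):
    """points: list of (label in {0, 1}, feature list).
    Returns the mean negative log-likelihood and its gradient."""
    n = len(points)
    loss = 0.0
    grad = [0.0] * len(weights)
    for label, x in points:
        margin = sum(w * xi for w, xi in zip(weights, x))
        # log(1 + exp(margin)), computed without overflow for large |margin|
        log1p_exp = max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        loss += log1p_exp - label * margin
        # predicted probability of label 1: exp(margin) / (1 + exp(margin))
        p = math.exp(margin - log1p_exp)
        for j, xi in enumerate(x):
            grad[j] += (p - label) * xi
    return loss / n, [g / n for g in grad]
```

An optimizer such as L-BFGS only needs this pair (value, gradient) at each iterate.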
  15. a9a Dataset Benchmark
      [Chart] Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements).
      Cluster: 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster.
      Series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Sparse Features, GD Dense Features.
      Axes: Seconds; Log-Likelihood / Number of Samples.
  16. a9a Dataset Benchmark
      [Chart] Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements).
      Cluster: 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster.
      Series: L-BFGS, GD.
      Axes: Iterations; Log-Likelihood / Number of Samples.
  17. rcv1 Dataset Benchmark
      [Chart] Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements).
      Cluster: 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster.
      Series: LBFGS Sparse Vector, GD Sparse Vector.
      Axes: Seconds; Log-Likelihood / Number of Samples.
  18. news20 Dataset Benchmark
      [Chart] Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements).
      Cluster: 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster.
      Series: LBFGS Sparse Vector, GD Sparse Vector.
      Axes: Seconds; Log-Likelihood / Number of Samples.
  19. SPARK-2979: Improve the convergence rate by standardizing the training features
      • Merged in Spark 1.1
      • Due to the invariance property of MLEs, the scale of your inputs is irrelevant.
      • However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling.
      • The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users.
      • Without this, training on datasets that mix columns with different scales may not converge.
      • scikit-learn and the glmnet package also standardize the features before training to improve convergence.
      • Only enabled for Logistic Regression for now.
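Why training in the scaled space can stay transparent to users comes down to one identity: if each feature is divided by its standard deviation, dividing the learned weight by the same number gives identical predictions on the raw features. The sketch below assumes scaling by the standard deviation only (no mean removal) and uses made-up weights; whether SPARK-2979 does exactly this is an assumption of the example, and zero-variance columns are not handled.

```python
import math

def column_std(rows):
    """Sample standard deviation of each feature column."""
    n = len(rows)
    stds = []
    for col in zip(*rows):
        mean = sum(col) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1)))
    return stds

def to_original_space(scaled_weights, stds):
    """If x_scaled = x / std, then w_scaled . x_scaled = (w_scaled / std) . x."""
    return [w / s for w, s in zip(scaled_weights, stds)]

rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]  # columns on very different scales
stds = column_std(rows)
scaled_rows = [[v / s for v, s in zip(r, stds)] for r in rows]
w_scaled = [0.5, -1.5]  # hypothetical weights learned in the scaled space
w_original = to_original_space(w_scaled, stds)
```

The margins computed from `w_scaled` on `scaled_rows` and from `w_original` on `rows` agree, so the user only ever sees coefficients for the original features.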
  20. SPARK-2272: Transformer
      A spark, the soul of a transformer
  21. SPARK-2272: Transformer
      • Merged in Spark 1.1
      • MLlib data preprocessing pipeline.
      • StandardScaler
        - Standardizes features by removing the mean and scaling to unit variance.
        - The RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance.
      • Normalizer
        - Normalizes samples individually to unit L^n norm.
        - A common operation for text classification or clustering, for instance.
        - For example, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors.
  22. StandardScaler
  23. Normalizer
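The cosine-similarity claim about l2-normalized vectors can be checked with a small plain-Python sketch. The `normalize` helper and the two example vectors are made up for illustration; this is not the MLlib `Normalizer` API.

```python
import math

def normalize(vector, p=2.0):
    """Scale one sample to unit L^p norm; zero vectors are returned unchanged."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / norm for v in vector] if norm > 0 else list(vector)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0, 0.0]  # hypothetical TF-IDF vectors
b = [4.0, 3.0, 0.0]
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
normalized_dot = dot(normalize(a), normalize(b))
```

Here `normalized_dot` and `cosine` agree up to rounding, which is why normalizing once up front lets later similarity computations use a plain dot product.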
  24. SPARK-1969: Online summarizer
      • Merged in Spark 1.1
      • Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
      • Two online summarizers can be merged, so we can use one summarizer for each block of data in the map phase, and merge all of them in the reduce phase to obtain the global summarizer.
      • A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
      • Optimized for sparse vectors: the time complexity is O(non-zeros) instead of O(numCols) for each sample.
      • [Formulas] Two-pass algorithm; Naive algorithm.
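A mergeable one-pass summarizer of this kind can be sketched with Welford's update plus the pairwise merge formula from the referenced Wikipedia page. The class below handles a single scalar column only and is an illustration, not the MLlib implementation; the sparse-vector optimization is left out.

```python
from functools import reduce

class OnlineSummarizer:
    """One-pass mean, variance, min, and max for a stream of numbers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, x):
        # Welford's update: avoids subtracting two large, nearly equal sums.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        # Combine two partial summaries without revisiting the data.
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        """Unbiased sample variance."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# "Map" one summarizer over each block, then "reduce" by merging.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0]]
partials = [reduce(lambda s, x: s.add(x), block, OnlineSummarizer()) for block in blocks]
total = reduce(lambda a, b: a.merge(b), partials, OnlineSummarizer())
```

Because `merge` is associative, the partial summaries can be combined in any order, which is what makes the map and reduce split possible.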
  27. SPARK-2479: MLlib UnitTests
      • Merged in Spark 1.1
      • Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.
      • Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.
      • That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.
      • Scala syntax-sugar comparators are implemented using implicit conversions, allowing developers to write unit tests more easily.
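The underlying idea, comparing within a tolerance instead of with `==`, can be sketched in plain Python. The helper names below are made up for this example and are not the Scala comparators added to MLlib; Python's standard library offers `math.isclose` for the same purpose.

```python
def approx_equal_abs(a, b, abs_tol):
    """True when a and b differ by less than an absolute tolerance."""
    return abs(a - b) < abs_tol

def approx_equal_rel(a, b, rel_tol):
    """True when a and b differ by less than rel_tol times the larger magnitude."""
    largest = max(abs(a), abs(b))
    return abs(a - b) < rel_tol * largest if largest > 0 else True

total = 0.1 + 0.2
exactly_equal = (total == 0.3)                  # False: rounding error
close_enough = approx_equal_abs(total, 0.3, 1e-9)  # True
```

A relative tolerance is usually the safer choice when the compared values can be very large or very small, since an absolute tolerance that suits one magnitude is meaningless at another.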
  29. SPARK-1892: OWL-QN Optimizer (ongoing work)
      • It extends L-BFGS to handle L2 and L1 regularization together (balanced with alpha, as in elastic nets).
      • We fixed a couple of issues (#247) in Breeze's OWLQN implementation, and this work is based on that.
      • Blocked by SPARK-2505
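The alpha balance between L1 and L2 can be written down directly. The sketch below uses the parameterization common in the elastic net literature (alpha weights the L1 term, 1 - alpha the halved squared L2 term); that the Spark work uses exactly this form is an assumption, and `reg_param` is a name chosen for the example.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """alpha = 1.0 gives pure L1 (lasso), alpha = 0.0 gives pure L2 (ridge)."""
    l1 = sum(abs(w) for w in weights)
    l2 = 0.5 * sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) * l2)
```

The L1 part is not differentiable at zero, which is why plain L-BFGS is not enough and an orthant-wise variant such as OWL-QN is needed once alpha is above zero.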
  30. SPARK-2505: Weighted Regularization (ongoing work)
      • Each component of the weights can be penalized differently.
      • We can exclude the intercept from regularization in this framework.
      • Decouples regularization from the raw gradient update, which is not used in other optimization schemes.
      • Allows various update/learning rate schemes (adagrad, normalized adaptive gradient, etc.) to be applied independently of the regularization.
      • Smooth and L1 regularization will be handled differently in the optimizer.
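Per-component penalties, with the intercept excluded, can be sketched as a vector of regularization strengths where the intercept's entry is zero. This is a plain-Python illustration of the idea for the smooth (L2) case only; the function name and the convention of storing the intercept last are assumptions of the example, not the SPARK-2505 design.

```python
def weighted_l2(weights, penalties):
    """penalties[j] is the L2 strength applied to weights[j].
    Returns the penalty value and its gradient."""
    value = 0.5 * sum(p * w * w for w, p in zip(weights, penalties))
    gradient = [p * w for w, p in zip(weights, penalties)]
    return value, gradient

weights = [2.0, -1.0, 0.5]    # last entry plays the role of the intercept
penalties = [0.1, 0.1, 0.0]   # a zero penalty excludes the intercept
value, gradient = weighted_l2(weights, penalties)
```

Because the penalty is computed separately from the loss gradient, the same loss code can be paired with any update scheme.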
  31. SPARK-2309: Multinomial Logistic Regression (ongoing work)
      • For a multinomial problem with K classes, we can generalize it via K - 1 linear models with logit link functions.
      • As a result, the weights will have dimension (K - 1)(N + 1), where N is the number of features.
      • The MLlib interface is designed for one set of parameters per model, so it requires some interface design changes.
      • Expected to be merged in the next release of MLlib, Spark 1.2
      Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
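The K - 1 model formulation can be sketched as follows: one class acts as the pivot with its margin fixed at zero, and each remaining class gets its own row of N weights plus an intercept. This plain-Python example is an illustration under those assumptions (pivot is class 0, intercept stored last in each row), not the MLlib implementation.

```python
import math

def multinomial_probabilities(weight_rows, features):
    """weight_rows: K - 1 rows of N + 1 numbers (N weights, then the intercept).
    Class 0 is the pivot with margin fixed at 0. Returns K probabilities."""
    margins = [0.0]
    for row in weight_rows:
        # zip stops after the N feature weights; row[-1] is the intercept
        margins.append(sum(w * x for w, x in zip(row, features)) + row[-1])
    top = max(margins)  # subtract the max before exp for numerical stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# K = 3 classes and N = 2 features give (K - 1) * (N + 1) = 6 parameters.
weight_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = multinomial_probabilities(weight_rows, [math.log(2.0), math.log(3.0)])
```

With K = 2 this reduces to ordinary binary logistic regression with a single row of weights.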
  32. We’re open source friendly and tech driven!
      • Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop, JavaScript, D3.js, etc.
      • Actively involved in the open source community: almost all of our newly developed algorithms in Spark will be contributed back to MLlib.
      • Actively developing communication infrastructure between our application and Spark on YARN (application lifecycle, error reporting, progress monitoring, interactive commands, etc.)
      • In addition to Spark, we are the maintainers of several open source projects, including Chorus, an SBT plugin for JUnit test Listener, and an Akka-based R engine.
      • Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of Martin Odersky), H.Y. Li (author of Tachyon), Jason Lee (student of Trevor Hastie), etc.
      • Organizes the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng (Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera), etc.
  33. We're hiring!
      • Machine Learning Engineer
      • Data Scientist
      • UI/UX Engineer
      • Platform Engineer
      • Automation Test Engineer
      Shoot me an email at dbtsai@alpinenow.com
  34. For more information, contact us
      1550 Bryant Street, Suite 1000, San Francisco, CA 94103, USA
      +1 (877) 542-0062
      www.alpinenow.com
      Get Started Today! http://start.alpinenow.com
