Learn more about Advanced Analytics at http://www.alpinenow.com
Innovation on
DB Tsai
dbtsai@alpinenow.com
Sung Chung
schung@alpinenow.com
Machine Learning Engineering @AlpineDataLabs
August 14, 2014
The Path to Innovation
TRADITIONAL DESKTOP
IN-DATABASE METHODS
WEB-BASED AND COLLABORATIVE
SIMPLIFIED CODE-FREE
HADOOP & MPP DATABASE
ONGOING INNOVATION
The Path to Innovation
Iterative algorithms scan through the data each time.
With Spark, data is cached in memory after the first iteration.
Quasi-Newton methods enhance the in-memory benefits.
[Chart labels: 921s, 97s, 150m rows]
Machine Learning in the Big Data Era
•  Hadoop MapReduce solutions
•  MapReduce scales well for batch processing
•  Many machine learning algorithms are iterative by nature
•  People use workarounds such as training on subsamples of the
data and then averaging the models. Why have big data if you’re only
approximating?
Lightning-fast cluster computing
•  Empower users to iterate
through the data by utilizing
the in-memory cache.
•  Logistic regression runs up
to 100x faster than Hadoop
M/R in memory.
•  We’re able to train exact
models without doing any
approximation.
Why does Alpine support MLlib?
•  MLlib is a Spark subproject providing machine learning
primitives.
•  It’s built on Apache Spark, a fast and general engine for
large-scale data processing.
•  Shipped with Apache Spark since version 0.8
•  High quality engineering design and effort
•  More than 50 contributors since July 2014
•  Alpine is 100% committed to open source to facilitate industry
adoption that is driven by business needs.
AutoML
•  The success of machine learning crucially relies on human machine
learning experts, who select appropriate features, workflows,
paradigms, algorithms, and their hyper-parameters.
•  Even though the hyper-parameters can be chosen by grid search with
cross-validation, a problem with more than two parameters becomes
very difficult. It’s a non-convex optimization problem.
•  There is a demand for off-the-shelf machine learning methods that
can be used easily and without expert knowledge.
- AutoML workshop @ ICML’14
Random Forest
•  An ensemble learning method for classification &
regression that operates by constructing a multitude of
decision trees at training time.
•  A “black box” that needs little tuning and can
automatically identify the structure, interactions, and
relationships in the data.
•  A technique to reduce the variance of single decision
tree predictions by averaging the predictions of many de-
correlated trees.
•  De-correlation is achieved through Bagging and / or
randomly selecting features per tree node.
NOTE: Most Kaggle competitions have at least one top
entry that heavily uses Random Forests.
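The two de-correlation mechanisms named on this slide, bagging and per-node random feature selection, can be sketched in plain Python. This is only an illustration, not Sequoia Forest or MLlib code: the function names and the square-root default for the subset size are assumptions made for the example.

```python
import math
import random


def bootstrap_sample(n_rows, rng):
    """Bagging: draw n_rows row indices with replacement for one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, rng, k=None):
    """Pick the candidate features for one tree node.

    The default subset size sqrt(n_features) is an assumption for this sketch.
    """
    if k is None:
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


def average_predictions(tree_predictions):
    """Regression forest output: the mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)                             # fixed seed for repeatability
rows_for_tree = bootstrap_sample(10, rng)           # rows one tree trains on
features_for_node = random_feature_subset(16, rng)  # features one node may split on
```

Because each tree sees a different resampled dataset and each node a different feature subset, the trees' errors are less correlated, which is what makes averaging them reduce variance.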
Sequoia Forest
Why Sequoia Forest?
MLlib already has a decision tree implementation, but it doesn’t support random features and is not
optimized to train on large clusters.
What does Sequoia Forest do?
•  Classification and Regression.
•  Numerical and Categorical Features.
What’s next?
Gradient Boosting
Where can you find it?
https://github.com/AlpineNow/SparkML2
We’re merging it back into MLlib, and it is licensed under the Apache License.
More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees
SPARK-1157: L-BFGS Optimizer
•  No, it’s not a blender!
What is SPARK-1157: L-BFGS Optimizer?
•  Merged in Spark 1.0
•  A popular algorithm for parameter estimation in machine
learning.
•  It’s a quasi-Newton method.
•  The Hessian matrix of second derivatives doesn't need to be
evaluated directly.
•  The Hessian matrix is approximated using gradient evaluations.
•  It converges much faster than the default optimizer in Spark,
gradient descent.
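To make "the Hessian is approximated using gradient evaluations" concrete, here is a minimal pure-Python sketch of the standard L-BFGS two-loop recursion with a simple backtracking line search. It is not the Spark or Breeze implementation; the function names, the history size, the tolerances, and the toy quadratic objective are all assumptions chosen for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate -(inverse Hessian) * grad from history."""
    q = list(grad)
    alphas = []
    # First loop: newest (s, y) pair to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for s, y, alpha in zip(s_hist, y_hist, reversed(alphas)):
        beta = dot(y, q) / dot(y, s)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def minimize(f, grad_f, x0, history=5, max_iter=100, tol=1e-8):
    x = list(x0)
    g = grad_f(x)
    s_hist, y_hist = [], []
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = lbfgs_direction(g, s_hist, y_hist)
        fx, slope, step = f(x), dot(g, d), 1.0
        # Backtracking line search with the Armijo sufficient-decrease test.
        for _attempt in range(50):
            x_new = [xi + step * di for xi, di in zip(x, d)]
            if f(x_new) <= fx + 1e-4 * step * slope:
                break
            step *= 0.5
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        # Keep only pairs with positive curvature so the approximation
        # stays positive definite.
        if dot(s, y) > 1e-10 * dot(s, s):
            s_hist = (s_hist + [s])[-history:]
            y_hist = (y_hist + [y])[-history:]
        x, g = x_new, g_new
    return x


# Toy objective (an assumption for the demo): a poorly scaled quadratic.
def f(x):
    return (x[0] - 3.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]


solution = minimize(f, grad_f, [0.0, 0.0])
```

Only `f` and `grad_f` are ever evaluated; the curvature information comes entirely from differences of successive gradients, which is why no Hessian has to be formed or stored.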
SPARK-2934: LogisticRegressionWithLBFGS
•  Merged in Spark 1.1
•  Uses L-BFGS to train Logistic Regression instead of the
default Gradient Descent.
•  Users don't have to construct the objective function for
Logistic Regression themselves or implement all the
details.
•  Together with SPARK-2979, which reduces the condition
number, the convergence rate is further improved.
a9a Dataset Benchmark
Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements)
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster
[Plot: Log-Likelihood / Number of Samples (y-axis) vs. Seconds (x-axis). Series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Sparse Features, GD Dense Features]
a9a Dataset Benchmark
Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements)
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster
[Plot: Log-Likelihood / Number of Samples (y-axis) vs. Iterations (x-axis). Series: L-BFGS, GD]
rcv1 Dataset Benchmark
Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements)
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster
[Plot: Log-Likelihood / Number of Samples (y-axis) vs. Seconds (x-axis). Series: LBFGS Sparse Vector, GD Sparse Vector]
news20 Dataset Benchmark
Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements)
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster
[Plot: Log-Likelihood / Number of Samples (y-axis) vs. Seconds (x-axis). Series: LBFGS Sparse Vector, GD Sparse Vector]
SPARK-2979: Improve the convergence rate by
standardizing the training features
•  Merged in Spark 1.1
•  Due to the invariance property of MLEs, the scale of your inputs is
irrelevant.
•  However, the optimizer will not be happy with poor condition numbers,
which can often be improved by scaling.
•  The model is trained in the scaled space, but the coefficients are
converted back to the original space; as a result, it's transparent to users.
•  Without this, some training datasets that mix columns with different
scales may not converge.
•  scikit-learn and the glmnet package also standardize the features before
training to improve convergence.
•  Only enabled in Logistic Regression for now.
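The step "trained in the scaled space, coefficients converted back to the original space" is plain algebra, sketched below in Python. The sketch assumes the scaling z = (x - mean) / std for each feature; whether Spark also subtracts the mean is not stated on the slide, so treat that as an assumption (set every mean to 0 to model scaling only). The function names are hypothetical.

```python
def to_original_space(w_scaled, b_scaled, mean, std):
    """Map coefficients learned on z = (x - mean) / std back to raw features.

    w_scaled . z + b_scaled == w . x + b  for every x, where
    w[j] = w_scaled[j] / std[j]  and  b = b_scaled - sum(w[j] * mean[j]).
    """
    w = [ws / s for ws, s in zip(w_scaled, std)]
    b = b_scaled - sum(wj * m for wj, m in zip(w, mean))
    return w, b


def margin(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Two columns on very different scales (illustrative numbers).
mean, std = [1000.0, 0.2], [500.0, 0.1]
w_scaled, b_scaled = [0.4, -1.2], 0.3

x = [2000.0, 0.5]
z = [(xj - m) / s for xj, m, s in zip(x, mean, std)]

w, b = to_original_space(w_scaled, b_scaled, mean, std)
margin_scaled = margin(w_scaled, b_scaled, z)
margin_original = margin(w, b, x)
```

Both margins agree, so predictions are unchanged and the user only ever sees coefficients expressed in the original units.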
SPARK-2272: Transformer
A spark, the soul of a transformer
SPARK-2272: Transformer
•  Merged in Spark 1.1
•  MLlib data preprocessing pipeline.
•  StandardScaler
-  Standardizes features by removing the mean and scaling to unit variance.
-  The RBF kernel of Support Vector Machines and the L1 and L2 regularizers of linear
models typically work better with zero mean and unit variance.
•  Normalizer
-  Normalizes samples individually to unit L^n norm.
-  A common operation for text classification or clustering, for instance.
-  For example, the dot product of two L2-normalized TF-IDF vectors is the
cosine similarity of the vectors.
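The two transformers differ in direction: standardization works per column (feature), normalization per row (sample). The sketch below is plain Python rather than the MLlib API; the function names are hypothetical, and dividing by n - 1 for the variance is an assumption of this example.

```python
import math


def standardize(rows):
    """Column-wise: subtract the mean, divide by the sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for r in rows
    ]


def normalize(v, p=2.0):
    """Row-wise: scale one sample to unit L^p norm."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


scaled = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

u, v = [1.0, 2.0, 2.0], [2.0, 0.0, 1.0]
dot_of_normalized = dot(normalize(u), normalize(v))
cosine = cosine_similarity(u, v)
```

Here `dot_of_normalized` and `cosine` agree up to rounding, which is the TF-IDF property mentioned above.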
StandardScaler
Normalizer
SPARK-1969: Online summarizer
•  Merged in Spark 1.1
•  Online algorithms for computing the mean, variance, min, and max in a streaming
fashion.
•  Two online summarizers can be merged, so we can use one summarizer for each block of
data in the map phase, and merge all of them in the reduce phase to obtain the global
summarizer.
•  A numerically stable one-pass algorithm is implemented to avoid the catastrophic
cancellation of the naive implementation.
Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
•  Optimized for sparse vectors: the time complexity is O(non-zeros) instead of
O(numCols) for each sample.
[Figure: two-pass algorithm vs. naive algorithm]
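A mergeable one-pass summarizer can be sketched with Welford's update and the pairwise merge formula from the Wikipedia page referenced above. This is a scalar, plain-Python illustration under assumed names (`OnlineSummarizer`, `add`, `merge`), not the MLlib class, which works on vectors and handles sparsity.

```python
class OnlineSummarizer:
    """One-pass mean / variance / min / max that can be merged across blocks."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, x):
        # Welford's update: avoids subtracting two large, nearly equal sums.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        # Combine two partial summaries (for example, one per data block).
        if other.n == 0:
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        """Sample variance (divides by n - 1)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# A large offset makes the naive sum-of-squares formula lose precision.
offset = 1e9
block_a = OnlineSummarizer().add(offset + 4.0).add(offset + 7.0)     # "map" on block 1
block_b = OnlineSummarizer().add(offset + 13.0).add(offset + 16.0)   # "map" on block 2
summary = block_a.merge(block_b)                                     # "reduce"
```

The merged summary matches what a single pass over all four values would produce, which is the property that lets the summarizer run per block in the map phase.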
SPARK-2479: MLlib UnitTests
•  Merged in Spark 1.1
•  Floating point math is not exact, and most floating-point numbers end up
being slightly imprecise due to rounding errors.
•  Simple values like 0.1 cannot be precisely represented using binary
floating point numbers, and the limited precision of floating point numbers
means that slight changes in the order of operations or the precision of
intermediates can change the result.
•  That means that comparing two floats to see if they are equal is usually not
what we want. As long as this imprecision stays small, it can usually be
ignored.
•  Scala syntactic-sugar comparators are implemented using implicit conversions,
making it easier for developers to write unit tests.
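The comparison problem on this slide is easy to reproduce. The Python sketch below shows an exact comparison failing and two tolerance-based alternatives; the helper names are hypothetical and only mirror the idea of absolute and relative tolerances, not the Scala comparators themselves.

```python
import math


def approx_equal_abs(a, b, eps):
    """Equal within an absolute tolerance; suits values near zero."""
    return abs(a - b) <= eps


def approx_equal_rel(a, b, eps):
    """Equal within a tolerance relative to the larger magnitude."""
    return abs(a - b) <= eps * max(abs(a), abs(b))


total = 0.1 + 0.2
exact_match = (total == 0.3)                    # False: rounding error in binary
close_abs = approx_equal_abs(total, 0.3, 1e-12)
close_std = math.isclose(total, 0.3)            # standard-library relative check

# For large magnitudes an absolute tolerance is too strict; a relative one is not.
big_abs = approx_equal_abs(1e10 + 1.0, 1e10, 1e-9)
big_rel = approx_equal_rel(1e10 + 1.0, 1e10, 1e-9)
```

Which tolerance to use depends on the quantity under test: absolute for values expected to be near zero, relative for values whose magnitude varies.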
SPARK-1892: OWL-QN Optimizer
ongoing work
•  It extends L-BFGS to handle L2 and L1 regularizations
together
(balanced with alpha as in elastic nets)
•  We fixed a couple of issues (#247) in Breeze's OWLQN
implementation, and this work is based on that.
•  Blocked by SPARK-2505
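The "balanced with alpha" remark refers to the elastic-net penalty. A minimal Python sketch follows, assuming the glmnet-style parameterization lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2); the slide does not spell out the exact form used in Spark, so the formula and the function name are assumptions.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Elastic-net penalty: alpha = 1 is pure L1, alpha = 0 is pure L2."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


weights = [1.0, -2.0]
mixed = elastic_net_penalty(weights, lam=0.5, alpha=0.5)
pure_l1 = elastic_net_penalty(weights, lam=0.5, alpha=1.0)
pure_l2 = elastic_net_penalty(weights, lam=0.5, alpha=0.0)
```

The L1 term is not differentiable at zero, which is why plain L-BFGS cannot handle it and an orthant-wise method such as OWL-QN is needed.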
SPARK-2505: Weighted Regularization
ongoing work
•  Each component of the weights can be penalized differently.
•  We can exclude the intercept from regularization in this framework.
•  Decouples regularization from the raw gradient update, which is
not used in other optimization schemes.
•  Allows various update/learning rate schemes (AdaGrad,
normalized adaptive gradient, etc.) to be applied independently of
the regularization.
•  Smooth and L1 regularization will be handled differently in
the optimizer.
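Per-component penalties can be pictured as a vector of penalty factors, one per weight, with a factor of 0 for the intercept. The Python sketch below covers only the smooth (L2) part and returns both the penalty and its gradient so that it can be added to any loss gradient; the names and the convention of storing the intercept last are assumptions for illustration, not the proposed Spark design.

```python
def weighted_l2(weights, factors, lam):
    """Smooth penalty 0.5 * lam * sum(factor_j * w_j^2) and its gradient."""
    penalty = 0.5 * lam * sum(f * w * w for f, w in zip(factors, weights))
    gradient = [lam * f * w for f, w in zip(factors, weights)]
    return penalty, gradient


# Two feature weights plus an intercept stored last (assumed layout).
weights = [2.0, -1.0, 5.0]
factors = [1.0, 2.0, 0.0]  # second weight penalized twice as hard, intercept not at all

penalty, penalty_gradient = weighted_l2(weights, factors, lam=0.1)
```

Because the penalty gradient is computed separately from the loss gradient, the same regularizer can be combined with different update or learning-rate schemes.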
SPARK-2309: Multinomial Logistic Regression
ongoing work
•  A K-class multinomial problem can be modeled with
K - 1 linear models with logit link functions.
•  As a result, the weights will have dimension (K - 1)(N + 1),
where N is the number of features.
•  The MLlib interface is designed for one set of parameters per
model, so it requires some interface design changes.
•  Expected to be merged in the next release of MLlib, Spark 1.2
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
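The K - 1 parameterization can be sketched as follows: class 0 acts as the pivot with an implicit margin of 0, and each remaining class has its own row of N weights plus an intercept. This is a plain-Python illustration with a hypothetical function name and weight layout (intercept stored last in each row), not the MLlib interface.

```python
import math


def predict_probabilities(weights, x):
    """Class probabilities from K - 1 linear models; class 0 is the pivot.

    `weights` has K - 1 rows, each with N feature weights followed by an intercept.
    """
    margins = [
        sum(w * xj for w, xj in zip(row[:-1], x)) + row[-1] for row in weights
    ]
    # Subtract the largest margin (the pivot's margin is 0) for numerical stability.
    shift = max(0.0, max(margins))
    pivot = math.exp(-shift)
    exps = [math.exp(m - shift) for m in margins]
    total = pivot + sum(exps)
    return [pivot / total] + [e / total for e in exps]


# K = 3 classes, N = 2 features: (K - 1) * (N + 1) = 6 parameters.
weights = [
    [1.0, 0.0, 0.0],  # model for class 1
    [0.0, 1.0, 0.0],  # model for class 2
]
x = [math.log(2.0), math.log(3.0)]
probabilities = predict_probabilities(weights, x)
parameter_count = sum(len(row) for row in weights)
```

With these weights the margins are log 2 and log 3, so the three class probabilities are in the ratio 1 : 2 : 3.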
We’re open source friendly and tech driven!
•  Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop,
JavaScript, D3.js, etc.
•  Actively involved in the open source community: almost all of our newly developed algorithms
in Spark will be contributed back to MLlib.
•  Actively developing the communication infrastructure between applications and Spark on YARN
(application lifecycle, error reporting, progress monitoring, interactive commands, etc.)
•  In addition to Spark, we are the maintainers of several open source projects, including Chorus,
an SBT plugin for JUnit test Listener, and an Akka-based R engine.
•  Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of
Martin Odersky), H.Y. Li (author of Tachyon), Jason Lee (student of Trevor Hastie), etc.
•  Organizes the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng
(Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera),
etc.
We're hiring!
•  Machine Learning Engineer
•  Data Scientist
•  UI/UX Engineer
•  Platform Engineer
•  Automation Test Engineer
Shoot me an email at
dbtsai@alpinenow.com
For more information, contact us
1550 Bryant Street
Suite 1000
San Francisco, CA 94103
USA
+1 (877) 542-0062
www.alpinenow.com
Get Started Today!
http://start.alpinenow.com

More Related Content

What's hot

CaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterCaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterJen Aman
 
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsDatabricks
 
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningLeveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningDatabricks
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowDatabricks
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkDatabricks
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkDatabricks
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Spark Summit
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
Which Is Deeper - Comparison Of Deep Learning Frameworks On SparkSpark Summit
 
PandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesPandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesDatabricks
 
Scalable Automatic Machine Learning in H2O
 Scalable Automatic Machine Learning in H2O Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OSri Ambati
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Productionizing Deep Reinforcement Learning with Spark and MLflow
Productionizing Deep Reinforcement Learning with Spark and MLflowProductionizing Deep Reinforcement Learning with Spark and MLflow
Productionizing Deep Reinforcement Learning with Spark and MLflowDatabricks
 
Data Intensive Applications with Apache Flink
Data Intensive Applications with Apache FlinkData Intensive Applications with Apache Flink
Data Intensive Applications with Apache FlinkSimone Robutti
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersAhsan Javed Awan
 

What's hot (20)

CaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterCaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark Cluster
 
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data Analytics
 
Spark 101
Spark 101Spark 101
Spark 101
 
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningLeveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflow
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache Spark
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 
PandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesPandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled Ensembles
 
Scalable Automatic Machine Learning in H2O
 Scalable Automatic Machine Learning in H2O Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2O
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Productionizing Deep Reinforcement Learning with Spark and MLflow
Productionizing Deep Reinforcement Learning with Spark and MLflowProductionizing Deep Reinforcement Learning with Spark and MLflow
Productionizing Deep Reinforcement Learning with Spark and MLflow
 
Data Intensive Applications with Apache Flink
Data Intensive Applications with Apache FlinkData Intensive Applications with Apache Flink
Data Intensive Applications with Apache Flink
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
 

Viewers also liked

Single cover progression
Single cover progressionSingle cover progression
Single cover progressionryanrococo
 
Contents progress
Contents progressContents progress
Contents progressryanrococo
 
Manipulating Photographs 1L
Manipulating Photographs 1LManipulating Photographs 1L
Manipulating Photographs 1Lryanrococo
 
Double page spread process
Double page spread processDouble page spread process
Double page spread processryanrococo
 
Steph manipulation of photographs
Steph   manipulation of photographsSteph   manipulation of photographs
Steph manipulation of photographsryanrococo
 
Imogen manipulation of photographs
Imogen   manipulation of photographsImogen   manipulation of photographs
Imogen manipulation of photographsryanrococo
 
Isabel Movie Plotline
Isabel Movie PlotlineIsabel Movie Plotline
Isabel Movie Plotlineryanrococo
 
K atie maniuplation of photographs
K atie   maniuplation of photographsK atie   maniuplation of photographs
K atie maniuplation of photographsryanrococo
 
Weather conditions
Weather conditionsWeather conditions
Weather conditionsryanrococo
 
Framing shots inspirations
Framing shots   inspirationsFraming shots   inspirations
Framing shots inspirationsryanrococo
 
Decolonizzazione
DecolonizzazioneDecolonizzazione
Decolonizzazionetave88
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...DB Tsai
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache SparkDB Tsai
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkDB Tsai
 

Viewers also liked (18)

Single cover progression
Single cover progressionSingle cover progression
Single cover progression
 
Contents progress
Contents progressContents progress
Contents progress
 
Design drafts
Design draftsDesign drafts
Design drafts
 
Manipulating Photographs 1L
Manipulating Photographs 1LManipulating Photographs 1L
Manipulating Photographs 1L
 
Double page spread process
Double page spread processDouble page spread process
Double page spread process
 
Steph manipulation of photographs
Steph   manipulation of photographsSteph   manipulation of photographs
Steph manipulation of photographs
 
Imogen manipulation of photographs
Imogen   manipulation of photographsImogen   manipulation of photographs
Imogen manipulation of photographs
 
Clothing
ClothingClothing
Clothing
 
Isabel Movie Plotline
Isabel Movie PlotlineIsabel Movie Plotline
Isabel Movie Plotline
 
K atie maniuplation of photographs
K atie   maniuplation of photographsK atie   maniuplation of photographs
K atie maniuplation of photographs
 
Weather conditions
Weather conditionsWeather conditions
Weather conditions
 
Framing shots inspirations
Framing shots   inspirationsFraming shots   inspirations
Framing shots inspirations
 
Decolonizzazione
DecolonizzazioneDecolonizzazione
Decolonizzazione
 
Clothing 2
Clothing 2Clothing 2
Clothing 2
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 

Similar to 2014-08-14 Alpine Innovation to Spark

Alpine innovation final v1.0
Alpine innovation final v1.0Alpine innovation final v1.0
Alpine innovation final v1.0alpinedatalabs
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsScott Clark
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsSigOpt
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABCodeOps Technologies LLP
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache SparkQuantUniversity
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at FacebookDatabricks
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesTuhin Mahmud
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Spark Summit
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...CloudxLab
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData WebinarSnappyData
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephenSteve Feldman
 




2014-08-14 Alpine Innovation to Spark

  • 1. Learn more about Advanced Analytics at http://www.alpinenow.com Innovation on DB Tsai dbtsai@alpinenow.com Sung Chung schung@alpinenow.com Machine Learning Engineering @AlpineDataLabs August 14, 2014
  • 2. The Path to Innovation
      •  Traditional desktop
      •  In-database methods
      •  Web-based and collaborative
      •  Simplified code-free
      •  Hadoop & MPP database
      •  Ongoing innovation
  • 3. The Path to Innovation
      •  Iterative algorithms scan through the data each time.
      •  With Spark, data is cached in memory after the first iteration.
      •  Quasi-Newton methods enhance the in-memory benefits.
      •  Chart labels: 150M rows; 921s; 97s.
  • 4. Machine Learning in the Big Data Era
      •  Hadoop MapReduce solutions
      •  MapReduce scales well for batch processing.
      •  Many machine learning algorithms are iterative by nature.
      •  People use lots of tricks, like training on subsamples of the data and then averaging the models. Why have big data if you're only approximating?
  • 5. Lightning-fast cluster computing
      •  Empower users to iterate through the data by utilizing the in-memory cache.
      •  Logistic regression runs up to 100x faster than Hadoop M/R in memory.
      •  We're able to train exact models without doing any approximation.
  • 6. Why does Alpine support MLlib?
      •  MLlib is a Spark subproject providing Machine Learning primitives.
      •  It's built on Apache Spark, a fast and general engine for large-scale data processing.
      •  Shipped with Apache Spark since version 0.8.
      •  High quality engineering design and effort.
      •  More than 50 contributors as of July 2014.
      •  Alpine is 100% committed to open source to facilitate industry adoption that is driven by business needs.
  • 7. AutoML
      •  The success of machine learning crucially relies on human machine learning experts, who select appropriate features, workflows, paradigms, algorithms, and their hyper-parameters.
      •  Even though the hyper-parameters can be chosen by grid search with cross-validation, a problem with more than two parameters becomes very difficult and challenging. It's a non-convex optimization problem.
      •  There is a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. (AutoML workshop @ ICML'14)
  • 8. Random Forest
      •  An ensemble learning method for classification & regression that operates by constructing a multitude of decision trees at training time.
      •  A "black box" that needs little tuning and can automatically identify the structure, interactions, and relationships in the data.
      •  A technique to reduce the variance of single decision tree predictions by averaging the predictions of many de-correlated trees.
      •  De-correlation is achieved through bagging and/or randomly selecting features per tree node.
      NOTE: Most Kaggle competitions have at least one top entry that heavily uses Random Forests.
  • 9. Sequoia Forest
      Why Sequoia Forest? MLlib already has a decision tree implementation, but it doesn't support random features and is not optimized to train on large clusters.
      What does Sequoia Forest do?
      •  Classification and regression.
      •  Numerical and categorical features.
      What's next? Gradient Boosting.
      Where can you find it? https://github.com/AlpineNow/SparkML2
      We're merging it back into MLlib, and it is licensed under the Apache License.
      More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees
  • 10. Spark-1157: L-BFGS Optimizer
      •  No, it's not a blender!
  • 11. What is Spark-1157: L-BFGS Optimizer?
      •  Merged in Spark 1.0.
      •  A popular algorithm for parameter estimation in Machine Learning.
      •  It's a quasi-Newton method.
      •  The Hessian matrix of second derivatives doesn't need to be evaluated directly.
      •  The Hessian matrix is approximated using gradient evaluations.
      •  It converges much faster than the default optimizer in Spark, Gradient Descent.
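
To make the "Hessian approximated using gradient evaluations" point on slide 11 concrete, here is a minimal pure-Python sketch of L-BFGS. It is not Spark's or Breeze's implementation: the function name `lbfgs`, the history size `m`, and the simple backtracking line search are assumptions chosen for illustration.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs(f, grad, x0, m=5, max_iter=100, tol=1e-8):
    """Minimise f using only gradient evaluations (illustrative sketch)."""
    x = list(x0)
    g = grad(x)
    history = deque(maxlen=m)  # most recent (s, y, rho) triples, oldest first
    for iteration in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            return x, iteration
        # Two-loop recursion: turn g into (approximate inverse Hessian) * g.
        q = list(g)
        alphas = []
        for s, y, rho in reversed(history):  # newest to oldest
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y, _ = history[-1]
            gamma = dot(s, y) / dot(y, y)  # scaling of the initial Hessian guess
            q = [gamma * qi for qi in q]
        for (s, y, rho), a in zip(history, reversed(alphas)):  # oldest to newest
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]
        # Backtracking line search with the Armijo sufficient-decrease test.
        step, fx, slope = 1.0, f(x), dot(g, direction)
        while True:
            x_new = [xi + step * di for xi, di in zip(x, direction)]
            if f(x_new) <= fx + 1e-4 * step * slope or step < 1e-12:
                break
            step *= 0.5
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 1e-10 * dot(y, y):  # keep only pairs with positive curvature
            history.append((s, y, 1.0 / sy))
        x, g = x_new, g_new
    return x, max_iter


# An ill-conditioned quadratic, where plain gradient descent tends to be slow.
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad(x):
    return [x[0], 100.0 * x[1]]


x_min, iterations = lbfgs(f, grad, [1.0, 1.0])
```

Only the last `m` pairs of position and gradient differences are stored, which is what keeps the memory footprint small compared with a full Hessian.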
  • 13. SPARK-2934: LogisticRegressionWithLBFGS
      •  Merged in Spark 1.1.
      •  Uses L-BFGS to train Logistic Regression instead of the default Gradient Descent.
      •  Users don't have to construct the objective function for Logistic Regression themselves, and don't have to implement all the details.
      •  Together with SPARK-2979, which minimizes the condition number, the convergence rate is further improved.
  • 15. a9a Dataset Benchmark
      Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements). 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster.
      Plot: Log-Likelihood / Number of Samples vs. Seconds. Series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Sparse Features, GD Dense Features.
  • 16. a9a Dataset Benchmark
      Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements). 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster.
      Plot: Log-Likelihood / Number of Samples vs. Iterations. Series: L-BFGS, GD.
  • 17. rcv1 Dataset Benchmark
      Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements). 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster.
      Plot: Log-Likelihood / Number of Samples vs. Seconds. Series: LBFGS Sparse Vector, GD Sparse Vector.
  • 18. news20 Dataset Benchmark
      Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements). 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster.
      Plot: Log-Likelihood / Number of Samples vs. Seconds. Series: LBFGS Sparse Vector, GD Sparse Vector.
  • 19. SPARK-2979: Improve the convergence rate by standardizing the training features
      •  Merged in Spark 1.1.
      •  Due to the invariance property of MLEs, the scale of your inputs is irrelevant.
      •  However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling.
      •  The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users.
      •  Without this, some training datasets that mix columns with different scales may not be able to converge.
      •  The scikit-learn and glmnet packages also standardize the features before training to improve convergence.
      •  Only enabled in Logistic Regression for now.
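
The "train in the scaled space, convert the coefficients back" step on slide 19 can be sketched as follows. This is an illustration, not Spark's code: the helper names are hypothetical, and it assumes both mean-centering and division by the standard deviation, whereas the actual Spark implementation may scale without centering.

```python
import math


def standardize_columns(rows):
    """Return (scaled_rows, means, stds) using the population standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    scaled = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    return scaled, means, stds


def to_original_space(w_scaled, b_scaled, means, stds):
    """Map coefficients learned on scaled features back to raw features."""
    w = [wj / sj for wj, sj in zip(w_scaled, stds)]
    b = b_scaled - sum(wj * mj for wj, mj in zip(w, means))
    return w, b


def margin(w, b, x):
    return b + sum(wj * xj for wj, xj in zip(w, x))


# Two columns on very different scales (made-up illustrative values).
rows = [[1.0, 1000.0], [2.0, 3000.0], [4.0, 2000.0], [5.0, 6000.0]]
scaled, means, stds = standardize_columns(rows)

# Pretend these came out of an optimizer run on the scaled data.
w_scaled, b_scaled = [0.7, -1.3], 0.25
w, b = to_original_space(w_scaled, b_scaled, means, stds)
```

Because the margin computed from `(w, b)` on the raw rows equals the margin computed from `(w_scaled, b_scaled)` on the scaled rows, predictions are identical and the scaling stays invisible to the user.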
  • 20. SPARK-2272: Transformer
      A spark, the soul of a transformer
  • 21. SPARK-2272: Transformer
      •  Merged in Spark 1.1.
      •  MLlib data preprocessing pipeline.
      •  StandardScaler
          -  Standardizes features by removing the mean and scaling to unit variance.
          -  The RBF kernel of Support Vector Machines and the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance.
      •  Normalizer
          -  Normalizes samples individually to unit L^p norm.
          -  A common operation in text classification or clustering, for instance.
          -  For example, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of the vectors.
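
The Normalizer claim on slide 21 is easy to check numerically. The sketch below is plain Python, not the MLlib API; `normalize` and `cosine_similarity` are hypothetical helper names, and the two vectors are made-up values standing in for TF-IDF vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v, p=2.0):
    """Scale a single sample to unit L^p norm (zero vectors are returned as-is)."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


doc_a = [0.0, 2.0, 1.0, 0.0, 3.0]
doc_b = [1.0, 1.0, 0.0, 0.0, 2.0]

# After L2 normalization the plain dot product is the cosine similarity.
similarity = dot(normalize(doc_a), normalize(doc_b))
```

Normalizing once up front means later similarity computations reduce to dot products.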
  • 22. StandardScaler
  • 23. Normalizer
  • 24. SPARK-1969: Online summarizer
      •  Merged in Spark 1.1.
      •  Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
      •  Two online summarizers can be merged, so we can use one summarizer for each block of data in the map phase, and merge all of them in the reduce phase to obtain the global summarizer.
      •  A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
      •  Optimized for sparse vectors: the time complexity is O(non-zeros) instead of O(numCols) for each sample.
      •  Formulas shown on the slide: two-pass algorithm, naive algorithm.
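
A mergeable one-pass summarizer of the kind slide 24 describes can be sketched for a single column as below. This follows the Welford update and the parallel merge formula from the referenced Wikipedia article; the class name and fields are assumptions for illustration and do not mirror MLlib's actual class, which works on vectors and exploits sparsity.

```python
class OnlineSummarizer:
    """Streaming mean, variance, min and max for one column of numbers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, x):
        # Welford's update: no large sums of squares, so no cancellation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        # Combine two partial summaries, as a reduce phase would.
        if other.n == 0:
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        """Unbiased sample variance."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Large offset plus small spread: the case where the naive formula breaks down.
block_a = [1e9 + 4.0, 1e9 + 7.0]
block_b = [1e9 + 13.0, 1e9 + 16.0]

summary_a = OnlineSummarizer()
for value in block_a:
    summary_a.add(value)

summary_b = OnlineSummarizer()
for value in block_b:
    summary_b.add(value)

merged = summary_a.merge(summary_b)
```

Each block is summarized independently (the map phase) and the partial results are then combined (the reduce phase), so the data is read only once.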
  • 27. SPARK-2479: MLlib UnitTests
      •  Merged in Spark 1.1.
      •  Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.
      •  Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.
      •  That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.
      •  Scala syntactic-sugar comparators are implemented using implicit conversions, making unit tests easier for developers to write.
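
The floating-point issue from slide 27 is language independent, so it can be shown in a few lines of Python. The helper `approx_equal` and its tolerance values are assumptions for illustration; the Spark change itself provides Scala comparators, which are not reproduced here.

```python
import math


def approx_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Tolerance-based comparison, the kind a numeric unit test should use."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


total = 0.1 + 0.2

exactly_equal = (total == 0.3)          # exact comparison is fragile
close_enough = approx_equal(total, 0.3)  # tolerance absorbs the rounding error
```

A relative tolerance handles large magnitudes, while the absolute tolerance covers comparisons against values near zero.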
  • 29. SPARK-1892: OWL-QN Optimizer (ongoing work)
      •  It extends L-BFGS to handle L2 and L1 regularization together (balanced with alpha, as in elastic nets).
      •  We fixed a couple of issues (#247) in Breeze's OWLQN implementation, and this work is based on that.
      •  Blocked by SPARK-2505.
  • 30. SPARK-2505: Weighted Regularization (ongoing work)
      •  Each component of the weights can be penalized differently.
      •  We can exclude the intercept from regularization in this framework.
      •  Decouples regularization from the raw gradient update, which is not used in other optimization schemes.
      •  Allows various update/learning rate schemes (adagrad, normalized adaptive gradient, etc.) to be applied independently of the regularization.
      •  Smooth and L1 regularization will be handled differently in the optimizer.
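
One way to read slides 29 and 30 together is as an elastic-net penalty with a per-component factor, so that the intercept can be given factor 0. The sketch below assumes the glmnet-style form `reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)`; the function name, argument names, and that exact parameterization are assumptions, not the API proposed in SPARK-2505.

```python
def weighted_elastic_net_penalty(weights, reg_param, alpha, penalty_factors):
    """Elastic-net penalty where each weight has its own multiplier.

    alpha = 1.0 gives pure L1, alpha = 0.0 gives pure L2 (assumed convention).
    A penalty factor of 0.0 leaves that component unregularized.
    """
    l1 = sum(f * abs(w) for w, f in zip(weights, penalty_factors))
    l2 = sum(f * w * w for w, f in zip(weights, penalty_factors))
    return reg_param * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


# First component is the intercept, excluded via a zero penalty factor.
weights = [2.0, 1.0, -3.0]
penalty_factors = [0.0, 1.0, 1.0]

penalty = weighted_elastic_net_penalty(weights, reg_param=0.1, alpha=0.5,
                                       penalty_factors=penalty_factors)
```

Keeping the penalty in its own function, separate from the loss gradient, mirrors the decoupling the slide argues for.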
  • 31. SPARK-2309: Multinomial Logistic Regression (ongoing work)
      •  A multinomial problem with K classes can be modeled with K - 1 linear models with logistic link functions.
      •  As a result, the weights will have dimension (K - 1)(N + 1), where N is the number of features.
      •  The MLlib interface is designed for one set of parameters per model, so this requires some interface design changes.
      •  Expected to be merged in the next release of MLlib, Spark 1.2.
      Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
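
The K - 1 parameterization on slide 31 can be sketched as follows: one class acts as the pivot with a fixed margin of zero, and each remaining class has its own intercept plus N weights. The function name and the choice of class 0 as pivot are assumptions for illustration, not the MLlib interface.

```python
import math


def multinomial_probabilities(weights, x):
    """Class probabilities from K - 1 linear models.

    weights: K - 1 rows, each [intercept, w_1, ..., w_N].
    Class 0 is the pivot class and always has margin 0 (assumed convention).
    """
    margins = [0.0] + [row[0] + sum(w * xi for w, xi in zip(row[1:], x))
                       for row in weights]
    top = max(margins)  # subtract the max before exponentiating for stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# K = 3 classes, N = 2 features: (K - 1) * (N + 1) = 6 parameters in total.
weights = [
    [0.5, 1.0, -2.0],   # class 1 vs. pivot
    [-1.0, 0.3, 0.8],   # class 2 vs. pivot
]
probabilities = multinomial_probabilities(weights, [1.5, -0.5])
parameter_count = sum(len(row) for row in weights)
```

With all weights set to zero every margin is zero, so each of the K classes gets probability 1/K.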
  • 32. We're open source friendly and tech driven!
      •  Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop, JavaScript, D3.js, etc.
      •  Actively involved in the open source community: almost all of our newly developed algorithms in Spark will be contributed back to MLlib.
      •  Actively developing the communication infrastructure between our application and Spark on YARN (application lifecycle, error reporting, progress monitoring, interactive commands, etc.).
      •  In addition to Spark, we maintain several open source projects, including Chorus, an SBT plugin for JUnit test listeners, and an Akka-based R engine.
      •  Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of Martin Odersky), H.Y. Li (author of Tachyon), Jason Lee (student of Trevor Hastie), etc.
      •  Organizes the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng (Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera), etc.
  • 33. We're hiring!
      •  Machine Learning Engineer
      •  Data Scientist
      •  UI/UX Engineer
      •  Platform Engineer
      •  Automation Test Engineer
      Shoot me an email at dbtsai@alpinenow.com
  • 34. For more information, contact us
      1550 Bryant Street, Suite 1000, San Francisco, CA 94103 USA
      +1 (877) 542-0062
      www.alpinenow.com
      Get Started Today! http://start.alpinenow.com