Alpine has innovated continuously since its founding, when the company introduced in-database analytics that went far beyond traditional in-memory, code-based desktop applications. This initial innovation built on the work of the MADlib team at Greenplum/Pivotal, which was in turn inspired by the work of Joe Hellerstein’s team at UC Berkeley. The team then made all of this functionality available in a simple web interface, which enabled enterprise collaboration and a team-based approach to analytics. Later, Alpine released its first support for Hadoop, enabling complex analytics on Hadoop without any coding by handling the complexity of MapReduce and Hadoop configuration. Most recently, Alpine has been building new capabilities on top of Spark to offer Hadoop users a new level of performance and scale.
Alpine innovation final v1.0
1. Learn more about Advanced Analytics at http://www.alpinenow.com
Innovation on
DB Tsai
dbtsai@alpinenow.com
Sung Chung
schung@alpinenow.com
Machine Learning Engineering @AlpineDataLabs
August 14, 2014
Alpine Data Labs
• Advanced Analytic Software Company
– Founded in 2011
– Agile Advanced Analytics, Collaboration and Management at Enterprise
Scale
– Partnerships with EMC, Pivotal, MapR, Cloudera, QlikView and Tableau
• 50+ employees, based in San Francisco
– Machine Learning, Statistics and Big Data (Stanford, Berkeley, MIT)
• Growing in excess of 200% YOY with a broad international
customer base
– Financial Services, Online Media, Government, Retail, Manufacturing…
Advanced Analytics on Big Data
Alpine Data Labs. Confidential and Proprietary.
Timeframe of Relevance
Work independently and re-use data
scientist work. Collaborate across
functions and teams. Iterate quickly.
Scalable Business Analytics
Allowing the Enterprise to manage
“Data as an Asset.”
Scale and guard data practices
Data Science Productivity
Work faster, safer, in a more open
manner. Industry leading machine
learning algorithms built natively for
parallel processing.
ALPINE CHORUS 4.0
ENTERPRISE DATA ENVIRONMENT
Data Scientist
Database Analyst
Data Engineer
Business Analyst
Campaign Manager
Sales
Division
Customer
Success
Product Manager
TRADITIONAL
DESKTOP
IN-DATABASE
METHODS
WEB-BASED AND
COLLABORATIVE
SIMPLIFIED CODE-FREE
HADOOP & MPP DATABASE
ONGOING INNOVATION
The Path to Innovation
The Path to Innovation
Iterative algorithms
scan through the
data each time
With Spark, data is
cached in memory
after first iteration
Quasi-Newton methods
enhance in-memory
benefits
[Chart: run times of 921s and 97s on 150m rows]
Machine Learning in the Big Data Era
• Hadoop Map Reduce solutions
• MapReduce scales well for batch processing
• Lots of machine learning algorithms are iterative by nature
• There are lots of tricks people use, like training on subsamples of the
data and then averaging the models. Why have big data if you’re only
approximating?
Lightning-fast cluster
computing
• Empower users to iterate
through the data by
utilizing the in-memory
cache.
• Logistic regression runs up
to 100x faster than Hadoop
M/R in memory.
• We’re able to train exact
models without doing any
approximation.
Why does Alpine support MLlib?
• MLlib is a Spark subproject providing Machine Learning
primitives.
• It’s built on Apache Spark, a fast and general engine for large-
scale data processing.
• Shipped with Apache Spark since version 0.8
• High quality engineering design and effort
• More than 50 contributors since July 2014
• Alpine is 100% committed to open source to facilitate industry
adoption that is driven by business needs.
AutoML
• Success of machine learning crucially relies on human
machine learning experts, who select appropriate features,
workflows, paradigms, algorithms, and their hyper-
parameters.
• Even though hyper-parameters can be chosen by grid search
with cross-validation, a problem with more than two
parameters becomes very difficult. It’s a non-convex
optimization problem.
• There is a demand for off-the-shelf machine learning methods
that can be used easily and without expert knowledge.
- AutoML workshop @ ICML’14
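To make the grid-search cost concrete, here is a minimal Python sketch. The parameter names, the grid values, and the fold count are hypothetical and chosen only for illustration; the point is that the number of model fits is the product of the grid sizes times the number of folds.

```python
from itertools import product

# Hypothetical hyper-parameter grid; names and values are illustrative only.
grid = {
    "num_trees": [10, 50, 100, 500],
    "max_depth": [4, 8, 16],
    "min_leaf_size": [1, 5, 10, 50],
}
num_folds = 5  # assumed number of cross-validation folds

# Every combination of values is one candidate configuration.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each candidate is trained once per cross-validation fold.
total_fits = len(candidates) * num_folds
print(len(candidates), total_fits)
```

Adding a fourth parameter with four values would multiply both numbers by four, which is why exhaustive search stops being practical so quickly.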
Random Forest
• An ensemble learning method for classification &
regression that operates by constructing a multitude of
decision trees at training time.
• A “black box” that needs little tuning and can
automatically identify the structure, interactions, and
relationships in the data.
• A technique to reduce the variance of single decision
tree predictions by averaging the predictions of many de-
correlated trees.
• De-correlation is achieved through Bagging and / or
randomly selecting features per tree node.
NOTE: Most Kaggle competitions have at least one top
entry that heavily uses Random Forests.
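The two de-correlation mechanisms and the final averaging step can be sketched in a few lines of Python. This is an illustration of the general technique, not the Sequoia Forest or MLlib code; all function names are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement for one tree."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(num_features, k, rng):
    """Pick k distinct candidate features to consider at one tree node."""
    return sorted(rng.sample(range(num_features), k))

def forest_predict(tree_predictions):
    """Regression forest output: the average of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)          # some rows repeat, some are left out
features = random_feature_subset(9, 3, rng)   # 3 of 9 features for this node
print(sample, features, forest_predict([1.0, 2.0, 3.0]))
```

Because each tree sees a different sample and different candidate features, their errors are less correlated, so averaging removes more variance than it would for identical trees.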
Sequoia Forest
Why Sequoia Forest?
MLlib already has a decision tree implementation, but it doesn’t support random features and is not
optimized to train on large clusters.
What does Sequoia Forest do?
• Classification and Regression.
• Numerical and Categorical Features.
What’s next?
Gradient Boosting
Where can you find it?
https://github.com/AlpineNow/SparkML2
We’re merging it back into MLlib, and it is licensed under the Apache License.
More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees.
SPARK-1157: L-BFGS Optimizer
• No, it’s not a blender!
What is SPARK-1157: L-BFGS Optimizer?
• Merged in Spark 1.0
• A popular algorithm for parameter estimation in Machine
Learning.
• It’s a quasi-Newton Method.
• Hessian matrix of second derivatives doesn't need to be
evaluated directly.
• Hessian matrix is approximated using gradient evaluations.
• It converges much faster than the default optimizer in Spark,
Gradient Descent.
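How the Hessian is approximated from gradient evaluations can be seen in a small pure-Python sketch of L-BFGS. This is not the Spark or Breeze implementation: the function names, the backtracking line search, and the default parameters are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs(f, grad, x0, history=10, max_iter=100, tol=1e-8):
    """Minimize f from x0 using only gradients (no explicit Hessian)."""
    x, g = list(x0), grad(x0)
    steps, grad_diffs = [], []  # recent position changes and gradient changes
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        # Two-loop recursion: multiply g by the implicit inverse-Hessian estimate.
        q, alphas = list(g), []
        for s, y in zip(reversed(steps), reversed(grad_diffs)):
            rho = 1.0 / dot(y, s)
            a = rho * dot(s, q)
            alphas.append((a, rho))
            q = [qi - a * yi for qi, yi in zip(q, y)]
        gamma = 1.0
        if steps:
            gamma = dot(steps[-1], grad_diffs[-1]) / dot(grad_diffs[-1], grad_diffs[-1])
        r = [gamma * qi for qi in q]
        for s, y, (a, rho) in zip(steps, grad_diffs, reversed(alphas)):
            b = rho * dot(y, r)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]  # quasi-Newton descent direction
        # Simple backtracking line search (Armijo condition).
        step, fx, slope = 1.0, f(x), dot(g, d)
        x_new = [xi + step * di for xi, di in zip(x, d)]
        while f(x_new) > fx + 1e-4 * step * slope and step > 1e-12:
            step *= 0.5
            x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            steps.append(s)
            grad_diffs.append(y)
            if len(steps) > history:
                steps.pop(0)
                grad_diffs.pop(0)
        x, g = x_new, g_new
    return x

# A poorly scaled quadratic with its minimum at (3, -1).
f = lambda v: (v[0] - 3.0) ** 2 + 10.0 * (v[1] + 1.0) ** 2
grad = lambda v: [2.0 * (v[0] - 3.0), 20.0 * (v[1] + 1.0)]
solution = lbfgs(f, grad, [0.0, 0.0])
print(solution)
```

Only the last `history` pairs of steps and gradient differences are stored, so memory stays linear in the number of parameters.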
SPARK-2934:
LogisticRegressionWithLBFGS
• Merged in Spark 1.1
• Using L-BFGS to train Logistic Regression instead of
default Gradient Descent.
• Users don't have to construct their own objective function for
Logistic Regression or implement all of the details
themselves.
• Together with SPARK-2979, which improves the condition
number, the convergence rate is further improved.
SPARK-2979: Improve the convergence rate by
standardizing the training features
Merged in Spark 1.1
Due to the invariance property of MLEs, the scale of your inputs is
irrelevant.
However, the optimizer will not be happy with poor condition numbers,
which can often be improved by scaling.
The model is trained in the scaled space, but the coefficients are
converted back to the original space; as a result, it's transparent to users.
Without this, training on datasets that mix columns of very different
scales may fail to converge.
The Scikit and glmnet packages also standardize the features before training to
improve the convergence.
Only enabled for Logistic Regression for now.
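The conversion between the scaled space and the original space can be sketched as follows. This assumes features are divided by their standard deviation without centering, so the intercept carries over unchanged; that assumption, the helper names, and the pretend optimizer output are all illustrative and not taken from the Spark source.

```python
import math

def column_std(rows):
    """Sample standard deviation of each feature column."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]

def to_original_space(scaled_weights, stds):
    """Map weights learned on x / std back to weights that apply to raw x."""
    return [w / s if s != 0.0 else 0.0 for w, s in zip(scaled_weights, stds)]

def margin(weights, intercept, x):
    return sum(w * v for w, v in zip(weights, x)) + intercept

rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]  # columns on very different scales
stds = column_std(rows)
scaled_rows = [[v / s for v, s in zip(row, stds)] for row in rows]

# Pretend these values were returned by an optimizer run on scaled_rows.
scaled_weights, intercept = [0.5, -2.0], 0.1
weights = to_original_space(scaled_weights, stds)

for raw, scaled in zip(rows, scaled_rows):
    print(margin(weights, intercept, raw), margin(scaled_weights, intercept, scaled))
```

Both columns of the printed output agree, which is what makes the scaling invisible to the user: predictions on raw features with the converted weights match predictions in the scaled space.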
SPARK-2272: Transformer
A spark, the soul of a transformer
SPARK-2272: Transformer
Merged in Spark 1.1
MLlib data preprocessing pipeline.
StandardScaler
Standardize features by removing the mean and scaling to unit variance.
The RBF kernel of Support Vector Machines and the L1 and L2 regularizers of linear
models typically work better with zero mean and unit variance.
Normalizer
Normalizes samples individually to unit L^n norm.
A common operation in text classification or clustering.
For example, the dot product of two L2-normalized TF-IDF vectors is the
cosine similarity of the vectors.
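The difference between the two transformers (column-wise versus row-wise) and the cosine-similarity identity can be checked with a short pure-Python sketch. The function names are hypothetical and this is not the MLlib API; the vectors are made-up stand-ins for TF-IDF vectors.

```python
import math

def standardize(rows):
    """Column-wise: subtract the mean and divide by the sample standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

def normalize(vec, n=2.0):
    """Row-wise: scale one sample to unit L^n norm."""
    norm = sum(abs(v) ** n for v in vec) ** (1.0 / n)
    return [v / norm for v in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(dot(normalize(a), normalize(b)), cosine_similarity(a, b))
print(standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))
```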
SPARK-1969: Online summarizer
Merged in Spark 1.1
Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
Two online summarizers can be merged, so we can use one summarizer for each block of
data in the map phase, and merge all of them in the reduce phase to obtain the global
summarizer.
A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation
of the naive implementation.
Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Optimized for sparse vectors: the time complexity is O(non-zeros) instead of
O(numCols) for each sample.
[Figure: formulas for the two-pass algorithm and the naive algorithm]
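A mergeable one-pass summarizer can be sketched in Python for a single numeric column. It uses Welford's update and the pairwise merge formula described in the Wikipedia reference; the class and method names are hypothetical and this is not the MLlib implementation (which also handles vectors and sparsity).

```python
class OnlineSummarizer:
    """Running count, mean, variance, min, and max for a stream of numbers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        """Combine with a summarizer built on another block of data."""
        if other.n == 0:
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Large offset, small spread: the case where the naive formula loses precision.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]

block_a, block_b = OnlineSummarizer(), OnlineSummarizer()
for x in data[:2]:
    block_a.add(x)   # "map" side, first block
for x in data[2:]:
    block_b.add(x)   # "map" side, second block
merged = block_a.merge(block_b)  # "reduce" side

print(merged.n, merged.mean, merged.variance, merged.min, merged.max)
```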
SPARK-2479: MLlib UnitTests
Merged in Spark 1.1
Floating point math is not exact, and most floating-point numbers end up
being slightly imprecise due to rounding errors.
Simple values like 0.1 cannot be precisely represented using binary
floating point numbers, and the limited precision of floating point numbers
means that slight changes in the order of operations or the precision of
intermediates can change the result.
That means that comparing two floats to see if they are equal is usually
not what we want. As long as this imprecision stays small, it can usually be
ignored.
Comparators with Scala syntactic sugar are implemented using implicit
conversions, making unit tests easier for developers to write.
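The same idea can be illustrated in Python. This is an analogy using the standard library plus a hypothetical `approx_equal` helper, not the Scala comparators added to MLlib; the tolerance values are arbitrary.

```python
import math

a = 0.1 + 0.2
b = 0.3

print(a == b)                            # exact comparison of two floats
print(math.isclose(a, b, rel_tol=1e-9))  # comparison within a relative tolerance

def approx_equal(x, y, rel_tol=1e-9, abs_tol=0.0):
    """Hypothetical helper: true when x and y differ by at most the tolerance."""
    return abs(x - y) <= max(rel_tol * max(abs(x), abs(y)), abs_tol)

print(approx_equal(a, b), approx_equal(1.0, 1.1))
```

A relative tolerance suits values far from zero; an absolute tolerance is needed when the expected value is zero or very close to it.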
SPARK-1892: OWL-QN Optimizer
ongoing work
It extends L-BFGS to handle L2 and L1 regularizations
together
(balanced with alpha as in elastic nets)
We fixed a couple of issues (#247) in Breeze's OWLQN
implementation, and this work is based on that.
Blocked by SPARK-2505
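As a reminder of what "balanced with alpha" means, here is a minimal sketch of an elastic-net penalty in Python. It assumes the glmnet-style parameterization (alpha = 1 is pure L1, alpha = 0 is pure L2 with a factor of one half); the function and parameter names are hypothetical, and the exact form used in the Spark work may differ.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)

weights = [1.0, -2.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(weights, 0.1, alpha))
```

The L1 term is not differentiable at zero, which is why a plain L-BFGS cannot handle it and an orthant-wise variant such as OWL-QN is needed.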
SPARK-2505: Weighted Regularization
ongoing work
Each component of the weights can be penalized differently.
We can exclude the intercept from regularization in this framework.
Decouples regularization from the raw gradient update, which is
not used in other optimization schemes.
Allows various update/learning rate schemes (adagrad,
normalized adaptive gradient, etc.) to be applied independently of
the regularization.
Smooth and L1 regularization will be handled differently in the
optimizer.
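A per-component L2 penalty that leaves the intercept unpenalized might look like the sketch below. The names, the convention that the intercept is the last weight, and the numbers are illustrative assumptions and do not reflect the actual SPARK-2505 design.

```python
def weighted_l2(weights, reg_weights):
    """Penalty 0.5 * sum(r_i * w_i^2) and its gradient r_i * w_i."""
    penalty = 0.5 * sum(r * w * w for w, r in zip(weights, reg_weights))
    gradient = [r * w for w, r in zip(weights, reg_weights)]
    return penalty, gradient

def regularized_gradient(loss_gradient, weights, reg_weights):
    """Add the regularization gradient to a loss gradient computed separately."""
    _, reg_gradient = weighted_l2(weights, reg_weights)
    return [g + r for g, r in zip(loss_gradient, reg_gradient)]

weights = [2.0, -1.0, 5.0]     # assumed layout: two feature weights, then the intercept
reg_weights = [0.1, 0.1, 0.0]  # a zero entry excludes the intercept from the penalty

penalty, gradient = weighted_l2(weights, reg_weights)
print(penalty, gradient)
print(regularized_gradient([1.0, 1.0, 1.0], weights, reg_weights))
```

Keeping the penalty in its own function is one way to decouple it from the loss gradient, so the update rule can change without touching the regularization.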
SPARK-2309: Multinomial Logistic Regression
ongoing work
For a K-class multinomial problem, we can generalize it via
K - 1 linear models with logit link functions.
As a result, the weights will have dimension (K-1)(N + 1),
where N is the number of features.
The MLlib interface is designed for one set of parameters per
model, so it requires some interface design changes.
Expected to be merged in next release of MLlib, Spark 1.2
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
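The K - 1 parameterization can be sketched in Python by treating class 0 as a pivot whose margin is fixed at zero. The weight layout (N feature weights followed by an intercept in each row) and the function name are assumptions for illustration, not the MLlib interface.

```python
import math

def multinomial_probabilities(weights, x):
    """weights: K-1 rows of N+1 values (N feature weights, then an intercept).
    Class 0 is the pivot class, whose margin is fixed at 0."""
    margins = [0.0] + [
        sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in weights
    ]
    top = max(margins)  # subtract the max before exponentiating, for stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# K = 3 classes, N = 2 features: (K-1)(N+1) = 6 parameters in total.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
num_parameters = sum(len(row) for row in weights)

print(num_parameters)
print(multinomial_probabilities(weights, [0.0, 0.0]))
print(multinomial_probabilities(weights, [2.0, 0.0]))
```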
Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop,
JavaScript, D3.js, etc.
Actively involved in the open source community: almost all of our newly developed algorithms
in Spark will be contributed back to MLlib.
Actively developing infrastructure for communication between applications and Spark on YARN
(application lifecycle, error reporting, progress monitoring, interactive commands, etc.).
In addition to Spark, we maintain several open source projects, including Chorus,
an SBT plugin for JUnit test Listener, and an Akka-based R engine.
Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of
Martin Odersky), H.Y. Li (author of Tachyon), Jason Lee (student of Trevor Hastie), etc.
We organize the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng
(Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera),
etc.
We’re open source friendly and tech driven!
We're hiring!
Machine Learning Engineer
Data Scientist
UI/UX Engineer
Platform Engineer
Automation Test Engineer
Shoot me an email at
dbtsai@alpinenow.com
For more information, contact us
1550 Bryant Street
Suite 1000
San Francisco, CA 94103
USA
+1 (877) 542-0062
www.alpinenow.com
Get Started Today!
http://start.alpinenow.com