Learn more about Advanced Analytics at http://www.alpinenow.com
Innovation on
DB Tsai
dbtsai@alpinenow.com
Sung Chung
schung@alpinenow.com
Machine Learning Engineering @AlpineDataLabs
August 14, 2014
Alpine Data Labs
• Advanced Analytic Software Company
– Founded in 2011
– Agile Advanced Analytics, Collaboration and Management at Enterprise
Scale
– Partnerships with EMC, Pivotal, MapR, Cloudera, QlikView and Tableau
• 50+ employees, based in San Francisco
– Machine Learning, Statistics and Big Data (Stanford, Berkeley, MIT)
• Growing in excess of 200% YOY with a broad international
customer base
– Financial Services, Online Media, Government, Retail, Manufacturing…
Advanced Analytics on Big Data
Alpine Data Labs. Confidential and Proprietary.
Timeframe of Relevance
Work independently and re-use data
scientist work. Collaborate across
functions and teams. Iterate quickly.
Scalable Business Analytics
Allowing the Enterprise to manage
“Data as an Asset.”
Scale and guard data practices
Data Science Productivity
Work faster, safer, in a more open
manner. Industry leading machine
learning algorithms built natively for
parallel processing.
ALPINE CHORUS 4.0
ENTERPRISE DATA ENVIRONMENT
Data Scientist
Database Analyst
Data Engineer
Business Analyst
Campaign Manager
Sales
Division
Customer
Success
Product Manager
The Path to Innovation
[Diagram stages: TRADITIONAL DESKTOP; IN-DATABASE METHODS; WEB-BASED AND COLLABORATIVE; SIMPLIFIED CODE-FREE; HADOOP & MPP DATABASE; ONGOING INNOVATION]
The Path to Innovation
• Iterative algorithms scan through the data each time.
• With Spark, data is cached in memory after the first iteration.
• Quasi-Newton methods enhance in-memory benefits.
[Figure labels: 921s, 97s, 150M rows]
Machine Learning in the Big Data Era
• Hadoop MapReduce solutions
• MapReduce scales well for batch processing
• Lots of machine learning algorithms are iterative by nature
• A common workaround is to train on subsamples of the data and then
average the models. But why have big data if you’re only
approximating?
Lightning-fast cluster
computing
• Empower users to iterate
through the data by
utilizing the in-memory
cache.
• Logistic regression runs up
to 100x faster than Hadoop
M/R in memory.
• We’re able to train exact
models without doing any
approximation.
Why does Alpine support MLlib?
• MLlib is a Spark subproject providing Machine Learning
primitives.
• It’s built on Apache Spark, a fast and general engine for large-
scale data processing.
• Shipped with Apache Spark since version 0.8
• High quality engineering design and effort
• More than 50 contributors as of July 2014
• Alpine is 100% committed to open source to facilitate industry
adoption that is driven by business needs.
AutoML
• Success of machine learning crucially relies on human
machine learning experts, who select appropriate features,
workflows, paradigms, algorithms, and their hyper-
parameters.
• Even though hyper-parameters can be chosen by grid search
with cross-validation, a problem with more than two
parameters becomes very difficult. It’s a non-convex
optimization problem.
• There is a demand for off-the-shelf machine learning methods
that can be used easily and without expert knowledge.
- AutoML workshop @ ICML’14
Random Forest
• An ensemble learning method for classification &
regression that operates by constructing a multitude of
decision trees at training time.
• A “black box” that needs little tuning and can
automatically identify the structure, interactions, and
relationships in the data.
• A technique to reduce the variance of single decision
tree predictions by averaging the predictions of many
de-correlated trees.
• De-correlation is achieved through bagging and/or
randomly selecting features per tree node.
NOTE: Most Kaggle competitions have at least one top
entry that heavily uses Random Forests.
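The two de-correlation mechanisms named above can be sketched in a few lines. This is a minimal illustration, not the Sequoia Forest or MLlib implementation; the function names and the fixed seed are assumptions made for the example.

```python
import random

def bootstrap_indices(num_rows, rng):
    """Bagging: sample row indices with replacement, one draw per row."""
    return [rng.randrange(num_rows) for _ in range(num_rows)]

def random_feature_subset(num_features, subset_size, rng):
    """Pick the random subset of feature indices considered at one tree node."""
    return sorted(rng.sample(range(num_features), subset_size))

def average_predictions(per_tree_predictions):
    """Regression forest output for one sample: the mean of the tree outputs."""
    return sum(per_tree_predictions) / len(per_tree_predictions)

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
rows = bootstrap_indices(10, rng)
features = random_feature_subset(num_features=9, subset_size=3, rng=rng)
forest_output = average_predictions([1.0, 2.0, 3.0])
```

Each tree would be trained on its own `rows` sample, and each node split would search only the `features` drawn for that node.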
Sequoia Forest
Why Sequoia Forest?
MLlib already has a decision tree implementation, but it doesn’t support random features and is not
optimized to train on large clusters.
What does Sequoia Forest do?
• Classification and Regression.
• Numerical and Categorical Features.
What’s next?
Gradient Boosting
Where can you find it?
https://github.com/AlpineNow/SparkML2
We’re merging it back into MLlib; it is licensed under the Apache License.
More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees.
SPARK-1157: L-BFGS Optimizer
• No, it's not a blender!
What is SPARK-1157: L-BFGS Optimizer?
• Merged in Spark 1.0
• A popular algorithm for parameter estimation in Machine
Learning.
• It’s a quasi-Newton method.
• The Hessian matrix of second derivatives doesn't need to be
evaluated directly.
• The Hessian matrix is approximated using gradient evaluations.
• It converges much faster than the default optimizer in Spark,
Gradient Descent.
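To make "the Hessian is approximated using gradient evaluations" concrete, here is a minimal pure-Python sketch of L-BFGS with the standard two-loop recursion and a simple backtracking line search. It is not the Breeze or MLlib code; the function names, the memory size, the tolerances, and the test quadratic are all assumptions chosen for illustration.

```python
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, history):
    """Two-loop recursion: approximate -(inverse Hessian) * grad.

    history holds pairs (s, y), oldest first, with
    s = x_new - x_old and y = grad_new - grad_old.
    """
    q = list(grad)
    alphas = []
    for s, y in reversed(history):  # newest pair first
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        alphas.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if history:  # scale by an estimate of the inverse curvature
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), (alpha, rho) in zip(history, reversed(alphas)):  # oldest first
        beta = rho * dot(y, q)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

def minimize(f, grad_f, x0, memory=5, max_iter=100, tol=1e-10):
    x, g = list(x0), grad_f(x0)
    history = deque(maxlen=memory)  # only the last `memory` pairs are kept
    for _ in range(max_iter):
        if dot(g, g) < tol:
            break
        d = lbfgs_direction(g, history)
        step, fx, slope = 1.0, f(x), dot(g, d)
        # Backtracking line search with the Armijo sufficient-decrease test.
        while step > 1e-12 and f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            history.append((s, y))
        x, g = x_new, g_new
    return x

# Hypothetical test problem: a poorly scaled convex quadratic, minimum at (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]

solution = minimize(f, grad_f, [0.0, 0.0])
```

Only gradients and a short history of differences are stored, so memory grows with the number of features rather than with its square.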
SPARK-2934: LogisticRegressionWithLBFGS
• Merged in Spark 1.1
• Uses L-BFGS to train Logistic Regression instead of the
default Gradient Descent.
• Users don't have to construct their own objective function for
Logistic Regression or implement all of the details
themselves.
• Together with SPARK-2979 to minimize the condition
number, the convergence rate is further improved.
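For reference, the kind of objective function that users no longer have to write by hand looks roughly like the following. This is a sketch of the mean negative log-likelihood for binary logistic regression and its gradient, assuming labels in {0, 1} and no intercept or regularization; it is not the MLlib source.

```python
import math

def logistic_loss_and_gradient(weights, features, labels):
    """Return (mean negative log-likelihood, gradient) for labels in {0, 1}."""
    n = len(labels)
    loss = 0.0
    grad = [0.0] * len(weights)
    for x, y in zip(features, labels):
        margin = sum(w * xi for w, xi in zip(weights, x))
        # log(1 + exp(-margin)), computed without overflow for large |margin|
        if margin > 0:
            log1p_exp = math.log1p(math.exp(-margin))
        else:
            log1p_exp = -margin + math.log1p(math.exp(margin))
        # -log p(y | x) = log(1 + exp(-margin)) + (1 - y) * margin
        loss += log1p_exp + (1 - y) * margin
        prob = math.exp(-log1p_exp)  # sigmoid(margin)
        for j, xi in enumerate(x):
            grad[j] += (prob - y) * xi
    return loss / n, [g / n for g in grad]

loss, grad = logistic_loss_and_gradient(
    [0.0, 0.0], [[1.0, 2.0], [1.0, -1.0]], [1, 0]
)
```

An optimizer such as L-BFGS only needs this pair of values at each iterate.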
a9a Dataset Benchmark
[Figure: Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Seconds. Y-axis: Log-Likelihood / Number of Samples. Series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Sparse Features, GD Dense Features.]
a9a Dataset Benchmark
[Figure: Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Iterations. Y-axis: Log-Likelihood / Number of Samples. Series: L-BFGS, GD.]
rcv1 Dataset Benchmark
[Figure: Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Seconds. Y-axis: Log-Likelihood / Number of Samples. Series: LBFGS Sparse Vector, GD Sparse Vector.]
news20 Dataset Benchmark
[Figure: Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements); 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster. X-axis: Seconds. Y-axis: Log-Likelihood / Number of Samples. Series: LBFGS Sparse Vector, GD Sparse Vector.]
SPARK-2979: Improve the convergence rate by
standardizing the training features
• Merged in Spark 1.1
• Due to the invariance property of MLEs, the scale of your inputs is
irrelevant.
• However, the optimizer will not be happy with poor condition numbers,
which can often be improved by scaling.
• The model is trained in the scaled space, but the coefficients are
converted back to the original space; as a result, it's transparent to users.
• Without this, some training datasets that mix columns with different
scales may not be able to converge.
• scikit-learn and the glmnet package also standardize the features before
training to improve the convergence.
• Only enabled in Logistic Regression for now.
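The "train in scaled space, report in original space" step can be sketched as follows. This assumes each column is divided by its sample standard deviation without centering, so that the intercept is unaffected; whether MLlib does exactly this is an assumption here, and the helper names are hypothetical. Constant columns (zero standard deviation) are not handled.

```python
import math

def column_std(rows):
    """Sample standard deviation of each column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]

def to_scaled_space(rows, std):
    """Divide every feature by its column standard deviation."""
    return [[x / s for x, s in zip(r, std)] for r in rows]

def to_original_space(scaled_weights, std):
    """w_j = w_scaled_j / std_j, so that w . x == w_scaled . (x / std)."""
    return [w / s for w, s in zip(scaled_weights, std)]

# Two columns on very different scales (made-up numbers).
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
std = column_std(rows)
scaled_rows = to_scaled_space(rows, std)

scaled_weights = [0.5, -2.0]  # pretend these came out of the optimizer
weights = to_original_space(scaled_weights, std)

margin_scaled = sum(w * x for w, x in zip(scaled_weights, scaled_rows[0]))
margin_original = sum(w * x for w, x in zip(weights, rows[0]))
```

Both margins agree, which is why the user sees the same predictions as if no scaling had taken place.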
SPARK-2272: Transformer
A spark, the soul of a transformer
SPARK-2272: Transformer
• Merged in Spark 1.1
• MLlib data preprocessing pipeline.
• StandardScaler
– Standardizes features by removing the mean and scaling to unit variance.
– The RBF kernel of Support Vector Machines and the L1 and L2 regularizers of linear
models typically work better with zero mean and unit variance.
• Normalizer
– Normalizes samples individually to unit L^n norm.
– A common operation for text classification or clustering, for instance.
– For example, the dot product of two l2-normalized TF-IDF vectors is the
cosine similarity of the vectors.
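The two transformations differ in direction: standardization works per column, normalization per sample. A minimal pure-Python sketch follows; it only illustrates the arithmetic and is not the MLlib API, and the function names are assumptions.

```python
import math

def standardize(rows):
    """Per column: remove the mean and scale to unit (sample) variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def normalize(vector, p=2.0):
    """Per sample: scale the vector to unit L^p norm."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / norm for v in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

standardized = standardize([[1.0], [2.0], [3.0]])
unit = normalize([3.0, 4.0])
# Dot product of two l2-normalized vectors equals their cosine similarity.
cosine = dot(normalize([1.0, 0.0]), normalize([1.0, 1.0]))
```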
SPARK-1969: Online summarizer
• Merged in Spark 1.1
• Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
• Two online summarizers can be merged, so we can use one summarizer for each block of
data in the map phase, and merge all of them in the reduce phase to obtain the global
summarizer.
• A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation
of the naive implementation.
Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
• Optimized for sparse vectors; the time complexity is O(non-zeros) instead of
O(numCols) for each sample.
[Figure: two-pass algorithm vs. naive algorithm]
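A mergeable one-pass summarizer can be sketched with Welford's update plus the pairwise merge formula described in the Wikipedia reference above. This version tracks a single column of dense values; the class and attribute names are assumptions and it is not the MLlib implementation.

```python
class OnlineSummarizer:
    """One-pass mean, variance, min, and max for a stream of numbers."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        """Fold another summarizer (e.g. from another data block) into this one."""
        if other.count == 0:
            return self
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

# "Map phase": one summarizer per block; "reduce phase": merge them.
block_a = OnlineSummarizer()
for value in [1e9 + 4, 1e9 + 7]:
    block_a.add(value)
block_b = OnlineSummarizer()
for value in [1e9 + 13, 1e9 + 16]:
    block_b.add(value)
summary = block_a.merge(block_b)
```

The large offset in the sample values is the kind of input where the naive sum-of-squares formula loses precision, while the update above does not.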
SPARK-2479: MLlib UnitTests
• Merged in Spark 1.1
• Floating point math is not exact, and most floating-point numbers end up
being slightly imprecise due to rounding errors.
• Simple values like 0.1 cannot be precisely represented using binary
floating point numbers, and the limited precision of floating point numbers
means that slight changes in the order of operations or the precision of
intermediates can change the result.
• That means that comparing two floats to see if they are equal is usually
not what we want. As long as this imprecision stays small, it can usually be
ignored.
• Scala syntactic-sugar comparators are implemented using implicit
conversions, making unit tests easier for developers to write.
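The problem these comparators address is easy to reproduce. The sketch below uses Python's standard `math.isclose` as a stand-in for the Scala comparators; the tolerance values are arbitrary choices for the example.

```python
import math

a = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary form.
exactly_equal = (a == 0.3)

# Comparison within a relative or an absolute tolerance succeeds.
close_relative = math.isclose(a, 0.3, rel_tol=1e-9)
close_absolute = math.isclose(a, 0.3, rel_tol=0.0, abs_tol=1e-12)

# Changing the order of operations changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = (left != right)
```

In a unit test, the tolerance-based checks are the ones to assert on.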
SPARK-1892: OWL-QN Optimizer
ongoing work
• It extends L-BFGS to handle L2 and L1 regularizations
together
(balanced with alpha as in elastic nets)
• We fixed a couple of issues (#247) in Breeze's OWLQN
implementation, and this work is based on that.
• Blocked by SPARK-2505
SPARK-2505: Weighted Regularization
ongoing work
• Each component of the weights can be penalized differently.
• We can exclude the intercept from regularization in this framework.
• Decouples regularization from the raw gradient update, which is
not used in other optimization schemes.
• Allows various update/learning rate schemes (adagrad,
normalized adaptive gradient, etc.) to be applied independently of
the regularization.
• Smooth and L1 regularization will be handled differently in the
optimizer.
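One way to picture per-component penalties is a vector of penalty factors, one per weight, with a factor of zero for the intercept. The sketch below covers only the smooth (L2) case; the function name, the parameter names, and the sample numbers are assumptions, not the SPARK-2505 design.

```python
def weighted_l2_penalty(weights, penalty_factors, reg_param):
    """Return (penalty, gradient) for 0.5 * reg_param * sum_j c_j * w_j^2."""
    penalty = 0.5 * reg_param * sum(
        c * w * w for w, c in zip(weights, penalty_factors)
    )
    gradient = [reg_param * c * w for w, c in zip(weights, penalty_factors)]
    return penalty, gradient

# First component is the intercept: factor 0.0 excludes it from the penalty.
weights = [4.0, 1.0, -2.0]
penalty_factors = [0.0, 1.0, 0.5]
penalty, reg_gradient = weighted_l2_penalty(weights, penalty_factors, reg_param=0.1)

# The regularization gradient is simply added to the loss gradient, so the
# update rule itself stays independent of the regularization.
loss_gradient = [0.3, -0.2, 0.1]  # made-up values for illustration
total_gradient = [lg + rg for lg, rg in zip(loss_gradient, reg_gradient)]
```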
SPARK-2309: Multinomial Logistic Regression
ongoing work
• For a K-class multinomial problem, we can generalize it via
K-1 linear models with logit link functions.
• As a result, the weights will have dimension (K-1)(N+1),
where N is the number of features.
• The MLlib interface is designed for one set of parameters per
model, so it requires some interface design changes.
• Expected to be merged in the next release of MLlib, Spark 1.2
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
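The K-1 model formulation can be sketched as below: one class acts as the pivot with a margin fixed at zero, and each of the other K-1 classes has N feature weights plus an intercept. Treating class 0 as the pivot and storing the intercept last in each row are assumptions made for this example.

```python
import math

def multinomial_probabilities(weight_matrix, features):
    """Class probabilities from K-1 linear models; class 0 is the pivot.

    weight_matrix has K-1 rows of N+1 entries: N feature weights, then
    the intercept.
    """
    margins = [
        sum(w * x for w, x in zip(row[:-1], features)) + row[-1]
        for row in weight_matrix
    ]
    # Subtract the largest margin (the pivot's is 0) to avoid overflow.
    max_margin = max([0.0] + margins)
    pivot = math.exp(-max_margin)
    exps = [math.exp(m - max_margin) for m in margins]
    denom = pivot + sum(exps)
    return [pivot / denom] + [e / denom for e in exps]

# K = 3 classes, N = 2 features: (K-1) * (N+1) = 6 parameters in total.
zero_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
uniform = multinomial_probabilities(zero_weights, [1.0, 2.0])

weights = [[1.0, 0.0, 0.0], [0.0, 0.0, math.log(2.0)]]
probs = multinomial_probabilities(weights, [0.0, 5.0])
num_parameters = sum(len(row) for row in weights)
```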
We’re open source friendly and tech driven!
• Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop,
JavaScript, D3.js, etc.
• Actively involved in the open source community: almost all of our newly developed algorithms
in Spark will be contributed back to MLlib.
• Actively developing communication infrastructure between applications and Spark on YARN
(application lifecycle, error reporting, progress monitoring, interactive commands, etc.)
• In addition to Spark, we are the maintainers of several open source projects, including Chorus,
an SBT plugin for JUnit test Listener, and an Akka-based R engine.
• Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of
Martin Odersky), H.Y. Li (author of Tachyon), Jason Lee (student of Trevor Hastie), etc.
• Organizes the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng
(Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera),
etc.
We're hiring!
• Machine Learning Engineer
• Data Scientist
• UI/UX Engineer
• Platform Engineer
• Automation Test Engineer
Shoot me an email at
dbtsai@alpinenow.com
For more information, contact us
1550 Bryant Street
Suite 1000
San Francisco, CA 94103
USA
+1 (877) 542-0062
www.alpinenow.com
Get Started Today!
http://start.alpinenow.com

More Related Content

What's hot

MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkJen Aman
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleDatabricks
 
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Spark Summit
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Databricks
 
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInMagnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInDatabricks
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowDatabricks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Databricks
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easyDataWorks Summit
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...Databricks
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsDatabricks
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathSpark Summit
 

What's hot (20)

Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using Spark
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
 
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
 
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInMagnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easy
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on Embeddings
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 

Viewers also liked

Analysis of the corporate reputation of the company
Analysis of the corporate reputation of the companyAnalysis of the corporate reputation of the company
Analysis of the corporate reputation of the companyGeorgeDolezal
 
Jencap company presentation
Jencap company presentationJencap company presentation
Jencap company presentationNoel Donovan
 
Strata Big Data Camp 2013
Strata Big Data Camp 2013Strata Big Data Camp 2013
Strata Big Data Camp 2013alpinedatalabs
 
(775180194) 1 s 2015 química segundaevaluacion version cero nutricion
(775180194) 1 s 2015 química segundaevaluacion version cero nutricion(775180194) 1 s 2015 química segundaevaluacion version cero nutricion
(775180194) 1 s 2015 química segundaevaluacion version cero nutricionDanny Riofrio Cornel
 
Don't Gamble With Your Data
Don't Gamble With Your DataDon't Gamble With Your Data
Don't Gamble With Your Dataalpinedatalabs
 
Predictive analytics from a to z
Predictive analytics from a to zPredictive analytics from a to z
Predictive analytics from a to zalpinedatalabs
 
Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
Integrating R and the JVM Platform - Alpine Data Labs' R Execute OperatorIntegrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operatoralpinedatalabs
 
Steven Hillion Presents, "Why Women are Better Data Scientists."
Steven Hillion Presents, "Why Women are Better Data Scientists."Steven Hillion Presents, "Why Women are Better Data Scientists."
Steven Hillion Presents, "Why Women are Better Data Scientists."alpinedatalabs
 
Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technicalalpinedatalabs
 
Analysis of external communication media of the company
Analysis of external communication media of the companyAnalysis of external communication media of the company
Analysis of external communication media of the companyGeorgeDolezal
 
Negative impacts of social media as my space and facebook on teenagers in th...
Negative impacts of social media as my space and facebook on teenagers  in th...Negative impacts of social media as my space and facebook on teenagers  in th...
Negative impacts of social media as my space and facebook on teenagers in th...GeorgeDolezal
 

Viewers also liked (14)

My world
My worldMy world
My world
 
Analysis of the corporate reputation of the company
Analysis of the corporate reputation of the companyAnalysis of the corporate reputation of the company
Analysis of the corporate reputation of the company
 
Jencap company presentation
Jencap company presentationJencap company presentation
Jencap company presentation
 
Marriott cuantitativas
Marriott cuantitativasMarriott cuantitativas
Marriott cuantitativas
 
Strata Big Data Camp 2013
Strata Big Data Camp 2013Strata Big Data Camp 2013
Strata Big Data Camp 2013
 
(775180194) 1 s 2015 química segundaevaluacion version cero nutricion
(775180194) 1 s 2015 química segundaevaluacion version cero nutricion(775180194) 1 s 2015 química segundaevaluacion version cero nutricion
(775180194) 1 s 2015 química segundaevaluacion version cero nutricion
 
Don't Gamble With Your Data
Don't Gamble With Your DataDon't Gamble With Your Data
Don't Gamble With Your Data
 
Predictive analytics from a to z
Predictive analytics from a to zPredictive analytics from a to z
Predictive analytics from a to z
 
Jamaica
JamaicaJamaica
Jamaica
 
Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
Integrating R and the JVM Platform - Alpine Data Labs' R Execute OperatorIntegrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
 
Steven Hillion Presents, "Why Women are Better Data Scientists."
Steven Hillion Presents, "Why Women are Better Data Scientists."Steven Hillion Presents, "Why Women are Better Data Scientists."
Steven Hillion Presents, "Why Women are Better Data Scientists."
 
Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technical
 
Analysis of external communication media of the company
Analysis of external communication media of the companyAnalysis of external communication media of the company
Analysis of external communication media of the company
 
Negative impacts of social media as my space and facebook on teenagers in th...
Negative impacts of social media as my space and facebook on teenagers  in th...Negative impacts of social media as my space and facebook on teenagers  in th...
Negative impacts of social media as my space and facebook on teenagers in th...
 

Similar to Alpine innovation final v1.0

2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setKognitio
 
Open-Falcon: A Distributed and High-Performance Monitoring System
Open-Falcon: A Distributed and High-Performance Monitoring SystemOpen-Falcon: A Distributed and High-Performance Monitoring System
Open-Falcon: A Distributed and High-Performance Monitoring SystemYao-Wei Ou
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...InfluxData
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesTuhin Mahmud
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKJan Wiegelmann
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLDESMOND YUEN
 
Oracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOTN Systems Hub
 
HANA SPS07 App Function Library
HANA SPS07 App Function LibraryHANA SPS07 App Function Library
HANA SPS07 App Function LibrarySAP Technology
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousingSneha Challa
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 




Alpine innovation final v1.0

  • 1. Learn more about Advanced Analytics at http://www.alpinenow.com Innovation on DB Tsai dbtsai@alpinenow.com Sung Chung schung@alpinenow.com Machine Learning Engineering @AlpineDataLabs August 14, 2014
  • 2. Alpine Data Labs • Advanced Analytic Software Company – Founded in 2011 – Agile Advanced Analytics, Collaboration and Management at Enterprise Scale – Partnerships with EMC, Pivotal, MapR, Cloudera, QlikView and Tableau • 50+ employees, based in San Francisco – Machine Learning, Statistics and Big Data (Stanford, Berkeley, MIT) • Growing in excess of 200% YOY with a broad international customer base – Financial Services, Online Media, Government, Retail, Manufacturing…
  • 3. Advanced Analytics on Big Data Alpine Data Labs. Confidential and Proprietary. Timeframe of Relevance Work independently and re-use data scientist work. Collaborate across functions and teams. Iterate quickly. Scalable Business Analytics Allowing the Enterprise to manage “Data as an Asset.” Scale and guard data practices Data Science Productivity Work faster, safer, in a more open manner. Industry leading machine learning algorithms built natively for parallel processing. ALPINE CHORUS 4.0 ENTERPRISE DATA ENVIRONMENT Data Scientist Database Analyst Data Engineer Business Analyst Campaign Manager Sales Division Customer Success Product Manager
  • 4. The Path to Innovation: TRADITIONAL DESKTOP; IN-DATABASE METHODS; WEB-BASED AND COLLABORATIVE; SIMPLIFIED CODE-FREE; HADOOP & MPP DATABASE; ONGOING INNOVATION
  • 5. The Path to Innovation • Iterative algorithms scan through the data each time. • With Spark, data is cached in memory after first iteration. • Quasi-Newton methods enhance in-memory benefits. [Figure labels: 921s, 97s, 150m rows]
  • 6. Machine Learning in the Big Data Era • Hadoop Map Reduce solutions • MapReduce scales well for batch processing • Lots of machine learning algorithms are iterative by nature • People use many tricks, such as training on subsamples of the data and then averaging the models. Why have big data if you’re only approximating?
  • 7. Lightning-fast cluster computing • Empower users to iterate through the data by utilizing the in-memory cache. • Logistic regression runs up to 100x faster than Hadoop M/R in memory. • We’re able to train exact models without doing any approximation.
  • 8. Why does Alpine support MLlib? • MLlib is a Spark subproject providing Machine Learning primitives. • It’s built on Apache Spark, a fast and general engine for large-scale data processing. • Shipped with Apache Spark since version 0.8 • High quality engineering design and effort • More than 50 contributors since July 2014 • Alpine is 100% committed to open source to facilitate industry adoption that is driven by business needs.
  • 9. AutoML • Success of machine learning crucially relies on human machine learning experts, who select appropriate features, workflows, paradigms, algorithms, and their hyper-parameters. • Even though the hyper-parameters can be chosen by grid search with cross-validation, a problem with more than two parameters becomes very difficult. It’s a non-convex optimization problem. • There is a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. - AutoML workshop @ ICML’14
  • 10. Random Forest • An ensemble learning method for classification & regression that operates by constructing a multitude of decision trees at training time. • A “black box” without too much tuning and it can automatically identify the structure, interactions, and relationships in the data. • A technique to reduce the variance of single decision tree predictions by averaging the predictions of many de-correlated trees. • De-correlation is achieved through Bagging and / or randomly selecting features per tree node. NOTE: Most Kaggle competitions have at least one top entry that heavily uses Random Forests.
  • 11. Sequoia Forest Why Sequoia Forest? MLlib already has a decision tree implementation, but it doesn’t support random features and is not optimized to train on large clusters. What does Sequoia Forest do? • Classification and Regression. • Numerical and Categorical Features. What’s next? Gradient Boosting Where can you find it? https://github.com/AlpineNow/SparkML2 We’re merging it back into MLlib; it is licensed under the Apache License. More info: http://spark-summit.org/2014/talk/sequoia-forest-random-forest-of-humongous-trees.
  • 12. SPARK-1157: L-BFGS Optimizer • No, it's not a blender!
  • 13. What is SPARK-1157: L-BFGS Optimizer? • Merged in Spark 1.0 • A popular algorithm for parameter estimation in Machine Learning. • It’s a quasi-Newton method. • The Hessian matrix of second derivatives doesn't need to be evaluated directly. • The Hessian matrix is approximated using gradient evaluations. • It converges much faster than the default optimizer in Spark, Gradient Descent.
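The quasi-Newton idea on slide 13 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the Spark or Breeze implementation: the function names, the Armijo backtracking line search, the memory size of 5, and the ill-conditioned quadratic test objective are all assumptions made for the example. It builds the search direction from recent gradient differences only (the two-loop recursion), so no Hessian is ever formed.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, history):
    """Two-loop recursion: approximate -(inverse Hessian) * grad from (s, y) pairs."""
    q = list(grad)
    saved = []
    for s, y in reversed(history):            # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        saved.append((a, rho, s, y))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if history:                                # scale by curvature of the newest pair
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for a, rho, s, y in reversed(saved):       # oldest pair first
        b = rho * dot(y, q)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def minimize(f, grad_f, x0, use_lbfgs, memory=5, tol=1e-8, max_iter=10000):
    """Return (solution, iterations). With use_lbfgs=False this is plain gradient descent."""
    x = list(x0)
    g = grad_f(x)
    history = deque(maxlen=memory)
    for iteration in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            return x, iteration
        d = lbfgs_direction(g, history) if use_lbfgs else [-gi for gi in g]
        fx, slope, step = f(x), dot(g, d), 1.0
        # Backtracking line search with the Armijo sufficient-decrease condition.
        while step > 1e-16 and f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:                  # keep only pairs with positive curvature
            history.append((s, y))
        x, g = x_new, g_new
    return x, max_iter


# Hypothetical test problem: a quadratic whose curvatures differ by a factor of 100.
CURVATURE = [1.0, 10.0, 100.0]


def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURE, x))


def grad_f(x):
    return [c * xi for c, xi in zip(CURVATURE, x)]


x_lbfgs, it_lbfgs = minimize(f, grad_f, [1.0, 1.0, 1.0], use_lbfgs=True)
x_gd, it_gd = minimize(f, grad_f, [1.0, 1.0, 1.0], use_lbfgs=False)
print("L-BFGS iterations:", it_lbfgs, "gradient descent iterations:", it_gd)
```

Both runs share the same line search, so the difference in iteration counts comes only from the curvature information that the stored (s, y) pairs provide.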
  • 15. SPARK-2934: LogisticRegressionWithLBFGS • Merged in Spark 1.1 • Uses L-BFGS to train Logistic Regression instead of the default Gradient Descent. • Users don't have to construct their own objective function for Logistic Regression or implement all of the details themselves. • Together with SPARK-2979 to minimize the condition number, the convergence rate is further improved.
  • 17. a9a Dataset Benchmark • Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements) • 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: Log-Likelihood / Number of Samples vs. Seconds; series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Sparse Features, GD Dense Features]
  • 18. a9a Dataset Benchmark • Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements) • 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: Log-Likelihood / Number of Samples vs. Iterations; series: L-BFGS, GD]
  • 19. rcv1 Dataset Benchmark • Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements) • 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: Log-Likelihood / Number of Samples vs. Seconds; series: LBFGS Sparse Vector, GD Sparse Vector]
  • 20. news20 Dataset Benchmark • Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements) • 16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: Log-Likelihood / Number of Samples vs. Seconds; series: LBFGS Sparse Vector, GD Sparse Vector]
  • 21. SPARK-2979: Improve the convergence rate by standardizing the training features • Merged in Spark 1.1 • Due to the invariance property of MLEs, the scale of your inputs is irrelevant. • However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling. • The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users. • Without this, some training datasets that mix columns with different scales may not be able to converge. • Scikit and the glmnet package also standardize the features before training to improve the convergence. • Only enabled in Logistic Regression for now.
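The "train in scaled space, report in original space" step on slide 21 can be illustrated with a small sketch. It assumes each feature is divided by its sample standard deviation without centering, so the intercept is unchanged; that choice, the toy data, and the coefficient values are assumptions for the example, not details taken from the slides.

```python
import math


def column_std(rows):
    """Sample standard deviation of each column."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(dims)]


def margin(weights, intercept, x):
    """Linear margin w.x + b that feeds the logistic function."""
    return sum(w * xi for w, xi in zip(weights, x)) + intercept


# Toy data with two columns on very different scales.
rows = [[1200.0, 0.002], [850.0, 0.004], [1900.0, 0.001], [400.0, 0.007]]
std = column_std(rows)
scaled_rows = [[xi / s for xi, s in zip(r, std)] for r in rows]

# Pretend the optimizer returned these coefficients in the scaled space.
w_scaled, intercept = [0.8, -1.5], 0.3

# Dividing each coefficient by its column's std maps it back to the original space.
w_original = [w / s for w, s in zip(w_scaled, std)]

for raw, scaled in zip(rows, scaled_rows):
    print(margin(w_scaled, intercept, scaled), margin(w_original, intercept, raw))
```

Because the margins agree on every row, predictions made with `w_original` on raw features match those of the model trained on scaled features, which is why the scaling can stay invisible to users.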
  • 22. SPARK-2272: Transformer • A spark, the soul of a transformer
  • 23. SPARK-2272: Transformer • Merged in Spark 1.1 • MLlib data preprocessing pipeline. • StandardScaler: standardizes features by removing the mean and scaling to unit variance. The RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance. • Normalizer: normalizes samples individually to unit L^n norm. A common operation for text classification or clustering, for instance. For example, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors.
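The two transformers on slide 23 can be sketched in plain Python as follows. This is not the MLlib API: the function names, the use of the sample (n-1) variance, and the toy vectors are assumptions for illustration. The last lines check the slide's remark that the dot product of two l2-normalized vectors equals their cosine similarity.

```python
import math


def standardize(rows):
    """Column-wise: subtract the mean and divide by the sample standard deviation."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(dims)]
    return [[(r[j] - means[j]) / stds[j] for j in range(dims)] for r in rows]


def normalize(vector, p=2.0):
    """Row-wise: rescale one sample to unit L^p norm (zero vectors are left as-is)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / norm for v in vector] if norm > 0 else list(vector)


def cosine_similarity(a, b):
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)


features = [[1.0, 200.0], [2.0, 240.0], [3.0, 310.0], [6.0, 250.0]]
standardized = standardize(features)

doc_a, doc_b = [0.0, 1.5, 2.0, 0.5], [1.0, 0.5, 2.5, 0.0]
dot_of_normalized = sum(x * y for x, y in zip(normalize(doc_a), normalize(doc_b)))
print(dot_of_normalized, cosine_similarity(doc_a, doc_b))
```

Standardization works per column across samples, while normalization works per sample across columns, so the two are complementary rather than interchangeable.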
  • 24. StandardScaler
  • 25. Normalizer
  • 26. SPARK-1969: Online summarizer • Merged in Spark 1.1 • Online algorithms for computing the mean, variance, min, and max in a streaming fashion. • Two online summarizers can be merged, so we can use one summarizer for each block of data in the map phase and merge all of them in the reduce phase to obtain the global summarizer. • A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance • Optimized for sparse vectors; the time complexity is O(non-zeros) instead of O(numCols) for each sample. [Formula images: two-pass algorithm, naive algorithm]
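The mergeable one-pass summarizer described on slide 26 can be sketched for a single scalar column. The class below is a hypothetical illustration, not the MLlib class: it uses Welford's update for each new value and the pairwise merge formula from the Wikipedia article the slide cites, and it ignores the sparse-vector optimization.

```python
import math


class OnlineSummarizer:
    """One-pass, mergeable summary of a scalar stream (count, mean, variance, min, max)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # Welford update, avoids sum-of-squares cancellation
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        if other.n == 0:
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# "Map" phase: one summarizer per block of data; "reduce" phase: merge them.
blocks = [[1e9 + 4, 1e9 + 7], [1e9 + 13, 1e9 + 16]]
partials = []
for block in blocks:
    summary = OnlineSummarizer()
    for value in block:
        summary.add(value)
    partials.append(summary)

merged = OnlineSummarizer()
for partial in partials:
    merged.merge(partial)
print(merged.n, merged.mean, merged.variance, merged.min, merged.max)
```

The large common offset in the sample values is the kind of input where the naive "sum of squares minus squared sum" formula loses precision, which is the cancellation problem the slide refers to.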
  • 29. SPARK-2479: MLlib UnitTests • Merged in Spark 1.1 • Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors. • Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result. • That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored. • Scala syntax-sugar comparators are implemented using implicit conversions, making unit tests easier for developers to write.
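The point of slide 29 can be shown with a short Python sketch. The Scala comparators themselves are not reproduced in this transcript, so the helper below is a hypothetical stand-in built on the standard library's `math.isclose`; it only illustrates why tests should compare with an absolute or relative tolerance instead of `==`.

```python
import math


def approx_equal(actual, expected, rel_tol=0.0, abs_tol=0.0):
    """True when the values agree within the given relative or absolute tolerance."""
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)


total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation.
exact_match = (total == 0.3)

# An absolute tolerance suits values near 1; a relative one suits large magnitudes.
close_abs = approx_equal(total, 0.3, abs_tol=1e-12)
close_rel = approx_equal(1e10 + 1.0, 1e10, rel_tol=1e-9)
far_abs = approx_equal(1e10 + 1.0, 1e10, abs_tol=1e-12)

print(exact_match, close_abs, close_rel, far_abs)
```

Choosing between the two tolerances is a per-test decision: an absolute tolerance that is sensible for coefficients near zero is far too strict for quantities in the billions.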
  • 31. SPARK-1892: OWL-QN Optimizer (ongoing work) • It extends L-BFGS to handle L2 and L1 regularizations together (balanced with alpha as in elastic nets). • We fixed a couple of issues (#247) in Breeze's OWLQN implementation, and this work is based on that. • Blocked by SPARK-2505
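For reference, the combined penalty mentioned on slide 31 is commonly written as below. This follows the glmnet convention and is an assumption about what "balanced with alpha" means here; the symbols $L(w)$ (the unregularized loss) and $\lambda$ (the overall regularization strength) are not named in the slides.

```latex
\min_{w}\; L(w) + \lambda \left[ \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right], \qquad 0 \le \alpha \le 1
```

With $\alpha = 1$ this reduces to pure L1 (lasso) and with $\alpha = 0$ to pure L2 (ridge); the L1 term is not differentiable at zero, which is why plain L-BFGS needs the OWL-QN extension.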
  • 32. SPARK-2505: Weighted Regularization (ongoing work) • Each component of the weights can be penalized differently. • We can exclude the intercept from regularization in this framework. • Decouples regularization from the raw gradient update, which is not used in other optimization schemes. • Allows various update/learning rate schemes (adagrad, normalized adaptive gradient, etc.) to be applied independently of the regularization. • Smooth and L1 regularization will be handled differently in the optimizer.
  • 33. SPARK-2309: Multinomial Logistic Regression (ongoing work) • For a K-class multinomial problem, we can generalize via K-1 linear models with logit link functions. • As a result, the weights will have dimension (K-1)(N+1), where N is the number of features. • The MLlib interface is designed for one set of parameters per model, so it requires some interface design changes. • Expected to be merged in the next release of MLlib, Spark 1.2. Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
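The K-1 linear model formulation on slide 33 can be sketched as follows. The layout of the flat weight array (one block of N feature weights followed by an intercept per non-pivot class, with class 0 as the pivot) is an assumption for this example, as are the function name and the sample numbers.

```python
import math


def predict_probabilities(weights, x, num_classes):
    """Class probabilities from K-1 linear models; class 0 is the pivot with margin 0."""
    n = len(x)
    if len(weights) != (num_classes - 1) * (n + 1):
        raise ValueError("expected (K-1)(N+1) weights")
    margins = [0.0]                                   # pivot class
    for k in range(num_classes - 1):
        block = weights[k * (n + 1):(k + 1) * (n + 1)]
        margins.append(sum(w * xi for w, xi in zip(block, x)) + block[n])
    shift = max(margins)                              # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model: K = 3 classes, N = 2 features, so (3-1)*(2+1) = 6 weights.
weights = [0.5, -1.0, 0.2,    # class 1: two feature weights, then intercept
           1.5, 0.3, -0.7]    # class 2: two feature weights, then intercept
probabilities = predict_probabilities(weights, [1.0, 2.0], num_classes=3)
print(probabilities)
```

For K = 2 the array collapses to a single block of N+1 values, which is the ordinary binary logistic regression model and explains why the existing one-parameter-set interface fits that case but not the general one.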
  • 34. We’re open source friendly and tech driven! • Technology we're using: Scala/Java, Akka, Spray, Hadoop, Spark/SparkSQL, Pig, Sqoop, JavaScript, D3.js, etc. • Actively involved in the open source community: almost all of our newly developed algorithms in Spark will be contributed back to MLlib. • Actively developing application to/from Spark YARN communication infrastructure (application lifecycle, error reporting, progress monitoring, interactive commands, etc.). • In addition to Spark, we are the maintainers of several open source projects including Chorus, an SBT plugin for JUnit test Listener, and an Akka-based R engine. • Weekly tech/ML talks. Speakers: David Hall (author of Breeze), Heather Miller (student of Martin Odersky), H.Y. Li (author of Tachyon), Jason Lee (student of Trevor Hastie), etc. • Organizes the SF Machine Learning meetup (2k+ members). Speakers: Andrew Ng (Stanford), Michael Jordan (Berkeley), Xiangrui Meng (Databricks), Sandy Ryza (Cloudera), etc.
  • 35. We're hiring! • Machine Learning Engineer • Data Scientist • UI/UX Engineer • Platform Engineer • Automation Test Engineer Shoot me an email at dbtsai@alpinenow.com
  • 36. For more information, contact us: 1550 Bryant Street, Suite 1000, San Francisco, CA 94103 USA, +1 (877) 542-0062, www.alpinenow.com. Get Started Today! http://start.alpinenow.com