Large-Scale Machine Learning with 
DB Tsai 
Machine Learning Engineering Lead @ AlpineDataLabs 
Internet of Things Conference @ Moscone Center, SF 
http://www.iotaconf.com/ 
October 20, 2014 
Learn more about Advanced Analytics at http://www.alpinenow.com
The Path to Innovation 
• Traditional desktop 
• In-database methods 
• Web-based and collaborative 
• Simplified, code-free 
• Hadoop & MPP database 
• Ongoing innovation 
The Path to Innovation 
• Iterative algorithms scan through the data each time. 
• With Spark, data is cached in memory after the first iteration. 
• Quasi-Newton methods enhance in-memory benefits. 
[Figure: run-time comparison on 150m rows; recovered labels: 921s and 97s]
Machine Learning in the Big Data Era 
• Hadoop Map Reduce solutions 
• MapReduce scales well for batch processing 
• Lots of machine learning algorithms are iterative by nature 
• People resort to tricks such as training on sub-samples of the 
data and then averaging the models. Why have big data if you’re only 
approximating? 
Lightning-fast cluster computing 
• Empower users to iterate 
through the data by utilizing 
the in-memory cache. 
• Logistic regression runs up 
to 100x faster than Hadoop 
M/R in memory. 
• We’re able to train exact 
models without doing any 
approximation. 
Why MLlib? 
• MLlib is a Spark subproject providing Machine Learning 
primitives 
• It’s built on Apache Spark, a fast and general engine for 
large-scale data processing 
• Shipped with Apache Spark since version 0.8 
• High quality engineering design and effort 
• More than 50 contributors since July 2014 
Algorithms supported in MLlib 
• Classification: SVMs, logistic regression, decision trees, 
naïve Bayes, and random forests 
• Regression: linear regression, and random forests 
• Collaborative filtering: alternating least squares (ALS) 
• Clustering: k-means 
• Dimensionality reduction: singular value decomposition 
(SVD), and principal component analysis (PCA) 
• Basic statistics: summary statistics, correlations, stratified 
sampling, hypothesis testing, and random data generation 
• Feature extraction and transformation: TF-IDF, Word2Vec, 
StandardScaler, and Normalizer 
MapReduce Review 
• MapReduce – Simplified Data Processing on Large 
Clusters, 2004. 
• Scales Linearly 
• Data Locality 
• Fault Tolerance in Data Storage and Computation 
Hadoop MapReduce Review 
• Mapper: Loads the data and emits a set of key-value pairs. 
• Reducer: Collects the key-value pairs with the same key, processes them, 
and outputs the result. 
• Combiner: Can reduce shuffle traffic by combining key-value pairs 
locally before they go to the reducer. 
• In-Mapper Combiner: Aggregates the results on the mapper side, 
using an LRU cache to avoid running out of heap space. 
http://alpinenow.com/blog/in-mapper-combiner/ 
• Good: Built in fault tolerance, scalable, and production proven in 
industry. 
• Bad: Optimized for disk IO without leveraging memory well; iterative 
algorithms go through disk IO again and again; the primitive API is not 
easy or clean to develop against. 
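The in-mapper combiner idea can be sketched in plain Python (this is not Hadoop code). The function names, the `max_keys` bound, and the word-count task are assumptions made for illustration: counts are aggregated inside the mapper, and the least recently used key is emitted early once the cache grows past the bound.

```python
from collections import OrderedDict

def mapper_with_in_mapper_combiner(lines, max_keys=1000):
    """Yield (word, partial_count) pairs, combining counts inside the mapper."""
    cache = OrderedDict()  # insertion order doubles as recency order
    for line in lines:
        for word in line.split():
            # Remove and re-insert so the key becomes the most recent one.
            cache[word] = cache.pop(word, 0) + 1
            if len(cache) > max_keys:
                # Emit the least recently used key so memory stays bounded.
                yield cache.popitem(last=False)
    # Flush whatever is still cached once the input is exhausted.
    yield from cache.items()

def reducer(pairs):
    """Sum the partial counts for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

lines = ["a b a", "b a c"]
counts = reducer(mapper_with_in_mapper_combiner(lines, max_keys=2))
```

A key evicted early may be emitted more than once, so the reducer still has to sum partial counts.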
Spark MapReduce 
• Spark also uses MapReduce as a programming model but 
with much richer APIs in Scala, Java, and Python. 
• With Scala’s expressive APIs, 5-10x less code. 
• Not just a distributed computation framework, Spark provides 
several pre-built components that help users implement 
applications faster and more easily. 
- Spark Streaming 
- Spark SQL 
- MLlib (Machine Learning) 
- GraphX (Graph Processing) 
Resilient Distributed Datasets (RDDs) 
• RDD is a fault-tolerant collection of elements that can be 
operated on in parallel. 
• RDDs can be created by parallelizing an existing 
collection in your driver program, or referencing a dataset 
in an external storage system, such as a shared 
filesystem, HDFS, HBase, Hive, or any data source 
offering a Hadoop InputFormat. 
• RDDs can be cached in memory or on disk 
Hadoop M/R vs Spark M/R 
• Hadoop 
• Spark 
RDD Operations - two types of operations 
• Transformations: Create a new dataset from 
an existing one. They are lazy, in that they do 
not compute their results right away. 
• Actions: Return a value to the driver program 
after running a computation on the dataset. 
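As a rough analogy only (plain Python iterators, not the Spark API), the lazy/eager split looks like this. Python's built-in `map` and `filter` build a pipeline without running it, and consuming the pipeline plays the role of an action. The `calls` list is a hypothetical probe added to show when the function actually runs.

```python
calls = []

def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x

data = [1, 2, 3, 4]

# "Transformations": build a lazy pipeline; nothing is computed yet.
squared = map(square, data)
evens = filter(lambda v: v % 2 == 0, squared)
calls_before_action = len(calls)

# "Action": consuming the pipeline triggers the computation.
total = sum(evens)
calls_after_action = len(calls)
```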
Transformations 
• map(func) - Return a new distributed dataset formed by passing each 
element of the source through a function func. 
• filter(func) - Return a new dataset formed by selecting those elements of the 
source on which func returns true. 
• flatMap(func) - Similar to map, but each input item can be mapped to 0 or 
more output items (so func should return a Seq rather than a single item). 
• mapPartitions(func) - Similar to map, but runs separately on each partition 
(block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when 
running on an RDD of type T. 
• groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a 
dataset of (K, Iterable<V>) pairs. 
• reduceByKey(func, [numTasks]) – When called on a dataset of (K, V) pairs, 
returns a dataset of (K, V) pairs where the values for each key are 
aggregated using the given reduce function func, which must be of type (V,V) 
=> V. 
http://spark.apache.org/docs/latest/programming-guide.html#transformations 
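The semantics of a few of these transformations can be sketched on ordinary Python lists. The helper names below are hypothetical stand-ins, not Spark functions, and everything runs eagerly on one machine, so this only shows what each operation returns, not how Spark distributes it.

```python
from collections import defaultdict
from functools import reduce

def flat_map(func, items):
    """Each input item maps to zero or more output items."""
    return [out for item in items for out in func(item)]

def group_by_key(pairs):
    """(K, V) pairs -> {K: [V, ...]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(func, pairs):
    """(K, V) pairs -> {K: V}, aggregating values with func of type (V, V) => V."""
    return {key: reduce(func, values)
            for key, values in group_by_key(pairs).items()}

lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)               # flatMap
pairs = [(w, 1) for w in words]                  # map
long_words = [w for w in words if len(w) > 2]    # filter
word_counts = reduce_by_key(lambda a, b: a + b, pairs)
```

The last three lines are the classic word count: split lines into words, map each word to `(word, 1)`, then sum per key.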
Actions 
• reduce(func) - Aggregate the elements of the dataset 
using a function func (which takes two arguments and 
returns one). The function should be commutative and 
associative so that it can be computed correctly in 
parallel. 
• collect() - Return all the elements of the dataset as an 
array at the driver program. This is usually useful after a 
filter or other operation that returns a sufficiently small 
subset of the data. 
• count(), first(), take(n), saveAsTextFile(path), etc. 
http://spark.apache.org/docs/latest/programming-guide.html#actions 
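Why reduce needs a commutative and associative function can be seen in a small plain-Python sketch. `parallel_reduce` is a hypothetical helper that reduces each partition separately and then combines the partial results, which is roughly what happens when data is split across workers. Addition gives the same answer for any partitioning; subtraction does not.

```python
from functools import reduce

def parallel_reduce(func, partitions):
    """Reduce each partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# The same six numbers, partitioned in two different ways.
split_a = [[1, 2, 3], [4, 5, 6]]
split_b = [[1, 2], [3, 4], [5, 6]]

add = lambda a, b: a + b   # commutative and associative
sub = lambda a, b: a - b   # neither

sum_a = parallel_reduce(add, split_a)
sum_b = parallel_reduce(add, split_b)
diff_a = parallel_reduce(sub, split_a)
diff_b = parallel_reduce(sub, split_b)
```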
RDD Persistence/Cache 
• An RDD can be persisted using its persist() or cache() 
method. The first time it is computed in an action, it 
will be kept in memory on the nodes. Spark’s cache is 
fault-tolerant: if any partition of an RDD is lost, it will 
automatically be recomputed using the transformations 
that originally created it. 
• A persisted RDD can be stored using a different storage 
level, allowing you, for example, to persist the dataset on 
disk, or to persist it in memory as serialized Java objects 
(to save space). 
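The cache-then-recompute behaviour can be mimicked with a toy class. `CachedDataset`, `lose_cache`, and the `computations` counter are hypothetical names for illustration and say nothing about how Spark implements persistence. The data is computed on the first action, served from memory afterwards, and rebuilt from its defining function if the cached copy is lost.

```python
class CachedDataset:
    """Toy stand-in for a persisted dataset."""

    def __init__(self, compute):
        self._compute = compute   # the "lineage": how to rebuild the data
        self._cache = None
        self.computations = 0     # how many times the data was (re)built

    def collect(self):
        if self._cache is None:
            self._cache = list(self._compute())
            self.computations += 1
        return self._cache

    def lose_cache(self):
        self._cache = None        # simulate losing the in-memory copy

ds = CachedDataset(lambda: (x * x for x in range(5)))
first = ds.collect()
second = ds.collect()
after_two_actions = ds.computations   # computed once, reused once
ds.lose_cache()
third = ds.collect()                  # rebuilt from the lineage
```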
RDD Storage Level 
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. 
If the RDD does not fit in memory, some partitions will not be cached 
and will be recomputed on the fly each time they're needed. This is the 
default level. 
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the 
JVM. If the RDD does not fit in memory, store the partitions that don't fit 
on disk, and read them from there when they're needed. 
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one 
byte array per partition). This is generally more space-efficient than 
deserialized objects, especially when using a fast serializer, but more 
CPU-intensive to read. 
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but 
spill partitions that don't fit in memory to disk instead of recomputing 
them on the fly each time they're needed. 
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence 
Word Count Example in Scala 
API design philosophy in MLlib 
• Works seamlessly with Spark Core and Spark SQL; users can use 
the core APIs or Spark SQL for data pre-processing, and then pipe the result into 
the training step. 
• Algorithms are implemented in Scala. Public interfaces don’t use 
advanced Scala features, to ensure Java compatibility. 
• Many MLlib APIs have Python bindings. 
• MLlib is under active development. The APIs marked Experimental/ 
DeveloperApi may change in future releases, and a migration guide 
will be provided if they change. 
• The APIs are well documented and designed to be expressive. 
• The code is well tested, with comprehensive unit test coverage. There are lots 
of comments in the code, and it’s an enjoyable experience to read the 
code. 
Data Types 
• MLlib local vectors and local matrices currently 
wrap the Breeze implementation; as a result, the underlying linear algebra 
operations are provided by Breeze and jblas. 
https://github.com/scalanlp/breeze 
• However, the methods converting MLlib vectors/matrices to Breeze ones, or the 
other way around, are private to the org.apache.spark.mllib scope. This 
restriction can be worked around by placing your custom code in an 
org.apache.spark.mllib.something package. 
• A training sample used in supervised learning is stored in a LabeledPoint, 
which contains a label/response and a dense or sparse feature vector. 
• Distributed RowMatrix: basically an RDD[Vector], which doesn’t have 
meaningful row indices. 
• Distributed IndexedRowMatrix: similar to RowMatrix, but each row 
is represented by its index and a local vector. 
Local vector 
The base class of local vectors is Vector, and we provide two implementations: 
DenseVector and SparseVector. 
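The difference between the two representations can be sketched in plain Python. The `(size, indices, values)` tuple layout and the helper names are assumptions for illustration, not the MLlib classes. A sparse vector stores only its non-zero entries, so a dot product only has to touch those.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def sparse_dot(indices, values, dense):
    """Dot product that touches only the stored (non-zero) entries."""
    return sum(v * dense[i] for i, v in zip(indices, values))

dense_vec = [1.0, 0.0, 3.0]
sparse_vec = (3, [0, 2], [1.0, 3.0])   # same vector: (size, indices, values)
weights = [0.5, 2.0, -1.0]

as_dense = sparse_to_dense(*sparse_vec)
dot = sparse_dot(sparse_vec[1], sparse_vec[2], weights)
```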
Some useful tips related to local vector 
• If you want to use native Breeze functionality, you can 
have your code in org.apache.spark.mllib package. 
Real code in MLlib in MultivariateOnlineSummarizer 
LabeledPoint 
• Double is used for storing the label, so we can use the labeled points 
in both regression and classification. For binary classification, a label 
should be either 0.0 or 1.0. For N-class classification, labels should 
be class indices starting from zero: 0.0, 1.0, 2.0, …, N - 1 
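The label convention can be sketched with a plain-Python stand-in. The `LabeledPoint` tuple below only borrows the name and is not the MLlib class, and `check_class_labels` is a hypothetical helper that verifies labels are class indices 0.0 to N - 1.

```python
from typing import List, NamedTuple

class LabeledPoint(NamedTuple):
    label: float           # a double, usable for regression and classification
    features: List[float]

def check_class_labels(points, num_classes):
    """Return True if every label is a class index in 0.0 .. num_classes - 1."""
    valid = {float(k) for k in range(num_classes)}
    return all(p.label in valid for p in points)

binary = [LabeledPoint(0.0, [1.0, 2.0]), LabeledPoint(1.0, [0.5, 0.1])]
three_class = binary + [LabeledPoint(2.0, [3.0, 1.0])]

binary_ok = check_class_labels(binary, 2)
three_as_binary_ok = check_class_labels(three_class, 2)
three_ok = check_class_labels(three_class, 3)
```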
Supervised Learning 
• Binary Classification: linear SVMs (SGD), logistic regression (L-BFGS 
and SGD), decision trees, random forests (Spark 1.2), and 
naïve Bayes. 
• Multiclass Classification: Decision trees, naïve Bayes (coming 
soon - multinomial logistic regression in GLMNET) 
• Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), 
ridge regression (SGD), decision trees, and random 
forests (Spark 1.2) 
• Currently, the regularization in linear models penalizes all the 
weights, including the intercept, which is not desired in some use cases. 
Alpine has a GLMNET implementation using OWLQN that 
can exactly reproduce the results of R’s GLMNET package, with scalability. 
We’re in the process of merging it into MLlib. 
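The "SGD + soft-threshold" idea behind Lasso can be sketched as follows. This is a generic proximal-gradient step in plain Python, not MLlib's updater. The `intercept_index` parameter is an assumption added to show the intercept being left unpenalized, which the slide describes as the desired behaviour rather than the current one.

```python
def soft_threshold(w, t):
    """Shrink w toward zero by t; values within [-t, t] become exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l1_step(weights, grad, step, reg, intercept_index=None):
    """One gradient step followed by soft-thresholding of the penalized weights."""
    out = []
    for j, (w, g) in enumerate(zip(weights, grad)):
        w_new = w - step * g
        if j != intercept_index:      # leave the intercept unpenalized
            w_new = soft_threshold(w_new, step * reg)
        out.append(w_new)
    return out

# Zero gradient isolates the effect of the threshold (step * reg = 0.05).
updated = l1_step([0.5, -0.02, 1.0], [0.0, 0.0, 0.0],
                  step=0.1, reg=0.5, intercept_index=2)
```

Small weights are set exactly to zero, which is how L1 regularization produces sparse models.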
LinearRegressionWithSGD 
SVMWithSGD 
SPARK-2934: LogisticRegressionWithLBFGS 
• Merged in Spark 1.1 
• Contributed by Alpine Data Labs 
• Uses L-BFGS to train Logistic Regression instead of the 
default Gradient Descent. 
• Users don't have to construct the objective function for 
Logistic Regression themselves, and don't have to implement all 
the details. 
• Together with SPARK-2979, which reduces the condition 
number, the convergence rate is further improved. 
SPARK-2979: Improve the convergence rate by 
standardizing the training features 
l Merged in Spark 1.1 
l Contributed by Alpine Data Labs 
l Due to the invariance property of MLEs, the scale of your inputs are 
irrelevant. 
l However, the optimizer will not be happy with poor condition numbers 
which can often be improved by scaling. 
l The model is trained in the scaled space, but the coefficients are 
converted to original space; as a result, it's transparent to users. 
l Without this, some training datasets mixing the columns with different 
scales may not be able to converge. 
l Scikit and glmnet package also standardize the features before training to 
improve the convergence. 
l Only enable in Logistic Regression for now. 
a9a Dataset Benchmark 
Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements) 
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster 
[Figure: Log-Likelihood / Number of Samples (y-axis) vs. Iterations (x-axis), comparing L-BFGS and GD]
rcv1 Dataset Benchmark 
Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements) 
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster 
[Figure: Log-Likelihood / Number of Samples (y-axis) vs. Seconds (x-axis), comparing LBFGS Sparse Vector and GD Sparse Vector]
news20 Dataset Benchmark 
Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements) 
16 executors in INTEL Xeon E3-1230v3 32GB Memory * 5 nodes Hadoop 2.0.5 alpha cluster 
[Figure: Log-Likelihood / Number of Samples (y-axis) vs. Seconds (x-axis), comparing LBFGS Sparse Vector and GD Sparse Vector]
K-Means 
PCA + K-Means 
Collaborative Filtering 
SPARK-1157: L-BFGS Optimizer 
• No, it’s not a blender! 
What is SPARK-1157: L-BFGS Optimizer 
• Merged in Spark 1.0 
• Contributed by Alpine Data Labs 
• A popular algorithm for parameter estimation in Machine Learning. 
• It’s a quasi-Newton method. 
• The Hessian matrix of second derivatives doesn't need to be evaluated 
directly. 
• The Hessian matrix is approximated using gradient evaluations. 
• It converges much faster than the default optimizer in Spark, 
Gradient Descent. 
• We are contributing OWLQN, a variant of L-BFGS that handles 
L1 problems, to Spark. It’s a building block of GLMNET. 
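How the Hessian gets approximated from gradient evaluations can be sketched with a minimal, single-machine L-BFGS in plain Python. This is an illustrative sketch, not the Spark implementation: the function names, the simple backtracking line search, and the toy quadratic are all assumptions. The "two-loop recursion" applies an inverse-Hessian estimate built only from recent steps and gradient changes.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs(f_grad, x0, history=5, max_iter=100, tol=1e-8):
    """Minimize a smooth function; f_grad(x) returns (value, gradient)."""
    x = list(x0)
    fx, g = f_grad(x)
    pairs = []  # recent (s, y) = (step taken, change in gradient)
    for _ in range(max_iter):
        if sqrt(dot(g, g)) < tol:
            break
        # Two-loop recursion: apply the implicit inverse-Hessian estimate to g.
        q, alphas = list(g), []
        for s, y in reversed(pairs):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if pairs:
            s, y = pairs[-1]
            scale = dot(s, y) / dot(y, y)
        else:
            scale = 1.0
        r = [scale * qi for qi in q]
        for (s, y), a in zip(pairs, reversed(alphas)):
            b = dot(y, r) / dot(y, s)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        direction = [-ri for ri in r]
        # Backtracking line search (sufficient-decrease condition only).
        step, slope = 1.0, dot(g, direction)
        while True:
            x_new = [xi + step * di for xi, di in zip(x, direction)]
            f_new, g_new = f_grad(x_new)
            if f_new <= fx + 1e-4 * step * slope or step < 1e-12:
                break
            step *= 0.5
        s = [xn - xo for xn, xo in zip(x_new, x)]
        y = [gn - go for gn, go in zip(g_new, g)]
        if dot(s, y) > 1e-12:   # keep only pairs with positive curvature
            pairs = (pairs + [(s, y)])[-history:]
        x, fx, g = x_new, f_new, g_new
    return x

def quadratic(x):
    """Toy objective with its minimum at (1, -2)."""
    value = (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
    return value, [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]

solution = lbfgs(quadratic, [0.0, 0.0])
```

No second derivative is ever computed; only `history` pairs of vectors are stored, which is what keeps the method practical for high-dimensional problems.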
SPARK-2505: Weighted Regularization 
ongoing work 
l Each components of weights can be penalized differently. 
l We can exclude intercept from regularization in this framework. 
l Decoupling regularization from the raw gradient update which is 
not used in other optimization schemes. 
l Allow various update/learning rate schemes (adagrad, 
normalized adaptive gradient, etc) to be applied independent of 
the regularization 
l Smooth and L1 regularization will be handled differently in 
optimizer. 
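Per-component penalties can be sketched in plain Python as a weighted L2 term added to a loss and gradient computed elsewhere. The function names and the convention of a zero penalty weight for the intercept are assumptions for illustration, since the SPARK-2505 design was still in progress.

```python
def weighted_l2(weights, reg_weights):
    """Return (penalty, gradient) for 0.5 * sum_j reg_j * w_j^2."""
    penalty = 0.5 * sum(r * w * w for w, r in zip(weights, reg_weights))
    gradient = [r * w for w, r in zip(weights, reg_weights)]
    return penalty, gradient

def regularized(loss, loss_grad, weights, reg_weights):
    """Add the penalty to a raw loss and gradient, keeping the two decoupled."""
    penalty, pen_grad = weighted_l2(weights, reg_weights)
    return loss + penalty, [g + p for g, p in zip(loss_grad, pen_grad)]

weights = [2.0, -1.0, 4.0]       # last entry plays the role of the intercept
reg_weights = [0.1, 0.5, 0.0]    # 0.0 excludes the intercept from the penalty
penalty, pen_grad = weighted_l2(weights, reg_weights)
total_loss, total_grad = regularized(1.0, [0.0, 0.0, 0.0], weights, reg_weights)
```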
SPARK-2309: Multinomial Logistic Regression 
ongoing work 
l For K classes multinomial problem, we can generalize it via 
K -1 linear models with logist link functions. 
l As a result, the weights will have dimension of (K-1)(N + 1) 
where N is number of features. 
l MLlib interface is designed for one set of paramerters per 
model, so it requires some interface design changes. 
l Expected to be merged in next release of MLlib, Spark 1.2 
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 
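The K - 1 model formulation can be sketched in plain Python. The sketch assumes class 0 is the pivot class with margin fixed at 0 and that each weight row stores its intercept last. The layout and the numbers are made up for illustration.

```python
from math import exp

def multinomial_probs(weights, x):
    """weights: K-1 rows of length N+1 (last entry is the intercept).
    Class 0 is the pivot with margin 0."""
    margins = [sum(w * v for w, v in zip(row[:-1], x)) + row[-1]
               for row in weights]
    top = max([0.0] + margins)                 # subtract the max for stability
    exps = [exp(0.0 - top)] + [exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

K, N = 3, 2
weights = [[0.5, -0.2, 0.1],     # class 1 vs. pivot
           [-1.0, 0.3, 0.0]]     # class 2 vs. pivot
num_params = sum(len(row) for row in weights)
probs = multinomial_probs(weights, [1.0, 2.0])
```

With K = 3 and N = 2 the model has (K - 1)(N + 1) = 6 parameters, and the K probabilities sum to one.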
SPARK-2272: Transformer 
A spark, the soul of a transformer 
SPARK-2272: Transformer 
l Merged in Spark 1.1 
l Contributed by Alpine Data Labs 
l MLlib data preprocessing pipeline. 
l StandardScaler 
- Standardize features by removing the mean and scaling to unit variance. 
- RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear 
models typically works better with zero mean and unit variance. 
l Normalizer 
- Normalizes samples individually to unit L^n norm. 
- Common operation for text classification or clustering for instance. 
- For example, the dot product of two l2-normalized TF-IDF vectors is the 
cosine similarity of the vectors. 
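Both transformers can be sketched in plain Python. These are hypothetical helper functions, not the MLlib classes, and the example vectors are made up. `standardize` works on one feature column, `l2_normalize` works on one sample, and the last lines check the cosine-similarity claim.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale one sample to unit L2 norm (zero vectors are returned unchanged)."""
    norm = sqrt(dot(vec, vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

def standardize(column):
    """Remove the mean of one feature column and scale to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / std for v in column]

a, b = [1.0, 2.0, 0.0], [2.0, 1.0, 2.0]
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
cosine = cosine_similarity(a, b)
scaled = standardize([2.0, 4.0, 6.0])
```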
StandardScaler 
Normalizer
SPARK-1969: Online summarizer 
l Merged in Spark 1.1 
l Contributed by Alpine Data Labs 
l Online algorithms for computing the mean, variance, min, and max in a streaming 
fashion. 
l Two online summerier can be merged, so we can use one summerier for one block of 
data in map phase, and merge all of them in reduce phase to obtain the global 
summarizer. 
l A numerically stable one-pass algorithm is implemented to avoid catastrophic cancellation 
in naive implementation. 
Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance 
Two-pass algorithm 
l Optimized for sparse vector, and the time complexity is O(non-zeors) instead of 
O(numCols) for each sample. 
Learn more about Advanced Analytics at http://www.alpinenow.com 
Naive algorithm
Learn more about Advanced Analytics at http://www.alpinenow.com
Learn more about Advanced Analytics at http://www.alpinenow.com
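A mergeable one-pass summarizer can be sketched in plain Python for a single scalar column. It uses the Welford update and the pairwise merge formula described on the Wikipedia page referenced above. The class name and layout are assumptions and not the MultivariateOnlineSummarizer code, which handles whole vectors.

```python
class OnlineSummarizer:
    """One-pass mean/variance/min/max for one column, mergeable across blocks."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min, self.max = float("inf"), float("-inf")

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # sum of squared deviations so far
        self.min, self.max = min(self.min, x), max(self.max, x)
        return self

    def merge(self, other):
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        self.min, self.max = min(self.min, other.min), max(self.max, other.max)
        return self

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# "Map phase": one summarizer per block of data.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
partials = []
for block in blocks:
    summary = OnlineSummarizer()
    for value in block:
        summary.add(value)
    partials.append(summary)

# "Reduce phase": merge the partial summarizers into a global one.
global_summary = OnlineSummarizer()
for partial in partials:
    global_summary.merge(partial)
```

The merged result matches what a single pass over all nine values would give, without ever forming the sum of squares that causes cancellation in the naive formula.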
Spark SQL 
• Spark SQL allows relational queries expressed in SQL, HiveQL, or 
Scala to be executed using Spark. At the core of this component is a 
new type of RDD, SchemaRDD. 
• SchemaRDDs are composed of Row objects, along with a schema 
that describes the data types of each column in the row. A 
SchemaRDD is similar to a table in a traditional relational database. 
• A SchemaRDD can be created from an existing RDD, a Parquet file, 
a JSON dataset, or by running HiveQL against data stored in Apache 
Hive. 
http://spark.apache.org/docs/latest/sql-programming-guide.html 
Spark SQL + MLlib 
l With SparkSQL, users can easily load the parquet/ 
avro datasets into Spark, and perform the data pre-processing 
before the training steps. 
l MLlib considers to use schemaRDD as a native 
typed data format, like R’s data-frame. This allows 
us to create output model with types and column 
names, and also be easier to create PMML model. 
Learn more about Advanced Analytics at http://www.alpinenow.com
Example: Prepare training data using Spark SQL 
Interested in MLlib? 
l MLlib official guide - 
https://spark.apache.org/docs/latest/mllib-guide.html 
l Github – https://github.com/apache/spark 
l Mailing lists - user@spark.apache.org 
or dev@spark.apache.org 
Learn more about Advanced Analytics at http://www.alpinenow.com
For more information, contact us 
1550 Bryant Street 
Suite 1000 
San Francisco, CA 94103 
USA 
+1 (877) 542-0062 
www.alpinenow.com 
Get Started Today! 
http://start.alpinenow.com

More Related Content

What's hot

Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
Databricks
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
Spark Summit
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
CloudxLab
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Spark Summit
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Databricks
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Databricks
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
Xiangrui Meng
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
DataWorks Summit
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
Sri Ambati
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
Xiangrui Meng
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
Spark Summit
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Arvind Surve
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
Databricks
 

What's hot (19)

Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 


2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Things Conference

  • 1. Large-Scale Machine Learning with Apache Spark
      DB Tsai, Machine Learning Engineering Lead @ AlpineDataLabs
      Internet of Things Conference @ Moscone Center, SF, http://www.iotaconf.com/
      October 20, 2014
      Learn more about Advanced Analytics at http://www.alpinenow.com
  • 2. The Path to Innovation (diagram labels: Traditional Desktop; In-Database Methods; Web-Based and Collaborative; Simplified Code-Free; Hadoop & MPP Database; Ongoing Innovation)
  • 3. The Path to Innovation
      • Iterative algorithms scan through the data each time.
      • With Spark, data is cached in memory after the first iteration.
      • Quasi-Newton methods enhance in-memory benefits.
      • Chart labels: 921s; 97s; 150m rows.
  • 4. Machine Learning in the Big Data Era
      • Hadoop MapReduce solutions
      • MapReduce scales well for batch processing.
      • Many machine learning algorithms are iterative by nature.
      • People use many workarounds, such as training on sub-samples of the data and then averaging the models. Why have big data if you’re only approximating?
  • 5. Lightning-fast cluster computing
      • Empowers users to iterate through the data by utilizing the in-memory cache.
      • Logistic regression runs up to 100x faster than Hadoop M/R in memory.
      • We’re able to train exact models without doing any approximation.
  • 6. Why MLlib?
      • MLlib is a Spark subproject providing machine learning primitives.
      • It’s built on Apache Spark, a fast and general engine for large-scale data processing.
      • Shipped with Apache Spark since version 0.8.
      • High-quality engineering design and effort.
      • More than 50 contributors since July 2014.
  • 7. Algorithms supported in MLlib
      • Classification: SVMs, logistic regression, decision trees, naïve Bayes, and random forests
      • Regression: linear regression and random forests
      • Collaborative filtering: alternating least squares (ALS)
      • Clustering: k-means
      • Dimensionality reduction: singular value decomposition (SVD) and principal component analysis (PCA)
      • Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation
      • Feature extraction and transformation: TF-IDF, Word2Vec, StandardScaler, and Normalizer
  • 8. MapReduce Review
      • MapReduce: Simplified Data Processing on Large Clusters, 2004.
      • Scales linearly
      • Data locality
      • Fault tolerance in data storage and computation
  • 9. Hadoop MapReduce Review
      • Mapper: Loads the data and emits a set of key-value pairs.
      • Reducer: Collects the key-value pairs that share a key, processes them, and outputs the result.
      • Combiner: Can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
      • In-Mapper Combiner: Aggregates the result on the mapper side, using an LRU cache to avoid running out of heap space. http://alpinenow.com/blog/in-mapper-combiner/
      • Good: Built-in fault tolerance, scalable, and production-proven in industry.
      • Bad: Optimized for disk IO without leveraging memory well; iterative algorithms go through disk IO again and again; the primitive API is neither easy nor clean to develop against.
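The mapper, combiner, and reducer roles on this slide can be sketched in plain Python. This is only an illustration on in-memory lists, not Hadoop code; the function names (`mapper`, `combiner`, `shuffle`, `reducer`, `word_count`) and the toy input are assumptions made for the example. It counts words and reports how many key-value pairs would cross the shuffle with and without a combiner.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Sum counts locally (per mapper) before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(all_pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Collect all values for one key and output the total."""
    return key, sum(values)

def word_count(splits, use_combiner=True):
    shuffled_pairs = []
    for split in splits:  # each split stands in for one mapper's input
        pairs = [p for line in split for p in mapper(line)]
        if use_combiner:
            pairs = combiner(pairs)
        shuffled_pairs.extend(pairs)
    counts = dict(reducer(k, v) for k, v in shuffle(shuffled_pairs).items())
    return counts, len(shuffled_pairs)

splits = [["a b a", "a c"], ["b b", "c a"]]
with_combiner, traffic_with = word_count(splits, use_combiner=True)
without_combiner, traffic_without = word_count(splits, use_combiner=False)
```

Both runs give the same counts, but the run with the combiner sends fewer pairs through the shuffle, which is the point of the combiner bullet.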
  • 10. Spark MapReduce
      • Spark also uses MapReduce as a programming model, but with much richer APIs in Scala, Java, and Python.
      • With Scala’s expressive APIs, 5-10x less code.
      • Not just a distributed computation framework: Spark provides several pre-built components that help users implement applications faster and more easily.
          - Spark Streaming
          - Spark SQL
          - MLlib (Machine Learning)
          - GraphX (Graph Processing)
  • 11. Resilient Distributed Datasets (RDDs)
      • An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
      • RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.
      • RDDs can be cached in memory or on disk.
  • 12. Hadoop M/R vs Spark M/R
      • Hadoop
      • Spark
  • 13. RDD Operations - two types of operations
      • Transformations: Create a new dataset from an existing one. They are lazy, in that they do not compute their results right away.
      • Actions: Return a value to the driver program after running a computation on the dataset.
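The split between lazy transformations and actions can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this example, not Spark's RDD API: `map` and `filter` only record a step, and nothing runs until `collect` or `count` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def _iterate(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    def collect(self):
        # Action: triggers the computation and returns a value to the caller.
        return list(self._iterate())

    def count(self):
        # Action: also triggers the computation.
        return sum(1 for _ in self._iterate())

calls = []

def square(x):
    calls.append(x)  # side effect so we can see when work actually happens
    return x * x

squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
result = squares.collect()         # the action runs the pipeline
```

The `calls` list stays empty until `collect()` is invoked, which mirrors the statement that transformations do not compute their results right away.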
  • 14. Transformations
      • map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.
      • filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.
      • flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
      • mapPartitions(func) - Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
      • groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
      • reduceByKey(func, [numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
      http://spark.apache.org/docs/latest/programming-guide.html#transformations
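The semantics of `flatMap`, `groupByKey`, and `reduceByKey` can be mimicked on ordinary Python lists. The helper names below (`flat_map`, `group_by_key`, `reduce_by_key`) are assumptions for this sketch and run on a single machine; they show only what each operation returns, not how Spark distributes the work.

```python
from collections import defaultdict
from functools import reduce

def flat_map(func, items):
    """Each input item maps to zero or more output items."""
    return [out for item in items for out in func(item)]

def group_by_key(pairs):
    """(K, V) pairs -> {K: [V, ...]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_by_key(func, pairs):
    """(K, V) pairs -> {K: V}, values merged with func of type (V, V) => V."""
    return {key: reduce(func, values)
            for key, values in group_by_key(pairs).items()}

lines = ["spark makes iteration fast", "spark caches data"]
words = flat_map(lambda line: line.split(), lines)   # flatMap
pairs = [(word, 1) for word in words]                # map
long_pairs = [p for p in pairs if len(p[0]) > 4]     # filter
counts = reduce_by_key(lambda a, b: a + b, pairs)    # reduceByKey
```

Chaining the three steps gives a word count, the same shape of pipeline that the deck's Scala word-count slide refers to.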
  • 15. Actions
      • reduce(func) - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
      • collect() - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
      • count(), first(), take(n), saveAsTextFile(path), etc.
      http://spark.apache.org/docs/latest/programming-guide.html#actions
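A small plain-Python experiment shows why the function passed to `reduce` should be commutative and associative. `partitioned_reduce` is a hypothetical helper that reduces each partition separately and then merges the partial results in order, which is a simplification of what a cluster does. Addition gives the same answer for any partition layout; subtraction does not.

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce each partition on its own, then merge the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [1, 2, 3, 4, 5, 6]
layout_a = [[1, 2, 3], [4, 5, 6]]      # the same data split two ways
layout_b = [[1], [2, 3, 4, 5], [6]]

def add(a, b):        # commutative and associative
    return a + b

def subtract(a, b):   # neither commutative nor associative
    return a - b

sum_a = partitioned_reduce(add, layout_a)
sum_b = partitioned_reduce(add, layout_b)
diff_a = partitioned_reduce(subtract, layout_a)
diff_b = partitioned_reduce(subtract, layout_b)
```

Because the partitioning of a dataset is an execution detail, a result that changes with the layout (as `diff_a` and `diff_b` do) is not well defined.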
  • 16. RDD Persistence/Cache
      • An RDD can be persisted by calling its persist() or cache() method. The first time it is computed in an action, it is kept in memory on the nodes. Spark’s cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.
      • A persisted RDD can be stored at a different storage level, allowing you, for example, to persist the dataset on disk, or to persist it in memory as serialized Java objects (to save space).
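The caching behaviour described here (compute on the first action, reuse afterwards, rebuild from the original transformations if the cached data is lost) can be modelled with a toy class. `CachedDataset` and `drop_cache` are names invented for this sketch; the real mechanism works per partition across nodes, which this single-process example does not attempt to show.

```python
class CachedDataset:
    """Toy model of persist(): keep results after the first action, rebuild on loss."""

    def __init__(self, compute):
        self._compute = compute      # the "lineage": how to rebuild the data
        self._cache = None
        self.compute_calls = 0

    def collect(self):
        if self._cache is None:      # first action, or the cache was lost
            self.compute_calls += 1
            self._cache = list(self._compute())
        return list(self._cache)

    def drop_cache(self):
        """Simulate losing the cached data (for example, a node failure)."""
        self._cache = None

dataset = CachedDataset(lambda: (x * 2 for x in range(4)))
first = dataset.collect()    # computed
second = dataset.collect()   # served from the cache
dataset.drop_cache()
third = dataset.collect()    # recomputed from the lineage
```

The data is computed twice in total: once for the first action and once after the simulated loss.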
  • 17. RDD Storage Level
      • MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
      • MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
      • MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
      • MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
      http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
  • 18. Word Count Example in Scala
  • 22. API design philosophy in MLlib
      • Works seamlessly with Spark Core and Spark SQL; users can use the core APIs or Spark SQL for data pre-processing, and then pipe the result into the training step.
      • Algorithms are implemented in Scala. Public interfaces don’t use advanced Scala features, to ensure Java compatibility.
      • Many of the MLlib APIs have Python bindings.
      • MLlib is under active development. The APIs marked Experimental/DeveloperApi may change in future releases; a migration guide will be provided if they change.
      • APIs are well documented and designed to be expressive.
      • Code is well tested, with comprehensive unit test coverage. There are lots of comments in the code, and it’s an enjoyable experience to read.
  • 23. Data Types
      • MLlib local vectors and local matrices currently wrap the Breeze implementation; as a result, the underlying linear algebra operations are provided by Breeze and jblas. https://github.com/scalanlp/breeze
      • However, the methods converting MLlib vectors/matrices to Breeze ones, or the other way around, are private to the org.apache.spark.mllib scope. This restriction can be worked around by placing your custom code in an org.apache.spark.mllib.something package.
      • A training sample used in supervised learning is stored in a LabeledPoint, which contains a label/response and a feature vector in dense or sparse form.
      • Distributed RowMatrix: basically an RDD[Vector], which doesn’t have meaningful row indices.
      • Distributed IndexedRowMatrix: similar to RowMatrix, but each row is represented by its index and a local vector.
  • 24. Local vector The base class of local vectors is Vector, and we provide two implementations: DenseVector and SparseVector.
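As a rough illustration of the two representations, here is a plain-Python sketch (not MLlib's Scala classes; the function names are hypothetical and chosen only for this example). A sparse vector keeps its size plus parallel arrays of indices and values for the non-zero entries, while a dense vector stores every entry.

```python
from typing import List


def sparse_to_dense(size: int, indices: List[int], values: List[float]) -> List[float]:
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def sparse_dot(indices: List[int], values: List[float], dense: List[float]) -> float:
    """Dot product that only touches the stored non-zero entries."""
    return sum(v * dense[i] for i, v in zip(indices, values))


# The same vector (1.0, 0.0, 3.0) in both representations.
dense_vec = [1.0, 0.0, 3.0]
sp_indices, sp_values = [0, 2], [1.0, 3.0]
```

The sparse form pays off when most entries are zero, since both storage and the dot product scale with the number of non-zeros rather than the vector size.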
  • 25. Some useful tips related to local vector • If you want to use native Breeze functionality, you can put your code in the org.apache.spark.mllib package.
  • 26. Real code in MLlib in MultivariateOnlineSummarizer
  • 27. LabeledPoint • Double is used for storing the label, so we can use the labeled points in both regression and classification. For binary classification, a label should be either 0.0 or 1.0. For N-class classification, labels should be class indices starting from zero: 0.0, 1.0, 2.0, …, N - 1
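The labeling convention above can be sketched in plain Python. This is only an illustration: `LabeledPointSketch` and `labels_are_class_indices` are hypothetical names, not MLlib's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledPointSketch:
    """Hypothetical stand-in for a labeled point: a float label plus features."""
    label: float
    features: List[float]


def labels_are_class_indices(points: List[LabeledPointSketch], num_classes: int) -> bool:
    """Check that every label is one of 0.0, 1.0, ..., num_classes - 1."""
    valid = {float(k) for k in range(num_classes)}
    return all(p.label in valid for p in points)


points = [
    LabeledPointSketch(0.0, [1.0, 0.5]),
    LabeledPointSketch(2.0, [0.0, 3.0]),
]
```

Because the label is a float, the same structure can also hold a real-valued regression target such as 3.7.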
  • 28. Supervised Learning • Binary Classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes. • Multiclass Classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET) • Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2) • Currently, the regularization in linear models penalizes all the weights, including the intercept, which is undesirable in some use cases. Alpine has a GLMNET implementation using OWLQN that can exactly reproduce the results of R’s GLMNET package at scale. We’re in the process of merging it into MLlib.
  • 29. LinearRegressionWithSGD
  • 30. SVMWithSGD
  • 31. SPARK-2934: LogisticRegressionWithLBFGS • Merged in Spark 1.1 • Contributed by Alpine Data Labs • Uses L-BFGS to train logistic regression instead of the default gradient descent. • Users don't have to construct their own objective function for logistic regression or implement all the details. • Together with SPARK-2979, which reduces the condition number, the convergence rate is further improved.
  • 32. SPARK-2979: Improve the convergence rate by standardizing the training features • Merged in Spark 1.1 • Contributed by Alpine Data Labs • Due to the invariance property of MLEs, the scale of your inputs is irrelevant. • However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling. • The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users. • Without this, some training datasets that mix columns with different scales may not converge. • Scikit and the glmnet package also standardize the features before training to improve convergence. • Only enabled in Logistic Regression for now.
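The conversion between the scaled and original spaces can be checked with a small plain-Python sketch. It assumes, for brevity, that each feature is only divided by its standard deviation (no mean-centering), which may differ from what MLlib does internally; all names and numbers below are illustrative.

```python
def to_original_space(w_scaled, sigmas):
    """Convert weights learned on x / sigma into weights that apply to raw x."""
    return [w / s for w, s in zip(w_scaled, sigmas)]


def margin(weights, intercept, x):
    """Linear margin w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + intercept


x = [2.0, 300.0]          # raw features on very different scales
sigmas = [0.5, 100.0]     # per-column standard deviations (made up)
w_scaled = [1.5, -2.0]    # weights as if learned in the scaled space
b = 0.25

x_scaled = [xi / s for xi, s in zip(x, sigmas)]
w_orig = to_original_space(w_scaled, sigmas)

# The two margins agree, so predictions are unchanged for the user.
margin_scaled = margin(w_scaled, b, x_scaled)
margin_orig = margin(w_orig, b, x)
```

Since w_scaled · (x / σ) equals (w_scaled / σ) · x, the optimizer gets the better-conditioned problem while the returned model still takes unscaled inputs.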
  • 34. a9a Dataset Benchmark • Logistic Regression with a9a Dataset (11M rows, 123 features, 11% non-zero elements) • 16 executors on Intel Xeon E3-1230v3, 32GB memory × 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: log-likelihood / number of samples vs. iterations, L-BFGS compared with GD]
  • 35. rcv1 Dataset Benchmark • Logistic Regression with rcv1 Dataset (6.8M rows, 677,399 features, 0.15% non-zero elements) • 16 executors on Intel Xeon E3-1230v3, 32GB memory × 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: log-likelihood / number of samples vs. seconds, L-BFGS with sparse vectors compared with GD with sparse vectors]
  • 36. news20 Dataset Benchmark • Logistic Regression with news20 Dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements) • 16 executors on Intel Xeon E3-1230v3, 32GB memory × 5 nodes, Hadoop 2.0.5 alpha cluster • [Plot: log-likelihood / number of samples vs. seconds, L-BFGS with sparse vectors compared with GD with sparse vectors]
  • 37. K-Means
  • 38. PCA + K-Means
  • 39. Collaborative Filtering
  • 40. SPARK-1157: L-BFGS Optimizer • No, it’s not a blender!
  • 41. What is SPARK-1157: L-BFGS Optimizer? • Merged in Spark 1.0 • Contributed by Alpine Data Labs • A popular algorithm for parameter estimation in machine learning. • It’s a quasi-Newton method. • The Hessian matrix of second derivatives doesn't need to be evaluated directly. • The Hessian matrix is approximated using gradient evaluations. • It converges much faster than the default optimizer in Spark, gradient descent. • We are contributing OWLQN, a variant of L-BFGS that handles L1-regularized problems, to Spark. It’s a building block of GLMNET.
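How the Hessian can be approximated from gradients alone is easiest to see in the standard L-BFGS two-loop recursion. The following is a minimal plain-Python sketch of that textbook recursion, not the Breeze-backed code that Spark uses; the function names are hypothetical, and the line search is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: return -H_inv * grad using recent (s, y) pairs.

    s_hist[i] = x_{i+1} - x_i and y_hist[i] = grad_{i+1} - grad_i,
    both ordered oldest first. No Hessian is ever formed.
    """
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest pair.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        alphas.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Scale by an initial inverse-Hessian guess taken from the newest pair.
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest pair.
    for (alpha, rho), s, y in zip(reversed(alphas), s_hist, y_hist):
        beta = rho * dot(y, q)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]
```

With an empty history the result is plain steepest descent. For the one-dimensional quadratic f(x) = 2x², a single (s, y) pair is already enough to recover the Newton step, which is the behaviour that makes the method converge faster than gradient descent.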
  • 43. SPARK-2505: Weighted Regularization (ongoing work) • Each component of the weights can be penalized differently. • We can exclude the intercept from regularization in this framework. • Decouples regularization from the raw gradient update, which is not used in other optimization schemes. • Allows various update/learning rate schemes (AdaGrad, normalized adaptive gradient, etc.) to be applied independently of the regularization. • Smooth and L1 regularization will be handled differently in the optimizer.
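One way to picture per-component penalties is an L2 term with its own regularization parameter for every weight, where a parameter of zero leaves that weight (for example the intercept) unpenalized. This plain-Python sketch only illustrates the idea; it is not the SPARK-2505 design, and the names are hypothetical.

```python
def weighted_l2_penalty(weights, reg_params):
    """0.5 * sum_i reg_i * w_i^2, with one regularization parameter per weight."""
    return 0.5 * sum(r * w * w for w, r in zip(weights, reg_params))


def weighted_l2_gradient(weights, reg_params):
    """Gradient of the penalty above: reg_i * w_i for each component."""
    return [r * w for w, r in zip(weights, reg_params)]


# Two feature weights followed by the intercept; the intercept gets reg = 0.0.
weights = [2.0, -1.0, 5.0]
reg_params = [0.1, 0.1, 0.0]

penalty = weighted_l2_penalty(weights, reg_params)
penalty_grad = weighted_l2_gradient(weights, reg_params)
```

Keeping the penalty and its gradient separate from the loss gradient is what lets a different update rule be swapped in without touching the regularization.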
  • 44. SPARK-2309: Multinomial Logistic Regression (ongoing work) • For a K-class multinomial problem, we can generalize via K - 1 linear models with logit link functions. • As a result, the weights have dimension (K - 1)(N + 1), where N is the number of features. • The MLlib interface is designed for one set of parameters per model, so this requires some interface design changes. • Expected to be merged in the next release of MLlib, Spark 1.2. Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
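The K - 1 model formulation can be sketched in plain Python as follows. The sketch assumes class 0 is the pivot class with its margin fixed at zero and that each weight row stores N feature weights followed by an intercept; both are choices made for this example, not details confirmed by the slide.

```python
import math


def multinomial_probs(weights, x):
    """Class probabilities from K - 1 linear models (class 0 is the pivot).

    weights: K - 1 rows, each with N feature weights followed by an intercept.
    """
    margins = [0.0] + [
        sum(w * xi for w, xi in zip(row[:-1], x)) + row[-1] for row in weights
    ]
    m = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in margins]
    total = sum(exps)
    return [e / total for e in exps]


# K = 3 classes, N = 2 features: (K - 1) * (N + 1) = 6 parameters in total.
weights = [
    [0.5, -1.0, 0.1],
    [2.0, 0.3, -0.4],
]
probs = multinomial_probs(weights, [1.0, 2.0])
num_params = sum(len(row) for row in weights)
```

Fixing one class as the pivot removes the redundancy of having K full weight vectors, which is where the (K - 1)(N + 1) parameter count comes from.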
  • 45. SPARK-2272: Transformer A spark, the soul of a transformer
  • 46. SPARK-2272: Transformer • Merged in Spark 1.1 • Contributed by Alpine Data Labs • MLlib data preprocessing pipeline. • StandardScaler - Standardizes features by removing the mean and scaling to unit variance. - The RBF kernel of support vector machines and the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance. • Normalizer - Normalizes samples individually to unit L^n norm. - A common operation for text classification or clustering, for instance. - For example, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of the vectors.
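The cosine-similarity remark can be verified with a few lines of plain Python. This is an illustrative sketch with hypothetical function names, not MLlib's Normalizer.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
direct_cosine = cosine_similarity(a, b)
```

Both values come out to 24/25, so after normalization a plain dot product is all that is needed to compare documents.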
  • 47. StandardScaler
  • 48. Normalizer
  • 49. SPARK-1969: Online summarizer • Merged in Spark 1.1 • Contributed by Alpine Data Labs • Online algorithms for computing the mean, variance, min, and max in a streaming fashion. • Two online summarizers can be merged, so we can use one summarizer per block of data in the map phase, and merge all of them in the reduce phase to obtain the global summarizer. • A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance • Optimized for sparse vectors: the time complexity is O(non-zeros) instead of O(numCols) for each sample. • [Figures: naive algorithm, two-pass algorithm]
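The merge step can be sketched for a single column in plain Python, using the one-pass update and the pairwise combination formula described in the Wikipedia reference above. This is an illustration only, not MLlib's MultivariateOnlineSummarizer; the class name is hypothetical and the variance shown is the unbiased sample variance.

```python
class OnlineSummary:
    """One-column running summary that supports add() and merge()."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, x):
        """Fold one value into the summary (numerically stable one-pass update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        """Combine with a summary built from a different block of data."""
        if other.n == 0:
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# "Map" phase: one summary per block. "Reduce" phase: merge them.
block_a, block_b = OnlineSummary(), OnlineSummary()
for value in [1.0, 2.0, 3.0]:
    block_a.add(value)
for value in [4.0, 5.0, 6.0]:
    block_b.add(value)
merged = block_a.merge(block_b)
```

Because merging only needs the count, mean, and sum of squared deviations from each side, the blocks can be summarized independently and combined in any order.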
  • 52. Spark SQL • Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. At the core of this component is a new type of RDD, SchemaRDD. • SchemaRDDs are composed of Row objects, along with a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table in a traditional relational database. • A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive. http://spark.apache.org/docs/latest/sql-programming-guide.html
  • 53. Spark SQL + MLlib • With Spark SQL, users can easily load Parquet/Avro datasets into Spark and perform data pre-processing before the training steps. • MLlib is considering using SchemaRDD as a native typed data format, like R’s data frame. This would allow us to create output models with types and column names, and would also make it easier to create PMML models.
  • 55. Example: Prepare training data using Spark SQL
  • 56. Example: Prepare training data using Spark SQL
  • 57. Interested in MLlib? • MLlib official guide - https://spark.apache.org/docs/latest/mllib-guide.html • GitHub - https://github.com/apache/spark • Mailing lists - user@spark.apache.org or dev@spark.apache.org
  • 58. For more information, contact us • 1550 Bryant Street, Suite 1000, San Francisco, CA 94103, USA • +1 (877) 542-0062 • www.alpinenow.com • Get Started Today! http://start.alpinenow.com