SlideShare a Scribd company logo
1 of 40
Download to read offline
Sparse data support in MLlib
Xiangrui Meng
Spark MLlib
MLlib is an Apache Spark component focusing on
machine learning:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8 (Sep 2013)
• 50 contributors
2
Algorithms
• classification: logistic regression, linear support vector machine
(SVM), naive Bayes, classification tree
• regression: generalized linear models (GLMs), regression tree
• collaborative filtering: alternating least squares (ALS), non-negative
matrix factorization (NMF)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal
component analysis (PCA)
3
What’s new in v1.0
• new user guide and code examples
• API stability
• sparse data support
• regression and classification tree
• distributed matrices
• tall-and-skinny PCA and SVD
• L-BFGS
• binary classification model evaluation
4
Sparse data support
“large-scale sparse problems”
Sparsity is almost everywhere
Sparse datasets appear almost everywhere in the world
of big data, where the sparsity may come from many
sources, e.g.,
• feature transformation: 

one-hot encoding, interaction, and bucketing,
• large feature space: n-grams,
• missing data: rating matrix,
• low-rank structure: images and signals.
7
One-hot encoding
Converts a categorical feature to numerical, e.g.,
• country: {Germany, Brazil, Argentina, …}
• Germany -> 0, Brazil -> 1, Argentina -> 2, …
• Germany -> [1, 0, 0, 0, …], Brazil -> [0, 1, 0, 0, …],
Argentina -> [0, 0, 1, 0, …], …
• density: 1/#categories
8
Bucketing
Converts a numerical feature to categorical, e.g.,
• second of day: [0, 24*3600)
• hour of day: [0, 24)
• 4:33am -> [0, 0, 0, 0, 1, 0, …]
• density: 1/#buckets
9
Sparsity is almost everywhere
The Netflix Prize:
• number of users: 480,189
• number of movies: 17,770
• number of observed ratings: 100,480,507
• density = 1.17%
10
Sparsity is almost everywhere
rcv1.binary (test):
• number of examples: 677,399
• number of features: 47,236
• density: 0.15%
• storage: 270GB (dense) or 600MB (sparse)
11
Sparsity in a broader sense
real-world data = + noise
sparse
— or —
low-rank
12
A dense image …
13
… is sparse under wavelet basis
14
A huge rating matrix …
Amazon reviews:
• Number of users: 6,643,669
• Number of products: 2,441,053
• Number of reviews: 34,686,770
15
… is approximately low-rank
16
Exploiting sparsity
• As a user
• recognize sparsity
• As a developer
• utilize sparsity
17
Exploiting sparsity
• In Spark v1.0, MLlib adds support for sparse input data
in Scala, Java, and Python.
• MLlib takes advantage of sparsity in both storage and
computation in
• collaborative filtering,
• linear methods (linear SVM, logistic regression, etc),
• naive Bayes,
• k-means,
• summary statistics,
• singular value decomposition.
18
Open-source linear algebra packages
We benchmarked several JVM-based open-source
linear algebra packages on the operations we need to
implement sparse data support.
• breeze
• matrix-toolkits-java
• mahout-math
• commons-math3
• jblas
19
Sparse representation of features
20
dense : 1. 0. 0. 0. 0. 0. 3.
sparse :
8
><
>:
size : 7
indices : 0 6
values : 1. 3.
Create/save/load sparse data
Create a sparse vector representing [1., 0., 3.]
• in Scala: Vectors.sparse(3, Array(0, 2), Array(1., 3.))
• in Java: Vectors.sparse(3, new int[] {0, 2}, new double[] {1., 3.})
• in Python: Vectors.sparse(3, [0, 2], [1., 3.])
Create a labeled point with a sparse feature vector
• Scala/Java/Python: LabeledPoint(label, sparseVector)
21
Create/save/load sparse data
Save a sparse training data set (RDD[LabeledPoint])
• in LIBSVM format
• MLUtils.saveAsLibSVMFile(rdd, dir) -> 1 1:1.0 3:3.0
• in MLlib’s format (v1.1)
• rdd.saveAsTextFile(dir) -> (3,[0,2],[1.,3.])
22
Create/save/load sparse data
Load a sparse training dataset
• in LIBSVM format
• MLUtils.loadLibSVMFile(sc, path)
• in MLlib’s format (v1.1)
• MLUtils.loadLabeledData(sc, path)
23
O(nnz)
Exploiting sparsity in k-means
Training set:
• number of examples: 12 million
• number of features: 500
• density: 10%
!
Not only did we save 40GB of storage by switching to the sparse
format, but we also received a 4x speedup.
dense sparse
storage 47GB 7GB
time 240s 58s
25
Implementation of sparse k-means
Algorithm:
• For each point, find its closest center.



• Update cluster centers.
li = arg min
j
kxi cjk2
2
cj =
P
i,li=j xj
P
i,li=j 1
26
Implementation of sparse k-means
The points are usually sparse, but the centers are most likely to be
dense. Computing the distance takes O(d) time. So the time complexity
is O(n d k) per iteration. We don’t take any advantage of sparsity on the
running time. However, we have
kx ck2
2 = kxk2
2 + kck2
2 2hx, ci
Computing the inner product only needs non-zero elements. So we
can cache the norms of the points and of the centers, and then only
need the inner products to obtain the distances. This reduce the
running time to O(nnz k + d k) per iteration.
27
Exploiting sparsity in linear methods
• The essential part of the computation in a gradient-
based method is computing the sum of gradients.
• For linear methods, the gradient is sparse if the
feature vector is sparse.
28
Exploiting sparsity in linear methods
Proposal:
g = points.map(p => grad(w, p)).reduce(_ + _)
Cons:
• Creating many small objects.
• Adding sparse vectors.
29
Exploiting sparsity in linear methods
MLlib’s implementation
• does not add sparse vectors together,
• instead, adds sparse vectors directly to a dense
vector for each partition and then computes the sum
• fast random access
• no temporary object creation
30
Exploiting sparsity in summary statistics
Multivariate summary statistics:
• count / mean / max / min / nnz / variance
MLlib’s implementation
• computes the variance accurately in a single pass,
• ignores the zero values during the pass.
31
• Top singular values can be computed via the
eigenvalue decomposition of the Gram matrix.
!
• Lanczos algorithm computes eigenvalue
decomposition, which only needs a routine that
multiplies a matrix with a vector.
Singular value decomposition
(AT
A)v =
X
i
(aT
i v)ai
j(A) =
q
j(AT A)
32
Exploiting sparsity in SVD
!
• v is dense in general, while {a_i} are sparse.
• The inner product can be computed quickly.
• Computing the sum is the same as in linear methods.
(AT
A)v =
X
i
(aT
i v)ai
33
Make the right choice
Sparse vs. dense
• Storage
• sparse format: 12 nnz + 4 bytes.
• dense format: 8n bytes
• Speed
• problem dependent
Pick algorithms that fit your data
36
ALS vs. SVD
Both algorithms compute low-rank matrix factorizations.
• ALS
• is scalable on both directions
• ignores unobserved entries (explicit feedback)
• SVD
• is scalable on one direction
• treats unobserved entries as zeros
37
Acknowledgement
• David Hall (breeze)
• Sam Halliday (netlib-java)
• Xusen Yin (summary statistics)
• Li Pu (SVD via Lanczos)
• Tor Myklebust (NMF)
38
Summary
• Real-world data = sparse/low-rank + noise
• MLlib supports sparse data in
• linear methods / naive Bayes / k-means / collaborative filtering /
summary statistics / singular value decomposition
• As a user, recognize sparsity.
• As a developer, utilize sparsity.
39
Thank You!

More Related Content

What's hot

Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...CloudxLab
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Sparkdatamantra
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark Summit
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDatabricks
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With SparkShivaji Dutta
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 

What's hot (20)

Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza Karimi
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With Spark
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 

Viewers also liked

Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2benjaminwootton
 
Spark application on ec2 cluster
Spark application on ec2 clusterSpark application on ec2 cluster
Spark application on ec2 clusterChao-Hsuan Shen
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache SparkDatabricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrDatabricks
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Spark Summit
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
 
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...Spark Summit
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 

Viewers also liked (16)

Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2
 
Spark application on ec2 cluster
Spark application on ec2 clusterSpark application on ec2 cluster
Spark application on ec2 cluster
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 

Similar to MLlib Sparse Data Support

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Spark Summit
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengSpark Summit
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRDatabricks
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Spark Summit
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Dalei Li
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkSpark Summit
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...DB Tsai
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentationehtshamelahi
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Data mining with Weka
Data mining with WekaData mining with Weka
Data mining with WekaAlbanLevy
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsOleg Magazov
 
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...Xavier Llorà
 

Similar to MLlib Sparse Data Support (20)

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Data mining with Weka
Data mining with WekaData mining with Weka
Data mining with Weka
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
 
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
 

Recently uploaded

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 

Recently uploaded (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 

MLlib Sparse Data Support

  • 1. Sparse data support in MLlib Xiangrui Meng
  • 2. Spark MLlib MLlib is an Apache Spark component focusing on machine learning: • initial contribution from AMPLab, UC Berkeley • shipped with Spark since version 0.8 (Sep 2013) • 50 contributors 2
  • 3. Algorithms • classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree • regression: generalized linear models (GLMs), regression tree • collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) • clustering: k-means • decomposition: singular value decomposition (SVD), principal component analysis (PCA) 3
  • 4. What’s new in v1.0 • new user guide and code examples • API stability • sparse data support • regression and classification tree • distributed matrices • tall-and-skinny PCA and SVD • L-BFGS • binary classification model evaluation 4
  • 7. Sparsity is almost everywhere Sparse datasets appear almost everywhere in the world of big data, where the sparsity may come from many sources, e.g., • feature transformation: 
 one-hot encoding, interaction, and bucketing, • large feature space: n-grams, • missing data: rating matrix, • low-rank structure: images and signals. 7
  • 8. One-hot encoding Converts a categorical feature to numerical, e.g., • country: {Germany, Brazil, Argentina, …} • Germany -> 0, Brazil -> 1, Argentina -> 2, … • Germany -> [1, 0, 0, 0, …], Brazil -> [0, 1, 0, 0, …], Argentina -> [0, 0, 1, 0, …], … • density: 1/#categories 8
  • 9. Bucketing Converts a numerical feature to categorical, e.g., • second of day: [0, 24*3600) • hour of day: [0, 24) • 4:33am -> [0, 0, 0, 0, 1, 0, …] • density: 1/#buckets 9
  • 10. Sparsity is almost everywhere The Netflix Prize: • number of users: 480,189 • number of movies: 17,770 • number of observed ratings: 100,480,507 • density = 1.17% 10
  • 11. Sparsity is almost everywhere rcv1.binary (test): • number of examples: 677,399 • number of features: 47,236 • density: 0.15% • storage: 270GB (dense) or 600MB (sparse) 11
  • 12. Sparsity in a broader sense real-world data = + noise sparse — or — low-rank 12
  • 13. A dense image … 13
  • 14. … is sparse under wavelet basis 14
  • 15. A huge rating matrix … Amazon reviews: • Number of users: 6,643,669 • Number of products: 2,441,053 • Number of reviews: 34,686,770 15
  • 16. … is approximately low-rank 16
  • 17. Exploiting sparsity • As a user • recognize sparsity • As a developer • utilize sparsity 17
  • 18. Exploiting sparsity • In Spark v1.0, MLlib adds support for sparse input data in Scala, Java, and Python. • MLlib takes advantage of sparsity in both storage and computation in • collaborative filtering, • linear methods (linear SVM, logistic regression, etc), • naive Bayes, • k-means, • summary statistics, • singular value decomposition. 18
  • 19. Open-source linear algebra packages We benchmarked several JVM-based open-source linear algebra packages on the operations we need to implement sparse data support. • breeze • matrix-toolkits-java • mahout-math • commons-math3 • jblas 19
  • 20. Sparse representation of features 20 dense : 1. 0. 0. 0. 0. 0. 3. sparse : 8 >< >: size : 7 indices : 0 6 values : 1. 3.
  • 21. Create/save/load sparse data Create a sparse vector representing [1., 0., 3.] • in Scala: Vectors.sparse(3, Array(0, 2), Array(1., 3.)) • in Java: Vectors.sparse(3, new int[] {0, 2}, new double[] {1., 3.}) • in Python: Vectors.sparse(3, [0, 2], [1., 3.]) Create a labeled point with a sparse feature vector • Scala/Java/Python: LabeledPoint(label, sparseVector) 21
  • 22. Create/save/load sparse data Save a sparse training data set (RDD[LabeledPoint]) • in LIBSVM format • MLUtils.saveAsLibSVMFile(rdd, dir) -> 1 1:1.0 3:3.0 • in MLlib’s format (v1.1) • rdd.saveAsTextFile(dir) -> (3,[0,2],[1.,3.]) 22
  • 23. Create/save/load sparse data Load a sparse training dataset • in LIBSVM format • MLUtils.loadLibSVMFile(sc, path) • in MLlib’s format (v1.1) • MLUtils.loadLabeledData(sc, path) 23
  • 25. Exploiting sparsity in k-means Training set: • number of examples: 12 million • number of features: 500 • density: 10% ! Not only did we save 40GB of storage by switching to the sparse format, but we also received a 4x speedup. dense sparse storage 47GB 7GB time 240s 58s 25
  • 26. Implementation of sparse k-means Algorithm: • For each point, find its closest center.
 
 • Update cluster centers. li = arg min j kxi cjk2 2 cj = P i,li=j xj P i,li=j 1 26
  • 27. Implementation of sparse k-means The points are usually sparse, but the centers are most likely to be dense. Computing the distance takes O(d) time. So the time complexity is O(n d k) per iteration. We don’t take any advantage of sparsity on the running time. However, we have kx ck2 2 = kxk2 2 + kck2 2 2hx, ci Computing the inner product only needs non-zero elements. So we can cache the norms of the points and of the centers, and then only need the inner products to obtain the distances. This reduce the running time to O(nnz k + d k) per iteration. 27
  • 28. Exploiting sparsity in linear methods • The essential part of the computation in a gradient- based method is computing the sum of gradients. • For linear methods, the gradient is sparse if the feature vector is sparse. 28
  • 29. Exploiting sparsity in linear methods Proposal: g = points.map(p => grad(w, p)).reduce(_ + _) Cons: • Creating many small objects. • Adding sparse vectors. 29
  • 30. Exploiting sparsity in linear methods MLlib’s implementation • does not add sparse vectors together, • instead, adds sparse vectors directly to a dense vector for each partition and then computes the sum • fast random access • no temporary object creation 30
  • 31. Exploiting sparsity in summary statistics Multivariate summary statistics: • count / mean / max / min / nnz / variance MLlib’s implementation • computes the variance accurately in a single pass, • ignores the zero values during the pass. 31
  • 32. • Top singular values can be computed via the eigenvalue decomposition of the Gram matrix. ! • Lanczos algorithm computes eigenvalue decomposition, which only needs a routine that multiplies a matrix with a vector. Singular value decomposition (AT A)v = X i (aT i v)ai j(A) = q j(AT A) 32
  • 33. Exploiting sparsity in SVD ! • v is dense in general, while {a_i} are sparse. • The inner product can be computed quickly. • Computing the sum is the same as in linear methods. (AT A)v = X i (aT i v)ai 33
  • 34. Make the right choice
  • 35. Sparse vs. dense • Storage • sparse format: 12 nnz + 4 bytes. • dense format: 8n bytes • Speed • problem dependent
  • 36. Pick algorithms that fit your data 36
  • 37. ALS vs. SVD Both algorithms compute low-rank matrix factorizations. • ALS • is scalable on both directions • ignores unobserved entries (explicit feedback) • SVD • is scalable on one direction • treats unobserved entries as zeros 37
  • 38. Acknowledgement • David Hall (breeze) • Sam Halliday (netlib-java) • Xusen Yin (summary statistics) • Li Pu (SVD via Lanczos) • Tor Myklebust (NMF) 38
  • 39. Summary • Real-world data = sparse/low-rank + noise • MLlib supports sparse data in • linear methods / naive Bayes / k-means / collaborative filtering / summary statistics / singular value decomposition • As a user, recognize sparsity. • As a developer, utilize sparsity. 39