Inside Apache SystemML
Fred Reiss
Chief Architect, IBM Spark Technology Center
Member of the IBM Academy of Technology
Origins of the SystemML Project
•  2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
•  2009: We create a dedicated team for scalable ML.
•  2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms.
State-of-the-Art: Small Data
[Diagram: a Data Scientist works in R or Python on a Personal Computer, turning Data into Results.]
State-of-the-Art: Big Data
[Diagram: a Data Scientist writes the algorithm in R or Python; a Systems Programmer reimplements it in Scala to produce Results.]
😞 Days or weeks per iteration
😞 Errors while translating algorithms
The SystemML Vision
[Diagram: a Data Scientist writes the algorithm in R or Python; SystemML runs it and produces Results.]
😃 Fast iteration
😃 Same answer
Running Example: Alternating Least Squares
•  Problem: Recommend products to customers
•  Input: a sparse Customers × Products matrix, where a nonzero entry at (i, j) means customer i bought product j.
•  Multiply the Customers factor and the Products factor to produce a less-sparse matrix.
•  New nonzero values become product suggestions.
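As a rough illustration of the idea on this slide, the sketch below multiplies two small factors and reports cells that were zero in the purchase matrix but score highly in the product. The matrices, the rank-1 factors, and the 0.5 threshold are all made up for the example; a real run would learn the factors with ALS.

```python
# Toy purchase matrix: rows are customers, columns are products.
# A 1 means "customer i bought product j". All values are made up.
purchases = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]

# Hypothetical rank-1 factors; a real run would learn these with ALS.
customers_factor = [[0.9], [0.8], [0.5]]   # 3 customers x 1
products_factor = [[0.8, 0.9, 0.6]]        # 1 x 3 products


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def suggestions(original, approx, threshold=0.5):
    """Return (customer, product) pairs that are zero in the original
    matrix but score at least `threshold` in the approximation."""
    return [(i, j)
            for i, row in enumerate(original)
            for j, value in enumerate(row)
            if value == 0 and approx[i][j] >= threshold]


approx = matmul(customers_factor, products_factor)
print(suggestions(purchases, approx))
```

The product of the two factors is fully dense, so every empty cell of the original matrix receives a score; the threshold simply decides which of those scores are worth surfacing as suggestions.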
Alternating Least Squares (in R)
# Assumes X, W, r, lambda, i, mi, mii, and is_U are defined before this excerpt.
# Start with random factors.
U = rand(nrow(X), r, min = -1.0, max = 1.0);
V = rand(r, ncol(X), min = -1.0, max = 1.0);
while(i < mi) {
   i = i + 1; ii = 1;
   # Gradient with respect to the factor being updated (U or V).
   if (is_U)
      G = (W * (U %*% V - X)) %*% t(V) + lambda * U;
   else
      G = t(U) %*% (W * (U %*% V - X)) + lambda * V;
   norm_G2 = sum(G ^ 2); norm_R2 = norm_G2;
   R = -G; S = R;
   # Inner loop: conjugate gradient steps on the current factor.
   while(norm_R2 > 10E-9 * norm_G2 & ii <= mii) {
      if (is_U) {
         HS = (W * (S %*% V)) %*% t(V) + lambda * S;
         alpha = norm_R2 / sum (S * HS);
         U = U + alpha * S;
      } else {
         HS = t(U) %*% (W * (U %*% S)) + lambda * S;
         alpha = norm_R2 / sum (S * HS);
         V = V + alpha * S;
      }
      R = R - alpha * HS;
      old_norm_R2 = norm_R2; norm_R2 = sum(R ^ 2);
      S = R + (norm_R2 / old_norm_R2) * S;
      ii = ii + 1;
   }
   # Switch to the other factor.
   is_U = ! is_U;
}
Alternating Least Squares (in R)
1.  Start with random factors.
2.  Hold the Products factor constant and find the best value for the Customers factor (the value that most closely approximates the original matrix).
3.  Hold the Customers factor constant and find the best value for the Products factor.
4.  Repeat steps 2-3 until convergence.
[Slide callouts numbered 1-4 mark the lines of the script above that implement each step.]
Every line has a clear purpose!
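The four steps can be sketched in a few lines of Python for the simplest possible case. This is a deliberately reduced variant, not the script from the slides: it uses rank-1 factors and closed-form per-row updates instead of the conjugate gradient inner loop, and the names `als_rank1`, `loss`, `X`, and `W` (a 0/1 mask of observed cells) are chosen here for illustration.

```python
import random


def loss(X, W, u, v, lam):
    """Regularized squared error over observed cells (W[i][j] == 1)."""
    err = sum(W[i][j] * (u[i] * v[j] - X[i][j]) ** 2
              for i in range(len(u)) for j in range(len(v)))
    reg = lam * (sum(a * a for a in u) + sum(b * b for b in v))
    return err + reg


def als_rank1(X, W, lam=0.1, max_iter=20, seed=0):
    """Rank-1 alternating least squares with closed-form updates."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Step 1: start with random factors.
    u = [rng.uniform(-1.0, 1.0) for _ in range(m)]   # customers
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # products
    history = [loss(X, W, u, v, lam)]
    for _ in range(max_iter):
        # Step 2: hold the products factor v constant, solve for u.
        for i in range(m):
            num = sum(W[i][j] * X[i][j] * v[j] for j in range(n))
            den = lam + sum(W[i][j] * v[j] ** 2 for j in range(n))
            u[i] = num / den
        # Step 3: hold the customers factor u constant, solve for v.
        for j in range(n):
            num = sum(W[i][j] * X[i][j] * u[i] for i in range(m))
            den = lam + sum(W[i][j] * u[i] ** 2 for i in range(m))
            v[j] = num / den
        # Step 4: repeat; record the loss to watch convergence.
        history.append(loss(X, W, u, v, lam))
    return u, v, history


X = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
W = [[1 if x != 0 else 0 for x in row] for row in X]
u, v, history = als_rank1(X, W)
print(history[0], history[-1])
```

Each half-step exactly minimizes the loss over one factor while the other is fixed, so the recorded loss never increases from one iteration to the next. A fixed iteration count stands in for a real convergence test.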
Alternating Least Squares (spark.ml)
•  25 lines’ worth of algorithm…
•  …mixed with 800 lines of performance code
Alternating Least Squares (in SystemML’s subset of R)
[The slide shows the same ALS script as above.]
•  SystemML can compile and run this algorithm at scale
•  No additional performance code needed!
How fast does it run?
Running time comparisons between machine learning algorithms are problematic
–  Different, equally-valid answers
–  Different convergence rates on different data
–  But we’ll do one anyway
Performance Comparison: ALS
[Bar chart: running time in seconds (axis from 0 to 20,000) for R, MLlib, and SystemML on three input sizes: 1.2GB (sparse binary), 12GB, and 120GB. Some bars are marked ">24h" or "OOM".]
Details: Synthetic data, 0.01 sparsity, 10^5 products × {10^5,10^6,10^7} users. Data generated by multiplying two rank-50 matrices of normally-distributed data, sampling from the resulting product, then adding Gaussian noise. Cluster of 6 servers with 12 cores and 96GB of memory per server. Number of iterations tuned so that all algorithms produce comparable result quality.
Takeaway Points
•  SystemML runs the R script in parallel
– Same answer as original R script
– Performance is comparable to a low-level RDD-based implementation
•  How does SystemML achieve this result?
Several factors at play
•  Subtly different algorithms
•  Adaptive execution strategies
•  Runtime differences
Questions We’ll Focus On
[Performance chart repeated, with a callout: SystemML runs no distributed jobs here.]
•  How does SystemML know it’s better to run on one machine?
•  Why is SystemML so much faster than single-node R?
[Diagram: the SystemML Optimizer turns a High-Level Algorithm into a Parallel Spark Program.]
The SystemML Optimizer Stack
Abstract Syntax Tree Layers
[Input: the ALS script shown above.]
•  Parsing
•  Live variable analysis
•  Validation
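The slides name live variable analysis without showing it. As a hedged illustration only, here is the textbook backward pass for straight-line code, applied to three statements from the ALS script. The `(defined, used)` statement encoding and the function name are inventions for this sketch; SystemML's own analysis also has to handle loops and branches and is not reproduced here.

```python
def live_variables(statements, live_out=frozenset()):
    """Backward pass over straight-line code.

    `statements` is a list of (defined_name, used_names) pairs.
    Returns the set of variables live *before* each statement.
    """
    live = set(live_out)
    live_before = []
    for defined, used in reversed(statements):
        # A definition kills the old value; uses make variables live.
        live.discard(defined)
        live.update(used)
        live_before.append(frozenset(live))
    live_before.reverse()
    return live_before


# Three statements from the "else" branch of the ALS inner loop.
block = [
    # HS = t(U) %*% (W * (U %*% S)) + lambda * S
    ("HS", {"U", "W", "S", "lambda"}),
    # alpha = norm_R2 / sum(S * HS)
    ("alpha", {"norm_R2", "S", "HS"}),
    # V = V + alpha * S
    ("V", {"V", "alpha", "S"}),
]

for before in live_variables(block, live_out={"V"}):
    print(sorted(before))
```

Knowing which variables are live at each point tells a compiler which inputs a block of statements needs and which intermediate values can be dropped once they are no longer read.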
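The SystemML Optimizer Stack
High-Level Operations Layers
HS = t(U) %*% (W * (U %*% S)) + lambda * S;
alpha = norm_R2 / sum (S * HS);
V = V + alpha * S;
•  Construct graph of High-Level Operations (HOPs)
[Diagram: HOP graph for the first statement. U %*% S is multiplied element-wise by W, then left-multiplied by t(U); lambda * S is added; the result feeds write(HS).]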
The SystemML Optimizer Stack
High-Level Operations Layers
HS = t(U) %*% (W * (U %*% S))
•  Construct HOPs
•  Propagate statistics
•  Determine distributed operations
[Diagram: the HOP graph annotated with size estimates: one 1.2GB sparse, two 80GB dense, and four 800MB dense.]
All operands fit into heap → use one node
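A back-of-the-envelope version of this decision might look like the sketch below. It is not SystemML's cost model: the 12 bytes per sparse nonzero (an 8-byte value plus a 4-byte index), the function names, and the 10GB heap budget are all assumptions made for illustration. With those assumptions, the estimates for a 10^5 × 10^5 matrix happen to match the 1.2GB sparse and 80GB dense figures on the slide.

```python
DOUBLE_BYTES = 8          # one dense cell
SPARSE_CELL_BYTES = 12    # assumed: 8-byte value + 4-byte column index


def estimate_bytes(rows, cols, sparsity=1.0):
    """Rough in-memory size; picks the smaller of dense and sparse."""
    dense = rows * cols * DOUBLE_BYTES
    sparse = round(rows * cols * sparsity) * SPARSE_CELL_BYTES
    return min(dense, sparse)


def choose_execution(operand_bytes, heap_budget_bytes):
    """Run on one node only if every operand fits in the budget."""
    if all(size <= heap_budget_bytes for size in operand_bytes):
        return "single-node"
    return "distributed"


# 10^5 x 10^5 input at 0.01 sparsity, as in the benchmark description.
w_bytes = estimate_bytes(10**5, 10**5, sparsity=0.01)
# The same shape stored densely.
dense_bytes = estimate_bytes(10**5, 10**5)
# 800MB operands are taken from the slide; the budget is hypothetical.
operands = [w_bytes, 800_000_000, 800_000_000]
print(w_bytes, dense_bytes, choose_execution(operands, 10 * 10**9))
```

The point of propagating statistics through the operator graph is that estimates like these are available before anything runs, so the choice between one node and the cluster can be made at compile time.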
The SystemML Optimizer Stack
High-Level Operations Layers
•  Construct HOPs
•  Propagate stats
•  Determine distributed operations
•  Rewrites
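Example Rewrite: wdivmm
t(U) %*% (W * (U %*% S))
[Diagram: U × S is multiplied element-wise by W, and t(U) is multiplied by the result: t(U) × (W * (U × S)).]
•  U × S is a large dense intermediate.
•  The result can be computed directly from U, S, and W!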
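To see why the dense intermediate is avoidable, the sketch below computes the same expression two ways: once by materializing U × S in full, and once by visiting only the nonzero cells of W. This is an illustration of the idea, not SystemML's WDivMM kernel; the function names and the dictionary-of-nonzeros representation of W are assumptions made here.

```python
def naive(U, W_dense, S):
    """Materializes the full dense product U %*% S first."""
    m, r, n = len(U), len(S), len(S[0])
    US = [[sum(U[i][k] * S[k][j] for k in range(r)) for j in range(n)]
          for i in range(m)]
    masked = [[W_dense[i][j] * US[i][j] for j in range(n)]
              for i in range(m)]
    return [[sum(U[i][k] * masked[i][j] for i in range(m))
             for j in range(n)]
            for k in range(r)]


def fused(U, W_nonzeros, S):
    """Visits only the nonzero cells of W; no m x n intermediate."""
    r, n = len(S), len(S[0])
    out = [[0.0] * n for _ in range(r)]
    for (i, j), w in W_nonzeros.items():
        # One cell of U %*% S, computed on demand and scaled by W[i][j].
        cell = w * sum(U[i][k] * S[k][j] for k in range(r))
        for k in range(r):
            out[k][j] += U[i][k] * cell
    return out


U = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]          # 3 x 2
S = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]            # 2 x 3
W_nonzeros = {(0, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
W_dense = [[W_nonzeros.get((i, j), 0.0) for j in range(3)]
           for i in range(3)]

print(naive(U, W_dense, S))
print(fused(U, W_nonzeros, S))
```

The fused version does work proportional to the number of nonzeros in W times the rank, and its only output-sized allocation is the small r × n result, which is what makes the 80GB intermediates disappear from the plan.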
The SystemML Optimizer Stack
High-Level Operations Layers
HS = t(U) %*% (W * (U %*% S))
•  Construct HOPs
•  Propagate stats
•  Determine distributed operations
•  Rewrites
[Diagram: a single wdivmm HOP over W, U, and S. Size estimates: one 1.2GB sparse and three 800MB dense.]
The SystemML Optimizer Stack
Low-Level Operations Layers
HS = t(U) %*% (W * (U %*% S))
•  Convert HOPs to Low-Level Operations (LOPs)
•  Generate runtime instructions
[Diagram: the wdivmm HOP over W, U, and S becomes a Single-Node WDivMM LOP, whose instructions go to the SystemML Runtime.]
The SystemML Runtime for Spark
•  Automates critical performance decisions
– Distributed or local computation?
– How to partition the data?
– To persist or not to persist?
The SystemML Runtime for Spark
•  Distributed vs local: Hybrid runtime
– Multithreaded computation in Spark Driver
– Distributed computation in Spark Executors
– Optimizer makes a cost-based choice
The SystemML Runtime for Spark
Efficient Linear Algebra
•  Binary block matrices (JavaPairRDD<MatrixIndexes, MatrixBlock>)
•  Adaptive block storage formats: Dense, Sparse, Ultra-Sparse, Empty
•  Efficient kernels for all combinations of block types
Automated RDD Caching
•  Lineage tracking for RDDs/broadcasts
•  Guarded RDD collect/parallelize
•  Partitioned Broadcast variables
[Figures: Logical Blocking (w/ Bc=1,000); Physical Blocking and Partitioning (w/ Bc=1,000)]
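The blocking scheme can be pictured with a small sketch that groups the nonzero cells of a matrix into square blocks keyed by block index, loosely mirroring the (MatrixIndexes, MatrixBlock) pairs named on the slide, and then picks a storage format per block. The block size of 1,000 comes from the figure captions. Everything else is assumed for illustration: the 0-based indices, the function names, and especially the sparsity thresholds, which are placeholders rather than SystemML's actual values.

```python
def to_blocks(cells, block_size):
    """Group nonzero cells {(row, col): value} by (block_row, block_col).

    Indices are 0-based here for simplicity.
    """
    blocks = {}
    for (row, col), value in cells.items():
        key = (row // block_size, col // block_size)
        local = (row % block_size, col % block_size)
        blocks.setdefault(key, {})[local] = value
    return blocks


def block_format(nonzeros, block_rows, block_cols,
                 sparse_below=0.4, ultra_sparse_below=0.0001):
    """Pick a storage format from the block's sparsity.

    Both thresholds are hypothetical placeholders.
    """
    if nonzeros == 0:
        return "Empty"
    sparsity = nonzeros / (block_rows * block_cols)
    if sparsity < ultra_sparse_below:
        return "Ultra-Sparse"
    if sparsity < sparse_below:
        return "Sparse"
    return "Dense"


cells = {(0, 0): 1.0, (999, 999): 2.0, (1000, 0): 3.0, (2500, 1999): 4.0}
blocks = to_blocks(cells, block_size=1000)
for key in sorted(blocks):
    print(key, block_format(len(blocks[key]), 1000, 1000))
```

Keying blocks by their position makes each block an independent unit that can be partitioned across a cluster, while choosing the format per block lets dense and sparse regions of the same matrix each use a layout that suits them.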
Recap
Questions
•  How does SystemML know it’s better to run on one machine?
•  Why is SystemML so much faster than single-node R?
Answers
•  Live variable analysis
•  Propagation of statistics
•  Advanced rewrites
•  Efficient runtime
But wait, there’s more!
•  Many other rewrites
•  Cost-based selection of physical operators
•  Dynamic recompilation for accurate stats
•  Parallel FOR (ParFor) optimizer
•  Direct operations on RDD partitions
•  YARN and MapReduce support
Open-Sourcing SystemML
•  SystemML is open source!
–  Announced in June 2015
–  Available on GitHub since September 1
–  First open-source binary release (0.8.0) in October 2015
–  Entered Apache incubation in November 2015
–  First Apache open-source binary release (0.9) available now
•  We are actively seeking contributors and users!
http://systemml.apache.org/
THANK YOU.
For more information, go to
http://systemml.apache.org/

Amazon Web Services
 
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Massimo Gaetano Panunzio
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
Justin Basilico
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Hirofumi Iwasaki
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Amazon Web Services
 

Similar to Inside Apache SystemML by Frederick Reiss (20)

System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 



Inside Apache SystemML by Frederick Reiss

  • 1. Inside Apache SystemML Fred Reiss Chief Architect, IBM Spark Technology Center Member of the IBM Academy of Technology
  • 2. •  2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop. •  2009: We create a dedicated team for scalable ML. •  2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms. Origins of the SystemML Project
  • 3. State-of-the-Art: Small Data R or Python Data Scientist Personal Computer Data Results
  • 4. State-of-the-Art: Big Data R or Python Data Scientist Results Systems Programmer Scala
  • 5. State-of-the-Art: Big Data R or Python Data Scientist Results Systems Programmer Scala 😞 Days or weeks per iteration 😞 Errors while translating algorithms
  • 6. The SystemML Vision R or Python Data Scientist Results SystemML
  • 7. The SystemML Vision R or Python Data Scientist Results SystemML 😃 Fast iteration 😃 Same answer
  • 8. Running Example: Alternating Least Squares •  Problem: Recommend products to customers. [Diagram: a sparse Customers × Products matrix, in which entry (i, j) means customer i bought product j, is approximated by the product of a Customers factor and a Products factor.] Multiply these two factors to produce a less-sparse matrix. New nonzero values become product suggestions.
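As a rough illustration of the idea on this slide, the sketch below multiplies a Customers factor by a Products factor and reports the cells that were zero in the purchase matrix but are nonzero in the product. It is plain Python rather than the deck's R dialect, and the function name `suggest`, the `threshold` parameter, and the toy data are assumptions made for illustration, not part of the talk.

```python
def matmul(A, B):
    """Dense matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def suggest(X, customers_factor, products_factor, threshold=0.5):
    """Return (customer, product) pairs that are zero in X but whose
    reconstructed value exceeds `threshold` (a hypothetical cutoff)."""
    approx = matmul(customers_factor, products_factor)
    return [
        (i, j)
        for i, row in enumerate(X)
        for j, bought in enumerate(row)
        if bought == 0 and approx[i][j] > threshold
    ]


# Toy data: 3 customers x 3 products, with rank-1 factors chosen by hand.
X = [[1, 0, 0],
     [1, 1, 0],
     [0, 0, 1]]
customers_factor = [[1.0], [1.0], [0.0]]
products_factor = [[1.0, 1.0, 0.0]]
suggestions = suggest(X, customers_factor, products_factor)
```

In practice the factors would come from a fitting procedure such as ALS rather than being written by hand.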
  • 9. Alternating Least Squares (in R)
      U = rand(nrow(X), r, min = -1.0, max = 1.0);
      V = rand(r, ncol(X), min = -1.0, max = 1.0);
      while(i < mi) {
         i = i + 1; ii = 1;
         if (is_U)
            G = (W * (U %*% V - X)) %*% t(V) + lambda * U;
         else
            G = t(U) %*% (W * (U %*% V - X)) + lambda * V;
         norm_G2 = sum(G ^ 2); norm_R2 = norm_G2;
         R = -G; S = R;
         while(norm_R2 > 10E-9 * norm_G2 & ii <= mii) {
           if (is_U) {
             HS = (W * (S %*% V)) %*% t(V) + lambda * S;
             alpha = norm_R2 / sum (S * HS);
             U = U + alpha * S;
           } else {
             HS = t(U) %*% (W * (U %*% S)) + lambda * S;
             alpha = norm_R2 / sum (S * HS);
             V = V + alpha * S;
           }
           R = R - alpha * HS;
           old_norm_R2 = norm_R2; norm_R2 = sum(R ^ 2);
           S = R + (norm_R2 / old_norm_R2) * S;
           ii = ii + 1;
         }
         is_U = ! is_U;
      }
  • 10. Alternating Least Squares (in R): the same script as on slide 9, annotated with its steps. 1.  Start with random factors. 2.  Hold the Products factor constant and find the best value for the Customers factor (the value that most closely approximates the original matrix). 3.  Hold the Customers factor constant and find the best value for the Products factor. 4.  Repeat steps 2-3 until convergence. Numbered callouts on the slide (1, 2, 2, 3, 3, 4, 4, 4) tie lines of the script to these steps. Every line has a clear purpose!
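The four steps above can be sketched in plain Python by following the slide's script line by line. This is a minimal single-machine sketch, not SystemML's implementation: the helper names, the default values for `lam`, `max_outer`, and `max_inner`, the `loss` function, and the choice of `W` as a 0/1 mask of observed cells are all assumptions made for illustration.

```python
import random


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def t(A):
    return [list(col) for col in zip(*A)]


def ew(op, A, B):
    """Element-wise combination of two equally shaped matrices."""
    return [[op(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def add(A, B):
    return ew(lambda a, b: a + b, A, B)


def sub(A, B):
    return ew(lambda a, b: a - b, A, B)


def mul(A, B):
    return ew(lambda a, b: a * b, A, B)


def scale(c, A):
    return [[c * a for a in row] for row in A]


def total(A):
    return sum(sum(row) for row in A)


def loss(X, W, U, V, lam):
    """Weighted squared error plus L2 penalty (assumed objective)."""
    E = sub(matmul(U, V), X)
    return 0.5 * total(mul(W, mul(E, E))) + 0.5 * lam * (total(mul(U, U)) + total(mul(V, V)))


def als_cg(X, W, r, lam=0.1, max_outer=20, max_inner=10, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Step 1: start with random factors.
    U = [[rng.uniform(-1.0, 1.0) for _ in range(r)] for _ in range(m)]
    V = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(r)]
    is_U = True
    for _ in range(max_outer):  # Step 4: repeat steps 2-3.
        E = mul(W, sub(matmul(U, V), X))
        # Gradient with respect to the factor currently being updated.
        G = add(matmul(E, t(V)), scale(lam, U)) if is_U else add(matmul(t(U), E), scale(lam, V))
        norm_G2 = total(mul(G, G))
        norm_R2 = norm_G2
        R = scale(-1.0, G)
        S = R
        ii = 1
        # Conjugate-gradient inner loop for the factor being updated.
        while norm_R2 > 1e-8 * norm_G2 and ii <= max_inner:
            if is_U:  # Step 2: V held constant, improve U.
                HS = add(matmul(mul(W, matmul(S, V)), t(V)), scale(lam, S))
            else:     # Step 3: U held constant, improve V.
                HS = add(matmul(t(U), mul(W, matmul(U, S))), scale(lam, S))
            alpha = norm_R2 / total(mul(S, HS))
            if is_U:
                U = add(U, scale(alpha, S))
            else:
                V = add(V, scale(alpha, S))
            R = sub(R, scale(alpha, HS))
            old_norm_R2 = norm_R2
            norm_R2 = total(mul(R, R))
            S = add(R, scale(norm_R2 / old_norm_R2, S))
            ii += 1
        is_U = not is_U
    return U, V
```

Calling `als_cg(X, W, r=2)` on a small ratings matrix returns the two factors; passing `max_outer=0` returns the random starting point, which is convenient for comparing `loss` before and after fitting.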
  • 15. •  25 lines’ worth of algorithm… •  …mixed with 800 lines of performance code
  • 16. Alternating Least Squares (in R): the same script as on slide 9.
  • 17. Alternating Least Squares (in SystemML’s subset of R): the same script as on slide 9. •  SystemML can compile and run this algorithm at scale •  No additional performance code needed!
  • 18. How fast does it run? Running time comparisons between machine learning algorithms are problematic –  Different, equally-valid answers –  Different convergence rates on different data –  But we’ll do one anyway
  • 19. Performance Comparison: ALS [Bar chart: running time in seconds (axis 0 to 20,000) for R, MLlib, and SystemML on inputs of 1.2GB (sparse binary), 12GB, and 120GB; two bars are labeled ">24h" and two "OOM".] Details: Synthetic data, 0.01 sparsity, 10^5 products × {10^5, 10^6, 10^7} users. Data generated by multiplying two rank-50 matrices of normally-distributed data, sampling from the resulting product, then adding Gaussian noise. Cluster of 6 servers with 12 cores and 96GB of memory per server. Number of iterations tuned so that all algorithms produce comparable result quality.
  • 20. Takeaway Points •  SystemML runs the R script in parallel – Same answer as original R script – Performance is comparable to a low-level RDD-based implementation •  How does SystemML achieve this result?
  • 21. Performance Comparison: ALS [Same chart and details as slide 19.] Several factors at play: •  Subtly different algorithms •  Adaptive execution strategies •  Runtime differences
  • 22. Questions We’ll Focus On [Same chart as slide 19, annotated: “SystemML runs no distributed jobs here.”] •  How does SystemML know it’s better to run on one machine? •  Why is SystemML so much faster than single-node R?
  • 28. The SystemML Optimizer Stack: Abstract Syntax Tree Layers [Input: the ALS script from slide 10.] •  Parsing •  Live variable analysis •  Validation
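As a rough illustration of the "live variable analysis" bullet, the following Python sketch runs a textbook backward liveness pass over the V-branch of the inner loop of the ALS script. The statement encoding, the `live_in_sets` helper, and the choice of `live_out` are assumptions made for this example; the slides do not show how SystemML implements this step.

```python
# Each statement is (variables defined, variables used), taken from the
# V-branch of the inner while loop of the ALS script.
statements = [
    ({"HS"}, {"U", "W", "S", "lambda"}),           # HS = t(U) %*% (W * (U %*% S)) + lambda * S
    ({"alpha"}, {"norm_R2", "S", "HS"}),            # alpha = norm_R2 / sum(S * HS)
    ({"V"}, {"V", "alpha", "S"}),                   # V = V + alpha * S
    ({"R"}, {"R", "alpha", "HS"}),                  # R = R - alpha * HS
    ({"old_norm_R2"}, {"norm_R2"}),                 # old_norm_R2 = norm_R2
    ({"norm_R2"}, {"R"}),                           # norm_R2 = sum(R ^ 2)
    ({"S"}, {"R", "norm_R2", "old_norm_R2", "S"}),  # S = R + (norm_R2 / old_norm_R2) * S
]

# Assumption: these are the variables the next loop iteration still needs.
live_out = {"U", "V", "W", "S", "R", "norm_R2", "lambda"}


def live_in_sets(stmts, live_after_last):
    """Return, for each statement, the set of variables live just before it."""
    live = set(live_after_last)
    result = []
    for defs, uses in reversed(stmts):
        # Standard dataflow equation: live_in = (live_out - defs) | uses
        live = (live - defs) | uses
        result.append(frozenset(live))
    result.reverse()
    return result


live_in = live_in_sets(statements, live_out)
for index, variables in enumerate(live_in):
    print(index, sorted(variables))
```

In this sketch `HS` and `alpha` are live only between their definition and the update of `R`, so a compiler with this information could release the memory behind them early.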
  • 30. The SystemML Optimizer Stack: High-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) + lambda * S; alpha = norm_R2 / sum (S * HS); V = V + alpha * S; [Diagram: operator graph for the HS statement, with leaves U, S, W, and lambda, operators t(), %*% (twice), * (twice), and +, and a final write(HS) node.] •  Construct graph of High-Level Operations (HOPs)
  • 31. The SystemML Optimizer Stack: High-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) + lambda * S; [Diagram: operator graph with leaves U, S, W, and lambda, operators t(), %*% (twice), * (twice), and +, and a final write(HS) node.] •  Construct HOPs
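The operator graph on this slide can be sketched as a small DAG in Python. The `Hop` class and the `postorder` helper below are hypothetical names chosen for illustration and are not SystemML classes; the point is only that `U` and `S` are shared nodes, so the structure is a graph and not a tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node of the illustrative operator graph (hypothetical class)."""
    op: str
    inputs: tuple = ()


def postorder(root):
    """List operator labels so that every input appears before its consumer."""
    seen, order = set(), []

    def visit(node):
        if id(node) in seen:  # shared inputs such as U and S are visited once
            return
        seen.add(id(node))
        for child in node.inputs:
            visit(child)
        order.append(node.op)

    visit(root)
    return order


# Leaves of HS = t(U) %*% (W * (U %*% S)) + lambda * S
U, S, W, lam = Hop("U"), Hop("S"), Hop("W"), Hop("lambda")

inner_product = Hop("%*%", (U, S))            # U %*% S
weighted = Hop("*", (W, inner_product))       # W * (U %*% S)
left = Hop("%*%", (Hop("t()", (U,)), weighted))
hs = Hop("+", (left, Hop("*", (lam, S))))
root = Hop("write(HS)", (hs,))

ops = postorder(root)
print(ops)
```

The traversal yields eleven nodes, which matches the eleven labels on the slide's diagram.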
  • 32. The SystemML Optimizer Stack: High-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) [Diagram: operator graph for this expression (W, U, S, t(), *, and two %*% nodes), each node annotated with an estimated size: one 1.2GB sparse, two 80GB dense, and four 800MB dense.] •  Construct HOPs •  Propagate statistics •  Determine distributed operations. All operands fit into heap → use one node
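The size annotations and the "fits into heap" decision can be approximated with simple arithmetic. The sketch below is a simplification under stated assumptions: 8 bytes per dense cell, 12 bytes per sparse nonzero (an 8-byte value plus a 4-byte column index), and a hypothetical 96GB budget borrowed from the per-server memory quoted in the benchmark details. The function names are invented for this example, and SystemML's actual estimator is not shown in the slides.

```python
DOUBLE_BYTES = 8
SPARSE_ENTRY_BYTES = 12  # assumption: 8-byte value + 4-byte column index


def dense_bytes(rows, cols):
    """Worst-case size of a dense matrix of doubles."""
    return rows * cols * DOUBLE_BYTES


def sparse_bytes(rows, cols, sparsity):
    """Approximate size of a sparse matrix with the given fraction of nonzeros."""
    nonzeros = round(rows * cols * sparsity)
    return nonzeros * SPARSE_ENTRY_BYTES


def choose_execution(operand_sizes, budget_bytes):
    """Run on one node only if every operand fits the memory budget."""
    if all(size <= budget_bytes for size in operand_sizes):
        return "single-node"
    return "distributed"


GB = 10**9
budget = 96 * GB  # hypothetical budget, not a SystemML default

# 10^5 products x 10^5 users at 0.01 sparsity, as in the smallest benchmark input.
w_size = sparse_bytes(10**5, 10**5, 0.01)
intermediate_size = dense_bytes(10**5, 10**5)

print(w_size / GB, intermediate_size / GB)
print(choose_execution([w_size, intermediate_size], budget))

# Ten times more users makes the dense intermediate ten times larger.
print(choose_execution([dense_bytes(10**5, 10**6)], budget))
```

Under these assumptions the sparse input comes to 1.2GB and the dense 10^5 × 10^5 intermediate to 80GB, which agrees with the labels on the slide.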
  • 33. Questions We’ll Focus On [Same chart as slide 19, annotated: “SystemML runs no distributed jobs here.”] •  How does SystemML know it’s better to run on one machine? •  Why is SystemML so much faster than single-node R?
  • 34. The SystemML Optimizer Stack: High-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) [Diagram: operator graph for this expression (W, U, S, t(), *, and two %*% nodes), each node annotated with an estimated size: one 1.2GB sparse, two 80GB dense, and four 800MB dense.] All operands fit into heap → use one node •  Construct HOPs •  Propagate stats •  Determine distributed operations •  Rewrites
  • 35. Example Rewrite: wdivmm. Expression: t(U) %*% (W * (U %*% S)) [Diagram: U × S is a large dense intermediate, which is then multiplied elementwise by W and left-multiplied by t(U).] Can compute directly from U, S, and W!
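A minimal Python sketch of why the rewrite helps, assuming W is sparse and stored as a dictionary of its nonzero entries. The `naive` version materializes the full U × S product, while the `fused` version visits only the nonzeros of W and never builds that intermediate. Both function names and the data layout are assumptions for illustration; this is not SystemML's wdivmm kernel.

```python
def naive(U, S, W, m, n, r):
    """Compute t(U) %*% (W * (U %*% S)) with an explicit m x n intermediate."""
    product = [[sum(U[i][k] * S[k][j] for k in range(r)) for j in range(n)]
               for i in range(m)]
    weighted = [[W.get((i, j), 0.0) * product[i][j] for j in range(n)]
                for i in range(m)]
    return [[sum(U[i][k] * weighted[i][j] for i in range(m)) for j in range(n)]
            for k in range(r)]


def fused(U, S, W, n, r):
    """Same result, touching only the nonzero cells of W."""
    result = [[0.0] * n for _ in range(r)]
    for (i, j), weight in W.items():
        # Only this single cell of U %*% S is ever needed.
        cell = sum(U[i][k] * S[k][j] for k in range(r))
        scaled = weight * cell
        for k in range(r):
            result[k][j] += U[i][k] * scaled
    return result


# Tiny example: U is 3 x 2, S is 2 x 4, W is 3 x 4 with three nonzeros.
U = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
S = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
W = {(0, 0): 1.0, (1, 2): 1.0, (2, 3): 1.0}

naive_hs = naive(U, S, W, m=3, n=4, r=2)
fused_hs = fused(U, S, W, n=4, r=2)
print(fused_hs)
```

The fused loop does work proportional to the number of nonzeros in W times the rank, instead of the full m × n × r cost of the dense product.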
  • 36. The SystemML Optimizer Stack: High-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) [Diagram: a single wdivmm operator with inputs W, U, and S; size annotations: one 1.2GB sparse and three 800MB dense.] •  Construct HOPs •  Propagate stats •  Determine distributed operations •  Rewrites
  • 37. The SystemML Optimizer Stack: Low-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) [Diagram: a single wdivmm operator with inputs W, U, and S; size annotations: one 1.2GB sparse and three 800MB dense.] •  Convert HOPs to Low-Level Operations (LOPs)
  • 38. The SystemML Optimizer Stack: Low-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) [Diagram: a Single-Node WDivMM operator with inputs W, U, and S.] •  Convert HOPs to Low-Level Operations (LOPs)
  • 39. The SystemML Optimizer Stack: Low-Level Operations Layers HS = t(U) %*% (W * (U %*% S)) [Diagram: a Single-Node WDivMM operator with inputs W, U, and S, with an arrow labeled "To SystemML Runtime".] •  Convert HOPs to Low-Level Operations (LOPs) •  Generate runtime instructions
  • 40. The SystemML Runtime for Spark •  Automates critical performance decisions – Distributed or local computation? – How to partition the data? – To persist or not to persist?
  • 41. The SystemML Runtime for Spark •  Distributed vs local: Hybrid runtime – Multithreaded computation in Spark Driver – Distributed computation in Spark Executors – Optimizer makes a cost-based choice
  • 42. The SystemML Runtime for Spark. Efficient Linear Algebra: •  Binary block matrices (JavaPairRDD<MatrixIndexes, MatrixBlock>) •  Adaptive block storage formats: Dense, Sparse, Ultra-Sparse, Empty •  Efficient kernels for all combinations of block types. Automated RDD Caching: •  Lineage tracking for RDDs/broadcasts •  Guarded RDD collect/parallelize •  Partitioned broadcast variables. [Figures: Logical Blocking (w/ Bc=1,000); Physical Blocking and Partitioning (w/ Bc=1,000)]
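To make the blocking idea concrete, the Python sketch below cuts a matrix into square blocks keyed by a (row block, column block) pair, loosely mirroring the `MatrixIndexes` to `MatrixBlock` pairing named on the slide, and then picks a storage format per block from its density. The 1-based block indexes, both threshold values, and the function names are assumptions made for this example; the slide uses a block size of 1,000, while the demo uses 2 to stay readable.

```python
def to_blocks(matrix, block_size):
    """Split a list-of-lists matrix into blocks keyed by (row block, col block)."""
    rows, cols = len(matrix), len(matrix[0])
    blocks = {}
    for row_start in range(0, rows, block_size):
        for col_start in range(0, cols, block_size):
            block = [row[col_start:col_start + block_size]
                     for row in matrix[row_start:row_start + block_size]]
            # Assumption: block indexes start at 1.
            key = (row_start // block_size + 1, col_start // block_size + 1)
            blocks[key] = block
    return blocks


def storage_format(block, sparse_below=0.4, ultra_sparse_below=0.001):
    """Pick a format from the block's density (thresholds are hypothetical)."""
    cells = sum(len(row) for row in block)
    nonzeros = sum(1 for row in block for value in row if value != 0)
    if nonzeros == 0:
        return "empty"
    density = nonzeros / cells
    if density < ultra_sparse_below:
        return "ultra-sparse"
    if density < sparse_below:
        return "sparse"
    return "dense"


# 5 x 5 identity matrix with one extra entry, cut into 2 x 2 blocks.
matrix = [[1 if i == j else 0 for j in range(5)] for i in range(5)]
matrix[0][3] = 7

blocks = to_blocks(matrix, block_size=2)
formats = {key: storage_format(block) for key, block in blocks.items()}
for key in sorted(formats):
    print(key, formats[key])
```

Blocks on the right and bottom edges are smaller than the block size, and each block chooses its format independently of its neighbors.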
  • 43. Recap. Questions: •  How does SystemML know it’s better to run on one machine? •  Why is SystemML so much faster than single-node R? Answers: •  Live variable analysis •  Propagation of statistics •  Advanced rewrites •  Efficient runtime
  • 44. But wait, there’s more! •  Many other rewrites •  Cost-based selection of physical operators •  Dynamic recompilation for accurate stats •  Parallel FOR (ParFor) optimizer •  Direct operations on RDD partitions •  YARN and MapReduce support
  • 45. Open-Sourcing SystemML •  SystemML is open source! –  Announced in June 2015 –  Available on GitHub since September 1 –  First open-source binary release (0.8.0) in October 2015 –  Entered Apache incubation in November 2015 –  First Apache open-source binary release (0.9) available now •  We are actively seeking contributors and users! http://systemml.apache.org/
  • 46. THANK YOU. For more information, go to http://systemml.apache.org/