SlideShare a Scribd company logo
Bringing Algebraic Semantics
to Apache Mahout
Sebastian Schelter
Infolunch @ Hasso Plattner Institut, Potsdam
05/07/2014
Mahout: Past & Future
Apache Mahout: History
• library for scalable machine learning (ML)
• started six years ago as ML on MapReduce
• focus on popular ML problems and algorithms
– Collaborative Filtering
• (Item-based, User-based, Matrix Factorization)
– Classification
• (Naive Bayes, Logistic Regression, Random Forest)
– Clustering
• (k-Means, Streaming k-Means)
– Dimensionality Reduction
• (Lanczos, Stochastic SVD)
+ in-memory math library (fork of Colt)
• large userbase (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, Researchgate, Twitter)
Apache Mahout: Problems
• Organisational
– Onerous to provide support and documentation for scalable ML
(need to convey technical and theoretical background at the same
time)
– partly failed to compensate for developers that left the project
– barrier for contributions too low
• Technical
– MapReduce not well suited for ML
• slow execution, especially for iterations
• constrained programming model makes code hard to
write, read and adjust
– different quality of implementations
– lack of unified preprocessing machinery
Apache Mahout: Future
• Narrowing of focus
– removed a large number of rarely used or barely maintained algorithm
implementations
• Abandonment of MapReduce
– will reject new MapReduce implementations
– widely used „legacy“ implementations will be maintained
• „Reboot“
– usage of modern parallel processing systems which offer richer progamming
model & more efficient execution
→ Apache Spark, possibly Stratosphere in the future
– DSL for linear algebraic operations as foundation for new implemenations
– automatic optimization and parallelization of programs written in this DSL
Scala & Spark-Bindings
Requirements for an ideal ML environment
1. R/Matlab-like semantics
– type system that covers (a) linear algebra,
(b) statistics and (c) data frames
2. Modern programming language qualities
– functional programming
– object oriented programming
– sensible byte code Performance
– scriptable and interactive
3. Scalability
– automatic distribution and
parallelization with sensible
performance
4. Collection of off-the-shelf building
blocks and algorithms
5. Visualization
Requirements for an ideal ML environment
1. R/Matlab-like semantics
– type system that covers (a) linear algebra,
(b) statistics and (c) data frames
2. Modern programming language qualities
– functional programming
– object oriented programming
– sensible byte code Performance
– scriptable and Interactive
3. Scalability
– automatic distribution and
parallelization with sensible
performance
4. Collection of off-the-shelf building
blocks and algorithms
5. Visualization → Mahout‘s new Scala & Spark-bindings
address requirements 1(a), 2, 3 and 4
Scala & Spark Bindings
• Scala as programming/scripting environment
• R-like DSL :
val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
• Algebraic expression optimizer for distributed linear algebra
– provides a translation layer to distributed engines
– currently supports Apache Spark only
– might support Stratosphere in the future
q
T
q
TTT
ssCCBBG 
Data Types
• Scalar real values
• In-memory vectors
– dense
– 2 types of sparse
• In-memory matrices
– sparse and dense
– a number of specialized matrices
• Distributed Row Matrices (DRM)
– huge matrix, partitioned by rows
– lives in the main memory of the cluster
– provides small set of parallelized
operations
– lazily evaluated operation execution
val x = 2.367
val v = dvec(1, 0, 5)
val w =
svec((0 -> 1)::(2 -> 5)::Nil)
val A = dense((1, 0, 5),
(2, 1, 4),
(4, 3, 1))
val drmA = drmFromHDFS(...)
Features (1)
• Matrix, vector, scalar operators:
in-memory, out-of-core
• Slicing operators
• Assignments (in-memory only)
• Vector-specific
• Summaries
drmA %*% drmB
A %*% x
A.t %*% drmB
A * B
A(5 until 20, 3 until 40)
A(5, ::); A(5, 5)
x(a to b)
A(5, ::) := x
A *= B
A -=: B; 1 /:= x
x dot y; x cross y
A.nrow; x.length;
A.colSums; B.rowMeans
x.sum; A.norm
Features (2)
• solving linear systems
• in-memory decompositions
• out-of-core decompositions
• caching of DRMs
val x = solve(A, b)
val (inMemQ, inMemR) = qr(inMemM)
val ch = chol(inMemM)
val (inMemV, d) = eigen(inMemM)
val (inMemU, inMemV, s) = svd(inMemM)
val (drmQ, inMemR) = thinQR(drmA)
val (drmU, drmV, s) =
dssvd(drmA, k = 50, q = 1)
val drmA_cached = drmA.checkpoint()
drmA_cached.uncache()
Example
Cereals
Name protein fat carbo sugars rating
Apple Cinnamon Cheerios 2 2 10.5 10 29.509541
Cap‘n‘Crunch 1 2 12 12 18.042851
Cocoa Puffs 1 1 12 13 22.736446
Froot Loops 2 1 11 13 32.207582
Honey Graham Ohs 1 2 12 11 21.871292
Wheaties Honey Gold 2 1 16 8 36.187559
Cheerios 6 2 17 1 50.764999
Clusters 3 2 13 7 40.400208
Great Grains Pecan 3 3 13 4 45.811716
http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
Linear Regression
• Assumption: target variable y generated by linear combination of feature
matrix X with parameter vector β, plus noise ε
• Goal: find estimate of the parameter
vector β that explains the data well
• Cereals example
X = weights of ingredients
y = customer rating
  Xy
Data Ingestion
• Usually: load dataset as DRM from a distributed filesystem:
val drmData = drmFromHdfs(...)
• ‚Mimick‘ a large dataset for our example:
val drmData = drmParallelize(dense(
(2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios
(1, 2, 12, 12, 18.042851), // Cap'n'Crunch
(1, 1, 12, 13, 22.736446), // Cocoa Puffs
(2, 1, 11, 13, 32.207582), // Froot Loops
(1, 2, 12, 11, 21.871292), // Honey Graham Ohs
(2, 1, 16, 8, 36.187559), // Wheaties Honey Gold
(6, 2, 17, 1, 50.764999), // Cheerios
(3, 2, 13, 7, 40.400208), // Clusters
(3, 3, 13, 4, 45.811716)), // Great Grains Pecan
numPartitions = 2)
Data Preparation
• Cereals example: target variable y is customer rating, weights of
ingredients are features X
• extract X as DRM by slicing,
fetch y as in-core vector
val drmX = drmData(::, 0 until 4)
val y = drmData.collect(::, 4)




























8117164541333
4002084071323
7649995011726
1875593681612
87129221111221
20758232131112
73644622131211
04285118121221
509541291051022
.
.
.
.
.
.
.
.
..
drmX y
Estimating β
• Ordinary Least Squares: minimizes the sum of residual squares between
true target variable and prediction of target variable
• Closed-form expression for estimation of ß as
• Computing XTX and XTy is as simple as typing the formulas:
val drmXtX = drmX.t %*% drmX
val drmXty = drmX %*% y
yXXX
TT 1
)(ˆ 

Estimating β
• Solve the following linear system to get least-squares estimate of ß
• Fetch XTX andXTy onto the driver and use an in-core solver
– assumes XTX fits into memory
– uses analogon to R’s solve() function
val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val betaHat = solve(XtX, Xty)
yXXX
TT
ˆ
Estimating β
• Solve the following linear system to get least-squares estimate of ß
• Fetch XTX andXTy onto the driver and use an in-memory solver
– assumes XTX fits into memory
– uses analogon to R’s solve() function
val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val betaHat = solve(XtX, Xty)
yXXX
TT
ˆ
→ We have implemented distributed linear regression!
Goodness of fit
• Prediction of the target variable is simple matrix-vector multiplication
• Check L2 norm of the difference between true target variable and our
prediction
val yHat = (drmX %*% betaHat).collect(::, 0)
(y - yHat).norm(2)
ˆˆ Xy 
Adding a bias term
• Bias term left out so far
– constant factor added to the model, “shifts the line vertically”
• Common trick is to add a column of ones to the feature matrix
– bias term will be learned automatically




























141333
171323
111726
181612
1111221
1131112
1131211
1121221
11051022 .




























41333
71323
11726
81612
111221
131112
131211
121221
105.1022
Adding a bias term
• How do we add a new column to a DRM?
→ mapBlock() allows custom modifications to the matrix
val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) {
case(keys, block) =>
// create a new block with an additional column
val blockWithBiasCol = block.like(block.nrow, block.ncol+1)
// copy data from current block into the new block
blockWithBiasCol(::, 0 until block.ncol) := block
// last column consists of ones
blockWithBiasColumn(::, block.ncol) := 1
keys -> blockWithBiasColumn
}
Under the covers
Underlying systems
• currently: prototype on Apache Spark
– fast and expressive cluster
computing system
– general computation graphs,
in-memory primitives, rich API,
interactive shell
• future: add Stratosphere
– research system developed by
TU Berlin, HU Berlin, HPI
– accepted into Apache Incubator recently
– functionality similar to Apache Spark, adds data flow
optimization and efficient out-of-core execution
Optimization
• Execution is defered, user
composes logical operators
• Computational actions implicitly
trigger optimization (= selection
of physical plan) and execution
• Optimization factors: size of operands, orientation of operands,
partitioning, sharing of computational paths
• e. g.: matrix multiplication:
– 5 physical operators for drmA %*% drmB
– 2 operators for drmA %*% inMemA
– 1 operator for drm A %*% x
– 1 operator for x %*% drmA
val drmA = drmB.t %*% drmC
drmA.writeDrm(path);
val inMemA =
(drmB.t %*% drmB).collect
Optimization Example (1)
• Computation of XTX in the linear regression example
• Logical optimization:
logical plan gets rewritten to use
a special logical operator for
Transpose-Times-Self multiplication
via a rule in the optimizer
val drmXtX = drmX.t %*% drmX
val XtX = drmXtX.collect
OpAt
OpAB
drmX drmX
drmXtX
OpAtA
drmX
drmXtX
Optimization Example (2)
• Mahout computes XTX via row-outer-product formulation of matrix
multiplication that can be efficiently executed in a single pass over the
row-partitioned X
• optimizer chooses between to physical operators for OpAtA
– standard implementation:
• Map operator builds partial outer products from individual blocks based
on an estimate of a good partitioning
• Reduce operator adds them up using a Reduce operator
– AtA_slim for tall but skinny matrices when an intermediate symmetric matrix
requiring (X.ncol * X.ncol) / 2 entries fits into memory




m
i
T
ii
T
xxXX
0
Thank you. Questions?
Credits:
Design & implementation of the Scala & Spark Bindings
and parts of this slideset by Dmitriy Lyubimov
dlyubimov@apache.org

More Related Content

What's hot

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
Databricks
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
DB Tsai
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
DB Tsai
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
Kenta Oono
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O API
Sri Ambati
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_public
Long Nguyen
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
Ed Kohlwey
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
Spark Summit
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
Krishna Sankar
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
Robert Metzger
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Robert Metzger
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
Wush Wu
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Robert Metzger
 

What's hot (20)

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O API
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_public
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 

Similar to Bringing Algebraic Semantics to Mahout

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
Brendan Gregg
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
MLconf
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Hsien-Hsin Sean Lee, Ph.D.
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
PyData
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Spark Summit
 
Writing Fast MATLAB Code
Writing Fast MATLAB CodeWriting Fast MATLAB Code
Writing Fast MATLAB Code
Jia-Bin Huang
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Arvind Surve
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Arvind Surve
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
Martin Zapletal
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Gruter
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Till Rohrmann
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
Sperasoft
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
Xiangrui Meng
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineDataWorks Summit
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
Max Kleiner
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Flink Forward
 

Similar to Bringing Algebraic Semantics to Mahout (20)

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Writing Fast MATLAB Code
Writing Fast MATLAB CodeWriting Fast MATLAB Code
Writing Fast MATLAB Code
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 

More from sscdotopen

Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
sscdotopen
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenderssscdotopen
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahoutsscdotopen
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filteringsscdotopen
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 

More from sscdotopen (8)

Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
mahout-cf
mahout-cfmahout-cf
mahout-cf
 

Recently uploaded

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 

Bringing Algebraic Semantics to Mahout

  • 1. Bringing Algebraic Semantics to Apache Mahout Sebastian Schelter Infolunch @ Hasso Plattner Institut, Potsdam 05/07/2014
  • 2. Mahout: Past & Future
  • 3. Apache Mahout: History • library for scalable machine learning (ML) • started six years ago as ML on MapReduce • focus on popular ML problems and algorithms – Collaborative Filtering • (Item-based, User-based, Matrix Factorization) – Classification • (Naive Bayes, Logistic Regression, Random Forest) – Clustering • (k-Means, Streaming k-Means) – Dimensionality Reduction • (Lanczos, Stochastic SVD) + in-memory math library (fork of Colt) • large userbase (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, Researchgate, Twitter)
  • 4. Apache Mahout: Problems • Organisational – Onerous to provide support and documentation for scalable ML (need to convey technical and theoretical background at the same time) – partly failed to compensate for developers that left the project – barrier for contributions too low • Technical – MapReduce not well suited for ML • slow execution, especially for iterations • constrained programming model makes code hard to write, read and adjust – different quality of implementations – lack of unified preprocessing machinery
  • 5. Apache Mahout: Future • Narrowing of focus – removed a large number of rarely used or barely maintained algorithm implementations • Abandonment of MapReduce – will reject new MapReduce implementations – widely used „legacy“ implementations will be maintained • „Reboot“ – usage of modern parallel processing systems which offer richer progamming model & more efficient execution → Apache Spark, possibly Stratosphere in the future – DSL for linear algebraic operations as foundation for new implemenations – automatic optimization and parallelization of programs written in this DSL
  • 7. Requirements for an ideal ML environment 1. R/Matlab-like semantics – type system that covers (a) linear algebra, (b) statistics and (c) data frames 2. Modern programming language qualities – functional programming – object oriented programming – sensible byte code Performance – scriptable and interactive 3. Scalability – automatic distribution and parallelization with sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization
  • 8. Requirements for an ideal ML environment 1. R/Matlab-like semantics – type system that covers (a) linear algebra, (b) statistics and (c) data frames 2. Modern programming language qualities – functional programming – object oriented programming – sensible byte code Performance – scriptable and Interactive 3. Scalability – automatic distribution and parallelization with sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization → Mahout‘s new Scala & Spark-bindings address requirements 1(a), 2, 3 and 4
  • 9. Scala & Spark Bindings • Scala as programming/scripting environment • R-like DSL : val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q) • Algebraic expression optimizer for distributed linear algebra – provides a translation layer to distributed engines – currently supports Apache Spark only – might support Stratosphere in the future q T q TTT ssCCBBG 
  • 10. Data Types • Scalar real values • In-memory vectors – dense – 2 types of sparse • In-memory matrices – sparse and dense – a number of specialized matrices • Distributed Row Matrices (DRM) – huge matrix, partitioned by rows – lives in the main memory of the cluster – provides small set of parallelized operations – lazily evaluated operation execution val x = 2.367 val v = dvec(1, 0, 5) val w = svec((0 -> 1)::(2 -> 5)::Nil) val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1)) val drmA = drmFromHDFS(...)
  • 11. Features (1) • Matrix, vector, scalar operators: in-memory, out-of-core • Slicing operators • Assignments (in-memory only) • Vector-specific • Summaries drmA %*% drmB A %*% x A.t %*% drmB A * B A(5 until 20, 3 until 40) A(5, ::); A(5, 5) x(a to b) A(5, ::) := x A *= B A -=: B; 1 /:= x x dot y; x cross y A.nrow; x.length; A.colSums; B.rowMeans x.sum; A.norm
  • 12. Features (2) • solving linear systems • in-memory decompositions • out-of-core decompositions • caching of DRMs val x = solve(A, b) val (inMemQ, inMemR) = qr(inMemM) val ch = chol(inMemM) val (inMemV, d) = eigen(inMemM) val (inMemU, inMemV, s) = svd(inMemM) val (drmQ, inMemR) = thinQR(drmA) val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1) val drmA_cached = drmA.checkpoint() drmA_cached.uncache()
  • 14. Cereals Name protein fat carbo sugars rating Apple Cinnamon Cheerios 2 2 10.5 10 29.509541 Cap‘n‘Crunch 1 2 12 12 18.042851 Cocoa Puffs 1 1 12 13 22.736446 Froot Loops 2 1 11 13 32.207582 Honey Graham Ohs 1 2 12 11 21.871292 Wheaties Honey Gold 2 1 16 8 36.187559 Cheerios 6 2 17 1 50.764999 Clusters 3 2 13 7 40.400208 Great Grains Pecan 3 3 13 4 45.811716 http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
  • 15. Linear Regression • Assumption: target variable y generated by linear combination of feature matrix X with parameter vector β, plus noise ε • Goal: find estimate of the parameter vector β that explains the data well • Cereals example X = weights of ingredients y = customer rating   Xy
  • 16. Data Ingestion • Usually: load dataset as DRM from a distributed filesystem: val drmData = drmFromHdfs(...) • ‚Mimick‘ a large dataset for our example: val drmData = drmParallelize(dense( (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios (1, 2, 12, 12, 18.042851), // Cap'n'Crunch (1, 1, 12, 13, 22.736446), // Cocoa Puffs (2, 1, 11, 13, 32.207582), // Froot Loops (1, 2, 12, 11, 21.871292), // Honey Graham Ohs (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold (6, 2, 17, 1, 50.764999), // Cheerios (3, 2, 13, 7, 40.400208), // Clusters (3, 3, 13, 4, 45.811716)), // Great Grains Pecan numPartitions = 2)
  • 17. Data Preparation • Cereals example: target variable y is customer rating, weights of ingredients are features X • extract X as DRM by slicing, fetch y as in-core vector val drmX = drmData(::, 0 until 4) val y = drmData.collect(::, 4)                             8117164541333 4002084071323 7649995011726 1875593681612 87129221111221 20758232131112 73644622131211 04285118121221 509541291051022 . . . . . . . . .. drmX y
  • 18. Estimating β • Ordinary Least Squares: minimizes the sum of residual squares between true target variable and prediction of target variable • Closed-form expression for estimation of ß as • Computing XTX and XTy is as simple as typing the formulas: val drmXtX = drmX.t %*% drmX val drmXty = drmX %*% y yXXX TT 1 )(ˆ  
  • 19. Estimating β • Solve the following linear system to get least-squares estimate of ß • Fetch XTX andXTy onto the driver and use an in-core solver – assumes XTX fits into memory – uses analogon to R’s solve() function val XtX = drmXtX.collect val Xty = drmXty.collect(::, 0) val betaHat = solve(XtX, Xty) yXXX TT ˆ
  • 20. Estimating β • Solve the following linear system to get least-squares estimate of ß • Fetch XTX andXTy onto the driver and use an in-memory solver – assumes XTX fits into memory – uses analogon to R’s solve() function val XtX = drmXtX.collect val Xty = drmXty.collect(::, 0) val betaHat = solve(XtX, Xty) yXXX TT ˆ → We have implemented distributed linear regression!
  • 21. Goodness of fit • Prediction of the target variable is simple matrix-vector multiplication • Check L2 norm of the difference between true target variable and our prediction val yHat = (drmX %*% betaHat).collect(::, 0) (y - yHat).norm(2) ˆˆ Xy 
  • 22. Adding a bias term • Bias term left out so far – constant factor added to the model, “shifts the line vertically” • Common trick is to add a column of ones to the feature matrix – bias term will be learned automatically                             141333 171323 111726 181612 1111221 1131112 1131211 1121221 11051022 .                             41333 71323 11726 81612 111221 131112 131211 121221 105.1022
  • 23. Adding a bias term • How do we add a new column to a DRM? → mapBlock() allows custom modifications to the matrix val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) { case(keys, block) => // create a new block with an additional column val blockWithBiasCol = block.like(block.nrow, block.ncol+1) // copy data from current block into the new block blockWithBiasCol(::, 0 until block.ncol) := block // last column consists of ones blockWithBiasColumn(::, block.ncol) := 1 keys -> blockWithBiasColumn }
  • 25. Underlying systems • currently: prototype on Apache Spark – fast and expressive cluster computing system – general computation graphs, in-memory primitives, rich API, interactive shell • future: add Stratosphere – research system developed by TU Berlin, HU Berlin, HPI – accepted into Apache Incubator recently – functionality similar to Apache Spark, adds data flow optimization and efficient out-of-core execution
  • 26. Optimization • Execution is defered, user composes logical operators • Computational actions implicitly trigger optimization (= selection of physical plan) and execution • Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths • e. g.: matrix multiplication: – 5 physical operators for drmA %*% drmB – 2 operators for drmA %*% inMemA – 1 operator for drm A %*% x – 1 operator for x %*% drmA val drmA = drmB.t %*% drmC drmA.writeDrm(path); val inMemA = (drmB.t %*% drmB).collect
  • 27. Optimization Example (1) • Computation of XTX in the linear regression example • Logical optimization: logical plan gets rewritten to use a special logical operator for Transpose-Times-Self multiplication via a rule in the optimizer val drmXtX = drmX.t %*% drmX val XtX = drmXtX.collect OpAt OpAB drmX drmX drmXtX OpAtA drmX drmXtX
  • 28. Optimization Example (2) • Mahout computes XTX via row-outer-product formulation of matrix multiplication that can be efficiently executed in a single pass over the row-partitioned X • optimizer chooses between to physical operators for OpAtA – standard implementation: • Map operator builds partial outer products from individual blocks based on an estimate of a good partitioning • Reduce operator adds them up using a Reduce operator – AtA_slim for tall but skinny matrices when an intermediate symmetric matrix requiring (X.ncol * X.ncol) / 2 entries fits into memory     m i T ii T xxXX 0
  • 29. Thank you. Questions? Credits: Design & implementation of the Scala & Spark Bindings and parts of this slideset by Dmitriy Lyubimov dlyubimov@apache.org