ddR is an R package that introduces distributed data structures such as darray, dframe, and dlist, and provides a standardized API for distributed iteration and data manipulation through functions like dmapply. ddR aims to make distributed computing in R easier to use while retaining good performance: algorithms are written once against its unified interface and can run on different distributed backends such as Spark and HPE Distributed R. Evaluation shows that ddR algorithms perform comparably to, or better than, custom implementations and other machine learning libraries.
dmapply: A functional primitive to express distributed machine learning algorithms in R
1. Authors: Edward Ma, Vishrut Gupta, Michun Hsu and Indrajit Roy
Presenter: Bikash Chandra Karmoakr, M.Sc. Student, ITIS, Leibniz Universität Hannover
Seminar on Database as a Service
Date: 26/01/2017
Place: TU Clausthal
3. Introduction
Distributed Computing and Data Structures in R
Challenges faced by R users and Objectives of ddR
ddR components and package structure
Communication and computation patterns using dmapply
Some examples and machine learning algorithms
Comparison with other packages and performance evaluation
Conclusion
References
4. R is one of the top choices for statisticians and data scientists.
ddR (Distributed Data Structures in R) was created to build a unified system that works across different distributed frameworks in R.
ddR introduces dmapply, which executes functions on distributed data structures.
dmapply offers a standardized API that is easy to use while providing enough flexibility and good performance.
6. Many applications reuse data:
◦ Multi-analysis on the same data: load once, run many operations
◦ Iterative algorithms: most machine learning and graph algorithms
Persistent, abstract references:
◦ Avoid data movement overhead (send, collect, send cycles)
◦ Enable caching
The analyst wants to express high-level data manipulations, NOT explicitly iterate over chunks.
7. Interfaces to distributed systems are custom, low-level, and non-idiomatic.
Spark has 50+ operators!
◦ map, flatMap, mapPartitions, mapPartitionsWithIndex...
◦ Lacks the common array, list, and data.frame operations that R users expect
◦ SparkR provides some abstraction, but has its own idiosyncrasies.
What if there were an API based on distributed data structures?
8. Standardize a unified API for distributed:
◦ Iteration
◦ Data structures
Enable:
◦ Basic manipulation and reduction of distributed data (lists, data frames, arrays)
◦ Implementation of parallel algorithms through low-level primitives
◦ Write once, run everywhere
10. 1. Iteration: common parallel operators for distributed data structures
◦ mapply() -> dmapply()
◦ lapply() -> dlapply()
◦ New: parts(), collect()
2. Data structures: distributed variants of core R data structures:
◦ array -> darray
◦ data.frame -> dframe
◦ list -> dlist
3. Shared infrastructure for backend implementations (Spark, Distributed R, ...)
The ddR package is not a new distributed infrastructure!
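To make the mapping concrete, here is a minimal sketch of the mapply -> dmapply correspondence, assuming the ddR package with its default parallel backend (the input values are illustrative):

```r
library(ddR)

# Base R: element-wise sum over two sequences
local_res <- mapply(sum, 1:3, 4:6)      # 5 7 9

# ddR: the same computation over distributed lists
A <- dlist(1, 2, 3)
B <- dlist(4, 5, 6)
C <- dmapply(sum, A, B)                 # C is a distributed dlist
print(collect(C))                       # fetch to the master: 5 7 9
```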
11. Distributed versions of array, list, and data.frame with conventional APIs:
◦ Accessors: parts, dim, names
◦ Summaries: mean, median, head, tail, rowSums, aggregate
◦ Sorting: sort
◦ Combination: c, cbind, rbind, merge
◦ Iteration: lapply, split
◦ Math and comparisons on arrays, transform on data.frames
◦ Distributed I/O, e.g. dread("data.csv")
Distributed iteration primitives for implementing algorithms: dmapply()
Enhanced ease of use, maintainability, and portability due to the standard API
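As a sketch of how these conventional operators compose, the snippet below builds a distributed array and inspects it. Note: the output.type, combine, and nparts arguments to dmapply are taken from the ddR paper's API description and are an assumption beyond what this slide lists.

```r
library(ddR)

# Build a darray from four 2x4 random blocks stacked by rows
DA <- dmapply(function(i) matrix(runif(8), nrow = 2), 1:4,
              output.type = "darray", combine = "rbind",
              nparts = c(4, 1))          # 4x1 grid of partitions

dim(DA)             # standard accessor: 8 x 4
M <- collect(DA)    # fetch the full matrix on the master
head(M)
```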
12. [Backend architecture diagram] ddR is the API package with the data structures + common operations. Third-party wrapper packages (parallel, distributedR.ddR, spark.ddR) delegate to existing backend interfaces; the active backend is selected with useBackend(parallel), useBackend(distributedR), or useBackend(spark).
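A minimal sketch of the "write once, switch backends" flow implied by the diagram; the package and driver names are those shown above:

```r
library(ddR)
useBackend(parallel)            # default single-server multicore backend

# library(distributedR.ddR)     # wrapper package for HPE Distributed R
# useBackend(distributedR)      # same program, different backend

A <- dlist(1, 2, 3, 4)
print(collect(dmapply(function(x) x + 1, A)))   # {2, 3, 4, 5}
```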
14. dlist(......., nparts, psize)
◦ Similar to the list() convention
◦ nparts and psize control partition count and size, respectively
dmapply(FUN, X, Y, MoreArgs = list(), nparts)
◦ Apply FUN to elements of X and Y, returning a dlist
parts(L)
◦ Return the set of partitions as a list of dlist objects
collect(L)
◦ Return the in-memory base R list representation of L
◦ Generally only used after aggregation
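A short sketch of these primitives together, assuming the default parallel backend: nparts fixes the partition count, and dmapply over parts() operates per partition rather than per element.

```r
library(ddR)

A <- dlist(1, 2, 3, 4, 5, 6, nparts = 2)   # two partitions of three elements
p <- parts(A)                               # local list of partition handles

# Each invocation receives one whole partition as a local list
sizes <- dmapply(function(part) length(part), p)
print(collect(sizes))                       # 3 3
```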
18. 1. # Create a distributed list. By default each element becomes a partition
2. A <- dlist(1,2,3,4,5)
3. # Access partitions
4. p <- parts(A)
5. # Multiply elements in each partition by a constant
6. B <- dmapply(function(x){ 2*x[[1]] }, p)
7. # Fetch the result (= {2, 4, 6, 8, 10}) on the master
8. print(collect(B))
19. 1. A <- dlist(1,2,3,4)
2. B <- dlist(11,12,13,14)
3. # C will be a dlist = {12,14,16,18}
4. C <- dmapply(FUN=sum, A, B)
5. # D will be a dlist = {13,15,17,19}
6. D <- dmapply(FUN=sum, A, B, MoreArgs=list(z=1))
7. print(collect(D))
20. Three machine learning algorithms are tested here:
1. randomforest, a decision-tree-based ensemble learning method,
2. the K-means clustering algorithm, and
3. linear regression.
The ddR versions of these algorithms are competitive with established open source machine learning libraries such as H2O and Spark MLlib.
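A hedged sketch of how such a ddR algorithm is invoked, mirroring the K-means walkthrough in the editor's notes below. The package name kmeans.ddR and the function name dkmeans are assumptions based on the ddR project's algorithm packages, and genData is a hypothetical user-written generator:

```r
library(ddR)
library(kmeans.ddR)       # assumed: distributed K-means built on the ddR API
useBackend(parallel)      # the same code could run on distributedR or spark

# Generate input in parallel as a distributed array, one partition per worker
genData <- function(i) matrix(runif(1000 * 10), ncol = 10)
X <- dmapply(genData, 1:4, output.type = "darray",
             combine = "rbind", nparts = c(4, 1))

model <- dkmeans(X, centers = 5)   # cluster the distributed data in parallel
```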
23. ddR follows an object-oriented programming pattern.
The main ddR package defines the abstract classes for distributed objects, while backend drivers are required to extend these classes via inheritance.
This permits drivers to override the default generic operators in ddR.
24. ddR algorithms can indeed be executed on a variety of backends such as R's parallel, SNOW, HPE Distributed R, and Spark, in both single-server and multi-server setups.
ddR algorithms have good performance and scalability, and are competitive with algorithms available in other products.
There is very little overhead to using ddR's abstractions: algorithms implemented in ddR have performance similar to algorithms written directly in the respective backend.
26. To create 500 decision trees from 1M observations with 10 features:
The default algorithm in R takes about 28 minutes to converge.
Using ddR, Distributed R can reduce the execution time to about 5 minutes with 12 cores.
27. To cluster 1.2M points with 100 attributes into 500 groups:
The default algorithm in R takes about 482s per iteration of K-means.
With SNOW, the ddR version of K-means takes 96s with 12 cores.
HPE Distributed R and parallel provide the best performance in this setup, completing each K-means iteration in just 10s with 12 cores.
28. For regression, 12M records each with 50 features are used.
R's single-threaded regression algorithm converges in 141s.
The ddR regression algorithm on HPE Distributed R takes 155s with a single core but converges in 33s with 12 cores.
The parallel version is faster and converges in around 20s with 12 cores.
29. The ddR version of K-means on parallel is about 1.5 times faster than H2O's K-means.
For example, ddR can complete each iteration in less than 7s with parallel using 12 cores, compared to more than 10s for H2O.
30. The figure compares ddR's regression implementation on parallel with H2O; H2O is slightly faster at 8 and 12 cores.
However, if the data size increases fivefold, H2O crashes while ddR's scalability remains the same on the HPE Distributed R and parallel backends.
31. The figure shows that Spark MLlib's K-means algorithm has performance similar to H2O, and is slightly slower than the ddR algorithm running on parallel.
32. The figure shows that the regression implementation in Spark MLlib, when using 4 or fewer cores, is about 2 times slower than both H2O and ddR's implementation on parallel or HPE Distributed R.
At 8 or more cores the performance of Spark MLlib is comparable to, but still below, the other systems.
33. The same ddR algorithms that work on a single server can also run in multi-server mode with an appropriate backend, processing hundreds of gigabytes of data with scalability similar to custom implementations.
34. To utilize multiple servers, a dataset of about 95GB with 120M rows and 100 features per record is used.
The custom regression algorithm in Distributed R takes 227s per iteration on a single server, which drops to 74s with 8 servers.
The ddR version of regression, running on Distributed R as the backend, takes about 251s per iteration, which drops to 97s with 8 servers.
The custom Distributed R version is only 23% faster than the ddR version, while the ddR algorithm can run on other backends, giving R users a single interface.
35. To utilize multiple servers, a dataset of about 180GB with 240M rows (also run with 30M, 60M, and 120M rows) and 100 features per record is used.
Observations:
First, when Spark is used as the backend, the ddR algorithm takes around 7 minutes per iteration; with Distributed R as the backend, the per-iteration time of ddR is around 6 minutes.
Second, a user who has both backends installed can therefore choose to run the application written in ddR, without any modifications, on Distributed R for better performance.
Finally, the evaluation shows that the ddR algorithm gives the same or better performance than the custom algorithm.
36. Distributed frameworks
◦ MapReduce, Pig, Hive, DryadLINQ, the Mahout library, Spark, Pregel, GraphLab, Concerto, Storm, Naiad, Ricardo, RHadoop, SparkR, SystemML, etc.
Databases and machine learning
◦ Most popular: dplyr
◦ Oracle, HPE Vertica, and MS SQL Server embed R in their databases; also MADlib, SAP HANA, etc.
Parallel libraries in R
◦ parallel, SNOW, foreach, Rmpi, HPE Distributed R, etc.
37. ddR is a standardized system that is easy to use and performs well.
ddR is a first step in extending the R language and providing a unified interface for distributed computing.
Write once, run everywhere.
38. Apache Mahout, Spark
HPE Vertica and Hadoop
Revolution R Enterprise ScaleR
Hadoop and MapR
H2O: machine learning library
ddR: distributed data structures in R
Editor's Notes
There is no general abstraction for data that is partitioned so that we can operate on those partitions; distributed data structures are the idea for working at that level.
Some already exist, like dplyr, Oracle R, etc.
There are various low-level tools, like fork(), socket communication, or MPI (Message Passing Interface), through which we can achieve parallelism.
On top of those there are R packages like parallel and SNOW (Simple Network of Workstations).
foreach and BiocParallel -> a functional, object-oriented approach to parallel computing on top of the parallel and SNOW packages.
MPP DBs (Massively Parallel Processing databases), Spark, Vertica ->
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation.
Vertica is a next-generation high-performance SQL analytics engine with integrated offerings to meet your varying needs: on premise, in the cloud, or on Hadoop. Your needs are unique; your analytics database should be too. Evaluate HPE Vertica today!
foreach:
The foreach package provides a new looping construct for executing R code repeatedly. The main reason for using the foreach package is that it supports parallel execution. The foreach package can be used with a variety of different parallel computing systems, including NetWorkSpaces and snow. In addition, foreach can be used with iterators, which allow the data to be specified in a very flexible way.
BiocParallel:
Idiosyncrasy: a behavioral attribute that is distinctive and peculiar to an individual.
Idiom: a manner of speaking that is natural to native speakers of a language.
Comments taken from the open SparkR JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7264
dmapply: distributed version of mapply, with several important differences.
parts: retrieves, as a list of independent objects, pointers to each individual partition of the input. parts() is primarily used in conjunction with dmapply when functions are written to be applied over partitions of distributed objects.
collect: fetches partition(s) of a darray, dframe, or dlist from remote workers.
ddR is based on these three, so anything that works on parallel will work on ddR; parallel is the default backend for ddR. We use the useBackend() function to select a different backend.
ddR is implemented in three layers:
Top layer: the application code, such as a distributed algorithm, which makes calls to the ddR API (e.g. dmapply) and associated utility functions (e.g. colSums).
Second layer: the core ddR package, which contains the implementations of the ddR API. This layer is responsible for error checking and other tasks common across backends, and invokes the underlying backend driver to delegate tasks. It consists of about 2,500 lines of code that provide generic definitions of distributed data structures and classes that the backend driver can extend.
Third layer: the backend driver (usually implemented as a separate R package such as distributedR.ddR), which is responsible for implementing the generic distributed classes and functions for that particular backend. Typically, a backend driver implementation involves 500-1,000 lines of code.
Aggregate
---------------
The most basic uses of aggregate involve base functions such as mean and sd. It is indeed one of the most common uses of aggregate to compare the mean or other properties of sample groups.
Function broadcast:
A common programming paradigm is to apply a function to each element of a data structure. Programmers can also express that a function should be applied to each partition at a time, instead of each element at a time, by calling dmapply(FUN, parts(A)).
Data broadcast:
In some cases, programmers need to include the same data in all invocations of a function. As an example, consider the K-means clustering algorithm, which iteratively groups input data into K clusters. In each iteration, the distance of the points to the centers has to be calculated, which means the centers from the previous iteration have to be available to all invocations of the distance-calculation function.
Partition based:
The dmapply approach allows programmers to operate on any subset of partitions that contain distributed data. Here, parts 1 and 2 work on part 3's data.
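A minimal sketch of the first two patterns, assuming the default parallel backend; the centers vector is purely illustrative:

```r
library(ddR)

A <- dlist(1, 2, 3, 4, nparts = 2)

# Function broadcast: FUN runs once per partition instead of once per element
per_part <- dmapply(function(part) sum(unlist(part)), parts(A))
print(collect(per_part))                 # 3 7

# Data broadcast: the same object reaches every invocation via MoreArgs,
# as the K-means centers do in the example above
centers <- c(0, 10)
nearest <- dmapply(function(x, centers) which.min(abs(x - centers)),
                   A, MoreArgs = list(centers = centers))
print(collect(nearest))                  # 1 1 1 1
```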
Figure 6: Example computation patterns in ddR.
A darray is a collection of array partitions. In this example, the darray is partitioned into 4 blocks, and each server holds only one partition. The darray argument nparts in the figure specifies how the partitions are laid out in a grid. We could also use a dframe instead of a darray.
This simple example creates a distributed list and accesses its partitions. Line 2 declares a distributed list which holds the numbers 1 to 5. By default it is stored as five partitions, each containing one number. In line 4, p is a local R list (not a distributed list) which has five elements, each a reference to a partition of A. Line 6 executes a function on each element of p, i.e. on each partition of A, and multiplies each partition by 2. The result B is a dlist, has five partitions, and is stored across multiple nodes. Line 8 gathers the result into a single local R list and prints it.
In this simple example there are only two distributed lists, A and B, as inputs, and the function is sum. The runtime will extract the first elements of A and B and apply sum to them, then the corresponding second elements, and so on. Line 4 in Figure 5 shows the corresponding program and its results. The MoreArgs argument is a way to pass a list of objects that are available as input to each invocation of the function. For example, in line 6, the constant z is passed to every invocation of sum, and hence 1 is added to each element of the previous result C.
The above code shows how programmers can invoke ddR's distributed clustering algorithm. Line 1 imports the ddR package, while line 2 imports a distributed K-means library written using the ddR API. Line 4 determines the backend on which the functions will be dispatched; in this example the backend used is the default parallel backend, which is single-node but can use multiple cores. In line 6, the input data is generated in parallel by calling a user-written function genData using dmapply. The input is returned as a distributed array with as many partitions as the number of cores in the server. Finally, in line 8, the ddR version of the K-means algorithm is invoked to cluster the input data in parallel. The key advantage of this ddR program is that the same code will run on a different backend, such as HPE Distributed R, if line 4 is simply changed to useBackend(distributedR).
The above code shows one implementation of distributed randomforest using ddR. Randomforest is an ensemble learning method used for classification and regression. The training phase creates a number of decision trees, such as 500 trees, that are later used for classification. Since training on large data can take hours, it is common to parallelize the computationally intensive step of building trees. Line 3 uses a simple parallelization strategy of broadcasting the input to all workers by specifying it in MoreArgs. Each worker then builds 50 trees in parallel (ntree=50) by calling the existing single-threaded randomforest function. At the end of the computation, all the trees are collected at the master and combined to form a single model in line 5. In this example, the full contents of a single data structure are broadcast to all workers.
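A hedged sketch of the broadcast-and-combine strategy this note describes; the iris dataset and the ten-worker split are illustrative rather than from the slides, and randomForest::combine is the standard helper for merging forests:

```r
library(ddR)
library(randomForest)

workers <- dlist(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # one element per worker

# Broadcast the full training data to every worker via MoreArgs;
# each worker grows 50 trees with the single-threaded randomForest()
forests <- dmapply(function(i, dat) {
  randomForest(Species ~ ., data = dat, ntree = 50)
}, workers, MoreArgs = list(dat = iris))

# Collect the ten 50-tree forests on the master and merge into one model
model <- do.call(combine, collect(forests))
```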
Using ddR, we can parallelize the tree building phase by assigning each core to build a subset of the 500 trees. By using multiple cores, each of the backends parallel, SNOW, and HPE Distributed R can reduce the execution time to about 5 minutes with 12 cores.
HPE Distributed R and parallel provide the best performance in this setup, completing each K-means iteration in just 10s with 12 cores.
The performance of SNOW is worse than others because of its inefficient communication layer. SNOW incurs high overheads when moving the input data from the master to the worker processes using sockets.
Since this dataset is multi-gigabyte, SNOW takes tens of minutes to converge, of which most of the time is spent in moving data between processes. Therefore, we exclude SNOW from the figure.
Figure 10 (Topic 6.1.2)
Figure 11 (Topic 6.1.2)
The reason for the slight performance advantage is that H2O uses multi-threading instead of multi-processing (as done by parallel), which lowers the cost of sharing data across workers.