dmapply: A functional primitive to express distributed machine learning algorithms in R
Authors
Edward Ma, Vishrut Gupta, Meichun Hsu and Indrajit Roy
Presenter
Bikash Chandra Karmokar
M.Sc. Student, ITIS
Leibniz Universität Hannover
Seminar on
Database as a Service
Date: 26/01/2017
Place: TU Clausthal
 Introduction
 Distributed Computing and Data Structures in R
 Challenges faced by R users and Objectives of ddR
 ddR components and package structure
 Communication and computation patterns using dmapply
 Some examples and machine learning algorithms
 Comparison with other packages and performance evaluation
 Conclusion
 References
 R is one of the top choices for statisticians and data scientists
 ddR (Distributed Data structures in R) was created to provide a unified system that works across different distributed frameworks in R
 ddR introduces dmapply, which executes functions on distributed data structures.
 dmapply offers a standardized interface that is easy to use while retaining flexibility and good performance.
Distributed computing and distributed data landscape in R (slide diagram): low-level mechanisms such as fork(), sockets, MPI, Spark, Vertica and MPP databases are wrapped by R packages such as parallel, snow, SparkR, Distributed R, foreach and BiocParallel, each exposing its own interface (e.g. foreach(), bpapply(), parts(data)).
 Many applications reuse data:
◦ Multi-analysis on same data: load once, run many operations
◦ Iterative algorithms: most machine learning + graph
algorithms
 Persistent, abstract references:
◦ Avoid data movement overhead (send, collect, send cycles)
◦ Enable caching
 Analysts want to express high-level data manipulations
 NOT explicitly iterate over chunks
 Interfaces to distributed systems are custom, low-level and non-idiomatic
 Spark has 50+ operators!
◦ Map, flatmap, mapPartitions, mapPartitionsWithIndex...
◦ Lacks common array, list, data.frame operations that R
users expect
◦ SparkR provides some abstraction, but has its own
idiosyncrasies.
What if there were an API based on distributed data structures?
 Standardize a unified API for distributed:
◦ Iteration
◦ Data Structures
 Enable:
◦ Basic manipulation and reduction of distributed
data (lists, data frames, arrays)
◦ Implementation of parallel algorithms through low-
level primitives
◦ Write once, run everywhere
1. Iteration: Common parallel operators for distributed
data-structures
◦ mapply() -> dmapply()
◦ lapply() -> dlapply()
◦ New: parts(), collect()
2. Data Structures: Distributed variants of core R data-
structures:
◦ array -> darray
◦ data.frame -> dframe
◦ list -> dlist
3. Shared infrastructure for backend implementations
(Spark, Distributed R, ...)
The ddR package is not a new distributed infrastructure!
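To make the mapping concrete, here is a minimal sketch (an illustrative addition, assuming the default parallel backend) of the same element-wise computation written with base R's mapply and with ddR's dmapply:

library(ddR)
useBackend(parallel)                       # default single-node backend

# base R: element-wise iteration over ordinary lists
x <- list(1, 2, 3); y <- list(10, 20, 30)
mapply(function(a, b) a + b, x, y)         # 11 21 31

# ddR: the same pattern on distributed lists
dx <- dlist(1, 2, 3); dy <- dlist(10, 20, 30)
collect(dmapply(function(a, b) a + b, dx, dy))   # list(11, 21, 31)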
 Distributed versions of array, list, data.frame with
conventional APIs:
◦ Accessors: parts, dim, names
◦ Summaries: mean, median, head, tail, rowSums, aggregate
◦ Sorting: sort
◦ Combination: c, cbind, rbind, merge
◦ Iteration: lapply, split
◦ Math and comparisons on arrays, transform on data.frames
◦ Distributed IO, e.g. dread("data.csv")
 Distributed iteration primitives for implementing
algorithms: dmapply()
 Enhanced ease of use, maintainability and portability
due to standard API
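As an illustration of these conventional operations, a minimal sketch follows; the darray() arguments shown (dim, psize, data) follow the slide's description of partitioned arrays and should be treated as assumptions about the exact signature:

library(ddR)
useBackend(parallel)

# a 4x4 distributed array split into four 2x2 partitions, filled with 1s
# (argument names are an assumption for illustration)
A <- darray(dim = c(4, 4), psize = c(2, 2), data = 1)

dim(A)        # standard accessors work on the distributed object
colSums(A)    # summaries are computed partition-wise by the backend
collect(A)    # fetch the full 4x4 matrix to the master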
Package structure (slide diagram): ddR is the API package providing the distributed data structures and common operations; parallel, distributedR.ddR and spark.ddR are third-party wrapper packages that delegate to the existing backend interfaces and are selected with useBackend(parallel), useBackend(distributedR) or useBackend(spark).
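A minimal sketch of how the same ddR code is pointed at different backends; the wrapper package names come from the slide and are assumed to be installed with the corresponding engines:

library(ddR)
useBackend(parallel)             # default backend shipped with ddR

# switching engines only changes this one line (packages assumed installed):
# library(distributedR.ddR); useBackend(distributedR)
# library(spark.ddR);        useBackend(spark)

A <- dlist(1, 2, 3, 4)           # the rest of the program is unchanged
collect(dmapply(function(x) x * 10, A))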
 dlist(..., nparts, psize)
◦ Similar to list() convention
◦ nparts and psize control partition count and size,
respectively
 dmapply(FUN, X, Y, MoreArgs = list(), nparts)
◦ Apply FUN to elements of X and Y, returning a dlist
 parts(L)
◦ Return the set of partitions as a list of dlist objects
 collect(L)
◦ Return the in-memory base R list representation of L
◦ Generally only used after aggregation
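A small sketch of the partitioning arguments, assuming the default parallel backend; psize is left at its default since nparts already fixes the layout here:

library(ddR)
useBackend(parallel)

# six elements spread over three partitions of two elements each
L <- dlist(1, 2, 3, 4, 5, 6, nparts = 3)

length(parts(L))    # 3: one handle per partition
collect(L)          # list(1, 2, 3, 4, 5, 6) back on the master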
Function Broadcast: apply a function to each element of a distributed object, or to each partition at a time via dmapply(FUN, parts(A)).
Data Broadcast: make the same data (e.g. the current cluster centers in K-means) available to every invocation of the function via MoreArgs.
Partition Based: operate on any chosen subset of the partitions that hold the distributed data.
These patterns are illustrated by the examples on the following slides.
1. # Create a distributed list. By default each element becomes a partition
2. A <- dlist(1, 2, 3, 4, 5)
3. # Access partitions
4. p <- parts(A)
5. # Multiply elements in each partition by a constant
6. B <- dmapply(function(x) { 2 * x[[1]] }, p)
7. # Fetch the result (= {2, 4, 6, 8, 10}) on the master
8. print(collect(B))
1. A <- dlist(1, 2, 3, 4)
2. B <- dlist(11, 12, 13, 14)
3. # C will be a dlist = {12, 14, 16, 18}
4. C <- dmapply(FUN = sum, A, B)
5. # D will be a dlist = {13, 15, 17, 19}
6. D <- dmapply(FUN = sum, A, B, MoreArgs = list(z = 1))
7. print(collect(D))
Three machine learning algorithms are evaluated here:
1. randomforest, a decision-tree-based ensemble learning method,
2. the K-means clustering algorithm, and
3. linear regression.
The ddR versions of these algorithms are competitive with established open-source machine learning libraries such as H2O and Spark MLlib.
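The speaker notes describe invoking the ddR K-means library after choosing a backend; the sketch below follows that description, with the package name kmeans.ddR, the data-generation helper and the dkmeans() signature all being illustrative assumptions:

library(ddR)
library(kmeans.ddR)        # distributed K-means written with the ddR API (assumed name)
useBackend(parallel)       # or useBackend(distributedR) for HPE Distributed R

# generate input in parallel, one partition of rows per worker
# (genData and the sizes below are assumptions for illustration)
genData <- function(i) matrix(runif(1e4 * 100), ncol = 100)
X <- dmapply(genData, 1:4,
             output.type = "darray", combine = "rbind", nparts = c(4, 1))

model <- dkmeans(X, centers = 500)   # hypothetical call signature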
useBackend(distributedR)
will run on the HPE Distributed R backend
 ddR follows an object-oriented programming pattern
 The main ddR package defines the abstract
classes for distributed objects, while backend
drivers are required to extend these classes
via inheritance
 This permits drivers to override default
generic operators in ddR
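A minimal S4 sketch of this inheritance pattern; the class and generic names below are purely hypothetical and do not reflect ddR's actual class hierarchy:

library(methods)

# hypothetical abstract distributed object defined by the core package
setClass("DObject", representation(nparts = "numeric"))

# hypothetical backend-specific subclass provided by a driver package
setClass("MyBackendDList", representation(handles = "list"), contains = "DObject")

setGeneric("do_collect", function(x, ...) standardGeneric("do_collect"))

# default behaviour in the core package ...
setMethod("do_collect", "DObject", function(x, ...) stop("not implemented"))

# ... overridden by the backend driver for its own class
setMethod("do_collect", "MyBackendDList", function(x, ...) {
  unlist(x@handles, recursive = FALSE)   # e.g. fetch and concatenate partitions
})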
 ddR algorithms can be executed on a variety of backends, such as R's parallel, SNOW, HPE Distributed R and Spark, in both single-server and multi-server setups.
 ddR algorithms have good performance and
scalability, and are competitive with algorithms
available in other products.
 There is very little overhead in using ddR's abstractions. Algorithms implemented in ddR have similar performance to algorithms written directly in the respective backend.
 Single Server Setup
 Multi Server Setup
To create 500 decision trees
from 1M observations with
10 features:
Default algorithm in R takes
about 28 minutes to
converge.
Using ddR, Distributed R can
reduce the execution time to
about 5 minutes with 12
cores.
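The speaker notes explain the parallelisation strategy: broadcast the training data to every worker via MoreArgs, let each worker grow a share of the trees with the single-threaded randomForest, and combine the forests on the master. A minimal sketch under those assumptions (the dataset and worker count here are illustrative):

library(ddR)
library(randomForest)
useBackend(parallel)

# illustrative training data
train  <- iris[, 1:4]
labels <- iris[, 5]

# 10 workers x 50 trees = 500 trees; the input is broadcast via MoreArgs
forests <- dmapply(function(i, x, y) randomForest(x, y, ntree = 50),
                   1:10, MoreArgs = list(x = train, y = labels))

# gather the per-worker forests and combine them into one model
model <- do.call(randomForest::combine, collect(forests))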
To cluster into 500 groups
from 1.2M points with 100
attributes:
Default algorithm in R takes
about 482s for each iteration
of K-means.
When using SNOW, ddR
version of K-means takes 96s
with 12 cores.
HPE Distributed R and parallel provide the best performance in this setup, completing each K-means iteration in just 10s with 12 cores.
For regression, 12M records, each with 50 features, are used.
R's single-threaded regression
algorithm converges in 141s.
The ddR regression algorithm
on HPE Distributed R takes
155s with a single core but
converges in 33s with 12
cores.
The parallel version is faster
and converges in around 20s
with 12 cores.
The ddR version of K-means on parallel is about 1.5 times faster than H2O's K-means.
For example, ddR
can complete each
iteration in less than
7s with parallel
using 12 cores
compared to more
than 10s by H2O.
The figure compares ddR's regression implementation on parallel with H2O, where H2O is slightly faster at 8 and 12 cores.
However, if the data size is increased 5 times, H2O crashes while ddR's scalability remains the same on the HPE Distributed R and parallel backends.
The figure shows that Spark MLlib's K-means algorithm has similar performance to H2O, and is slightly slower than the ddR algorithm running on parallel.
The figure shows that the regression implementation in Spark MLlib, when using 4 or fewer cores, is about 2 times slower than both H2O and ddR's implementations on parallel or HPE Distributed R.
At 8 or more cores the performance of Spark MLlib is comparable to, but still below, that of the other systems.
The same ddR algorithms that work on a single
server can also run in multi-server mode with
the appropriate backend.
They can process hundreds of gigabytes of data and provide similar scalability to custom implementations.
To utilize multiple servers, a dataset of about 95GB with 120M rows and 100 features per record is used.
The custom regression algorithm in Distributed R takes 227s per iteration with a single server, which reduces to 74s with 8 servers.
The ddR version of regression, running on Distributed R as the backend, takes about 251s to complete an iteration, which reduces to 97s with 8 servers.
The custom Distributed R implementation is only 23% faster than the ddR version, but the ddR algorithm can run on other backends, giving R users a single interface.
To utilize multiple servers, a dataset of about 180GB with 240M rows (also 30M, 60M and 120M rows) and 100 features per record is used.
Observations:
First, when Spark is used as the backend, the ddR algorithm takes around 7 minutes per iteration. With Distributed R as the backend, the per-iteration time of ddR is around 6 minutes.
Second, if a user has both backends installed, they can therefore choose to run the application written in ddR, without any modifications, on Distributed R for better performance.
Finally, the evaluation shows that the ddR algorithm gives the same or better performance than the custom algorithm.
 Distributed frameworks
◦ MapReduce, Pig, Hive, DryadLINQ, the Mahout library, Spark, Pregel, GraphLab, Concerto, Storm, Naiad, Ricardo, RHadoop, SparkR, SystemML, etc.
 Databases and machine learning
◦ The most popular is dplyr
◦ Oracle, HPE Vertica and MS SQL Server embed R in their databases; also MADlib, SAP HANA, etc.
 Parallel libraries in R
◦ parallel, SNOW, foreach, Rmpi, HPE Distributed R, etc.
 ddR is a standardized system that is easy to use and offers good performance
 ddR is the first step in extending the R language and providing a unified interface for distributed computing
 Write once, run everywhere
 Apache Mahout, Spark
 HP Vertica and Hadoop
 Revolution R Enterprise ScaleR
 Hadoop and MapR
 H2O: Machine learning library
 ddR: Distributed data structures in R
Editor's Notes

1. There is no general mechanism for partitioned data that lets us operate directly on those partitions; distributed data structures are the idea for working on that, and some options already exist, such as dplyr and Oracle R. There are various tools like fork(), socket communication or MPI (Message Passing Interface) through which we can achieve parallelism at a low level. On top of these there are R packages like parallel and SNOW (Simple Network of Workstations); foreach and BiocParallel provide a functional, object-oriented approach to parallel computing on top of the parallel and snow packages. MPP DBs (Massively Parallel Processing databases), Spark, Vertica: Apache Spark is an open-source cluster-computing framework, originally developed at the University of California, Berkeley's AMPLab; the Spark codebase was later donated to the Apache Software Foundation. Vertica is a high-performance SQL analytics engine with integrated offerings, on premise, in the cloud, or on Hadoop. foreach: the foreach package provides a new looping construct for executing R code repeatedly; the main reason for using it is that it supports parallel execution. It can be used with a variety of different parallel computing systems, including NetWorkSpaces and snow. In addition, foreach can be used with iterators, which allow the data to be specified in a very flexible way.
2. aggregate(v ~ group, x, sum)
aggregate(v ~ group, bpvec(x, function(px) aggregate(v ~ group, px, sum), AGGREGATE = rbind), sum)
3. Idiosyncrasy: a behavioral attribute that is distinctive and peculiar to an individual. Idiom: a manner of speaking that is natural to native speakers of a language. Comments taken from the open SparkR JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7264
4. dmapply: a distributed version of mapply with several important differences.
parts: retrieves, as a list of independent objects, pointers to each individual partition of the input. parts() is primarily used in conjunction with dmapply when functions are written to be applied over partitions of distributed objects.
collect: fetch partition(s) of a 'darray', 'dframe' or 'dlist' from remote workers.
5. x <- 1:12
x
[1]  1  2  3  4  5  6  7  8  9 10 11 12
dim(x)
NULL
dim(x) <- c(3, 4)
x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
dim(x)
[1] 3 4
6. ddR is based on these three, so anything that works on parallel will work on ddR. parallel is the default backend for ddR. We have to use the useBackend() function to select the desired backend.
7. ddR is implemented in three layers. Top layer: the application code, such as a distributed algorithm, which makes calls to the ddR API (e.g. dmapply) and associated utility functions (e.g. colSums). Second layer: the core ddR package, which contains the implementations of the ddR API; this layer is responsible for error checking and other tasks common across backends, and invokes the underlying backend driver to delegate tasks. It consists of about 2,500 lines of code that provide generic definitions of distributed data structures and classes that the backend driver can extend. Third layer: the backend driver (usually implemented as a separate R package such as distributedR.ddR), responsible for implementing the generic distributed classes and functions for that particular backend; typically a backend driver implementation may involve 500-1,000 lines of code.
8. aggregate: the most basic uses of aggregate involve base functions such as mean and sd. It is indeed one of the most common uses of aggregate to compare the mean or other properties of sample groups.
9. Function broadcast: a common programming paradigm is to apply a function on each element of a data structure. In fact, programmers can also express that a function should be applied to each partition at a time instead of each element at a time by calling dmapply(FUN, parts(A)). Data broadcast: in some cases, programmers need to include the same data in all invocations of a function. As an example, consider the K-means clustering algorithm that iteratively groups input data into K clusters. In each iteration, the distance of the points to the centers has to be calculated, which means the centers from the previous iteration have to be available to all invocations of the distance calculation function. Partition based: the dmapply approach allows programmers to operate on any subset of partitions that contain distributed data. Here partitions 1 and 2 are working on partition 3's data. Figure 6: Example computation patterns in ddR.
  10. A darray is a collection of array partitions. In this example, the darray is partitioned into 4 blocks, and each server holds only one partition. The darray argument nparts in the figure specifies how the partitions are located in a grid. We can also use dframe instead of darray
11. The above simple example creates a distributed list and accesses its partitions. Line 2 declares a distributed list which holds the numbers 1 to 5. By default it will be stored as five partitions, each containing one number. In line 4, p is a local R list (not a distributed list) which has five elements, and each element is a reference to a partition of A. Line 6 executes a function on each element of p, which means each partition of A, and multiplies each partition by 2. The result B is a dlist that has five partitions and is stored across multiple nodes. Line 8 gathers the result into a single local R list and prints it.
12. In the above simple example there are only two distributed lists, A and B, as inputs, and the function is sum. The runtime will extract the first element of A and B and apply sum on it, extract the corresponding second elements, and so on. Line 4 in Figure 5 shows the corresponding program and its results. The MoreArgs argument is a way to pass a list of objects that are available as an input to each invocation of the function. As an example, in line 6, the constant z is passed to every invocation of sum, and hence 1 is added to each element of the previous result C.
13. The above code shows how programmers can invoke ddR's distributed clustering algorithm. Line 1 imports the ddR package, while line 2 imports a distributed K-means library written using the ddR API. Line 4 determines the backend on which the functions will be dispatched. In this example the backend used is the default parallel backend, which is single-node but can use multiple cores. In line 6, the input data is generated in parallel by calling a user-written function genData using dmapply. The input is returned as a distributed array with as many partitions as the number of cores in the server. Finally, in line 8, the ddR version of the K-means algorithm is invoked to cluster the input data in parallel. The key advantage of this ddR program is that the same code will run on a different backend, such as HPE Distributed R, if line 4 is simply changed to useBackend(distributedR).
  14. Above code shows one implementation of distributed randomforest using ddR. Randomforest is an ensemble learning method that is used for classification and regression. The training phase creates a number of decision trees, such as 500 trees, that are later used for classification. Since training on large data can take hours, it is common to parallelize the computationally intensive step of building trees. Line 3 uses a simple parallelization strategy of broadcasting the input to all workers by specifying it in MoreArgs. Each worker then builds 50 trees in parallel (ntree=50) by calling the existing single threaded randomforest function. At the end of the computation, all the trees are collected at the master and combined to form a single model in line 5. In this example, the full contents of a single data structure are broadcast to all workers.
  15. Using ddR, we can parallelize the tree building phase by assigning each core to build a subset of the 500 trees. By using multiple cores, each of the backends parallel, SNOW, and HPE Distributed R can reduce the execution time to about 5 minutes with 12 cores.
  16. HPE Distributed R and parallel provide the best performance in this setup, completing each K-means iteration in just 10s with 12 cores. ddR version line -?? The performance of SNOW is worse than others because of its inefficient communication layer. SNOW incurs high overheads when moving the input data from the master to the worker processes using sockets.
  17. Since this dataset is multi-gigabyte, SNOW takes tens of minutes to converge, of which most of the time is spent in moving data between processes. Therefore, we exclude SNOW from the figure.
  18. Figure 10 Topic: 6.1.2
19. Figure 11 Topic: 6.1.2 The reason for the slight performance advantage is that H2O uses multi-threading instead of multi-processing (as used by parallel), which lowers the cost of sharing data across workers.
  20. Figure 10 Topic: 6.1.2
  21. Figure 11 Topic: 6.1.2
  22. Figure 12
  23. Figure 13
  24. Figure 13