ddR is an R package that introduces distributed data structures such as darray, dframe, and dlist, and provides a standardized API for distributed iteration and data manipulation through functions like dmapply. ddR aims to make distributed computing in R easier to use while retaining good performance: algorithms are written once against its unified interface and can run on different distributed backends such as Spark and HPE Distributed R. Evaluation shows that ddR algorithms perform comparably to, or better than, custom implementations and other machine learning libraries.
dmapply: A functional primitive to express distributed machine learning algorithms in R
1. Authors: Edward Ma, Vishrut Gupta, Michun Hsu and Indrajit Roy
Presenter: Bikash Chandra Karmoakr, M.Sc. Student, ITIS, Leibniz Universität Hannover
Seminar on Database as a Service
Date: 26/01/2017
Place: TU Clausthal
3. Introduction
Distributed Computing and Data Structures in R
Challenges faced by R users and Objectives of ddR
ddR components and package structure
Communication and computation patterns using dmapply
Some examples and machine learning algorithms
Comparison with other packages and performance evaluation
Conclusion
References
4. R is one of the top choices for statisticians and data scientists.
ddR (Distributed Data Structures in R) was created to build a unified system that works across different distributed frameworks in R.
ddR introduces dmapply, which executes functions on distributed data structures.
dmapply offers a standardized API that is easy to use while providing enough flexibility and good performance.
6. Many applications reuse data:
◦ Multi-analysis on the same data: load once, run many operations
◦ Iterative algorithms: most machine learning and graph algorithms
Persistent, abstract references:
◦ Avoid data movement overhead (send, collect, send cycles)
◦ Enable caching
The analyst wants to express high-level data manipulations, NOT explicitly iterate over chunks.
7. Interfaces to distributed systems are custom, low-level, and non-idiomatic.
Spark has 50+ operators!
◦ map, flatMap, mapPartitions, mapPartitionsWithIndex...
◦ Lacks the common array, list, and data.frame operations that R users expect
◦ SparkR provides some abstraction, but has its own idiosyncrasies.
What if there were an API based on distributed data structures?
8. Standardize a unified API for distributed:
◦ Iteration
◦ Data structures
Enable:
◦ Basic manipulation and reduction of distributed data (lists, data frames, arrays)
◦ Implementation of parallel algorithms through low-level primitives
◦ Write once, run everywhere
10. 1. Iteration: common parallel operators for distributed data structures
◦ mapply() -> dmapply()
◦ lapply() -> dlapply()
◦ New: parts(), collect()
2. Data structures: distributed variants of core R data structures:
◦ array -> darray
◦ data.frame -> dframe
◦ list -> dlist
3. Shared infrastructure for backend implementations (Spark, Distributed R, ...)
The ddR package is not a new distributed infrastructure!
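To make the mapping concrete, here is a minimal sketch of the mapply -> dmapply correspondence, assuming the ddR package with its default parallel backend (the input values are illustrative):

```r
library(ddR)

# Base R: element-wise sum over two sequences
local_res <- mapply(sum, 1:3, 4:6)      # 5 7 9

# ddR: the same computation over distributed lists
A <- dlist(1, 2, 3)
B <- dlist(4, 5, 6)
C <- dmapply(sum, A, B)                 # C is a distributed dlist
print(collect(C))                       # fetch to the master: 5 7 9
```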
11. Distributed versions of array, list, and data.frame with conventional APIs:
◦ Accessors: parts, dim, names
◦ Summaries: mean, median, head, tail, rowSums, aggregate
◦ Sorting: sort
◦ Combination: c, cbind, rbind, merge
◦ Iteration: lapply, split
◦ Math and comparisons on arrays, transform on data.frames
◦ Distributed I/O, e.g. dread("data.csv")
Distributed iteration primitives for implementing algorithms: dmapply()
Enhanced ease of use, maintainability, and portability due to the standard API
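As a sketch of how these conventional operators compose, the snippet below builds a distributed array and inspects it. Note: the output.type, combine, and nparts arguments to dmapply are taken from the ddR paper's API description and are an assumption beyond what this slide lists.

```r
library(ddR)

# Build a darray from four 2x4 random blocks stacked by rows
DA <- dmapply(function(i) matrix(runif(8), nrow = 2), 1:4,
              output.type = "darray", combine = "rbind",
              nparts = c(4, 1))          # 4x1 grid of partitions

dim(DA)             # standard accessor: 8 x 4
M <- collect(DA)    # fetch the full matrix on the master
head(M)
```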
12. [Backend architecture diagram] ddR is the API package with the data structures + common operations. Third-party wrapper packages (parallel, distributedR.ddR, spark.ddR) delegate to existing backend interfaces; the active backend is selected with useBackend(parallel), useBackend(distributedR), or useBackend(spark).
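A minimal sketch of the "write once, switch backends" flow implied by the diagram; the package and driver names are those shown above:

```r
library(ddR)
useBackend(parallel)            # default single-server multicore backend

# library(distributedR.ddR)     # wrapper package for HPE Distributed R
# useBackend(distributedR)      # same program, different backend

A <- dlist(1, 2, 3, 4)
print(collect(dmapply(function(x) x + 1, A)))   # {2, 3, 4, 5}
```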
14. dlist(......., nparts, psize)
◦ Similar to the list() convention
◦ nparts and psize control partition count and size, respectively
dmapply(FUN, X, Y, MoreArgs = list(), nparts)
◦ Apply FUN to elements of X and Y, returning a dlist
parts(L)
◦ Return the set of partitions as a list of dlist objects
collect(L)
◦ Return the in-memory base R list representation of L
◦ Generally only used after aggregation
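A short sketch of these primitives together, assuming the default parallel backend: nparts fixes the partition count, and dmapply over parts() operates per partition rather than per element.

```r
library(ddR)

A <- dlist(1, 2, 3, 4, 5, 6, nparts = 2)   # two partitions of three elements
p <- parts(A)                               # local list of partition handles

# Each invocation receives one whole partition as a local list
sizes <- dmapply(function(part) length(part), p)
print(collect(sizes))                       # 3 3
```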
18. 1. # Create a distributed list. By default each element becomes a partition
2. A <- dlist(1,2,3,4,5)
3. # Access partitions
4. p <- parts(A)
5. # Multiply elements in each partition by a constant
6. B <- dmapply(function(x){ 2*x[[1]] }, p)
7. # Fetch the result (= {2, 4, 6, 8, 10}) on the master
8. print(collect(B))
19. 1. A <- dlist(1,2,3,4)
2. B <- dlist(11,12,13,14)
3. # C will be a dlist = {12,14,16,18}
4. C <- dmapply(FUN=sum, A, B)
5. # D will be a dlist = {13,15,17,19}
6. D <- dmapply(FUN=sum, A, B, MoreArgs=list(z=1))
7. print(collect(D))
20. Three machine learning algorithms are tested here:
1. randomforest, a decision-tree-based ensemble learning method,
2. the K-means clustering algorithm, and
3. linear regression.
The ddR versions of these algorithms are competitive with established open source machine learning libraries such as H2O and Spark MLlib.
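A hedged sketch of how such a ddR algorithm is invoked, mirroring the K-means walkthrough in the editor's notes below. The package name kmeans.ddR and the function name dkmeans are assumptions based on the ddR project's algorithm packages, and genData is a hypothetical user-written generator:

```r
library(ddR)
library(kmeans.ddR)       # assumed: distributed K-means built on the ddR API
useBackend(parallel)      # the same code could run on distributedR or spark

# Generate input in parallel as a distributed array, one partition per worker
genData <- function(i) matrix(runif(1000 * 10), ncol = 10)
X <- dmapply(genData, 1:4, output.type = "darray",
             combine = "rbind", nparts = c(4, 1))

model <- dkmeans(X, centers = 5)   # cluster the distributed data in parallel
```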
23. ddR follows an object-oriented programming pattern.
The main ddR package defines the abstract classes for distributed objects, while backend drivers are required to extend these classes via inheritance.
This permits drivers to override the default generic operators in ddR.
24. ddR algorithms can indeed be executed on a variety of backends such as R's parallel, SNOW, HPE Distributed R, and Spark, in both single-server and multi-server setups.
ddR algorithms have good performance and scalability, and are competitive with algorithms available in other products.
There is very little overhead to using ddR's abstractions: algorithms implemented in ddR have performance similar to algorithms written directly in the respective backend.
26. To create 500 decision trees from 1M observations with 10 features:
The default algorithm in R takes about 28 minutes to converge.
Using ddR, Distributed R can reduce the execution time to about 5 minutes with 12 cores.
27. To cluster 1.2M points with 100 attributes into 500 groups:
The default algorithm in R takes about 482s per iteration of K-means.
With SNOW, the ddR version of K-means takes 96s with 12 cores.
HPE Distributed R and parallel provide the best performance in this setup, completing each K-means iteration in just 10s with 12 cores.
28. For regression, 12M records each with 50 features are used.
R's single-threaded regression algorithm converges in 141s.
The ddR regression algorithm on HPE Distributed R takes 155s with a single core but converges in 33s with 12 cores.
The parallel version is faster and converges in around 20s with 12 cores.
29. The ddR version of K-means on parallel is about 1.5 times faster than H2O's K-means.
For example, ddR can complete each iteration in less than 7s with parallel using 12 cores, compared to more than 10s for H2O.
30. The figure compares ddR's regression implementation on parallel with H2O; H2O is slightly faster at 8 and 12 cores.
However, if the data size increases fivefold, H2O crashes while ddR's scalability remains the same on the HPE Distributed R and parallel backends.
31. The figure shows that Spark MLlib's K-means algorithm has performance similar to H2O, and is slightly slower than the ddR algorithm running on parallel.
32. The figure shows that the regression implementation in Spark MLlib, when using 4 or fewer cores, is about 2 times slower than both H2O and ddR's implementation on parallel or HPE Distributed R.
At 8 or more cores the performance of Spark MLlib is comparable to, but still below, the other systems.
33. The same ddR algorithms that work on a single server can also run in multi-server mode with an appropriate backend, processing hundreds of gigabytes of data with scalability similar to custom implementations.
34. To utilize multiple servers, a dataset of about 95GB with 120M rows and 100 features per record is used.
The custom regression algorithm in Distributed R takes 227s per iteration on a single server, which drops to 74s with 8 servers.
The ddR version of regression, running on Distributed R as the backend, takes about 251s per iteration, which drops to 97s with 8 servers.
The custom Distributed R version is only 23% faster than the ddR version, while the ddR algorithm can run on other backends, giving R users a single interface.
35. To utilize multiple servers, a dataset of about 180GB with 240M rows (also run with 30M, 60M, and 120M rows) and 100 features per record is used.
Observations:
First, when Spark is used as the backend, the ddR algorithm takes around 7 minutes per iteration; with Distributed R as the backend, the per-iteration time of ddR is around 6 minutes.
Second, a user who has both backends installed can therefore choose to run the application written in ddR, without any modifications, on Distributed R for better performance.
Finally, the evaluation shows that the ddR algorithm gives the same or better performance than the custom algorithm.
36. Distributed frameworks
◦ MapReduce, Pig, Hive, DryadLINQ, the Mahout library, Spark, Pregel, GraphLab, Concerto, Storm, Naiad, Ricardo, RHadoop, SparkR, SystemML, etc.
Databases and machine learning
◦ Most popular: dplyr
◦ Oracle, HPE Vertica, and MS SQL Server embed R in their databases; also MADlib, SAP HANA, etc.
Parallel libraries in R
◦ parallel, SNOW, foreach, Rmpi, HPE Distributed R, etc.
37. ddR is a standardized system that is easy to use and performs well.
ddR is a first step in extending the R language and providing a unified interface for distributed computing.
Write once, run everywhere.
38. Apache Mahout, Spark
HPE Vertica and Hadoop
Revolution R Enterprise ScaleR
Hadoop and MapR
H2O: machine learning library
ddR: distributed data structures in R
Editor's Notes
There is no general abstraction for data that is partitioned so that we can operate on those partitions; distributed data structures are the idea for working at that level.
Some already exist, like dplyr, Oracle R, etc.
There are various low-level tools, like fork(), socket communication, or MPI (Message Passing Interface), through which we can achieve parallelism.
On top of those there are R packages like parallel and SNOW (Simple Network of Workstations).
foreach and BiocParallel -> a functional, object-oriented approach to parallel computing on top of the parallel and SNOW packages.
MPP DBs (Massively Parallel Processing databases), Spark, Vertica ->
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation.
Vertica is a next-generation high-performance SQL analytics engine with integrated offerings to meet your varying needs: on premise, in the cloud, or on Hadoop. Your needs are unique; your analytics database should be too. Evaluate HPE Vertica today!
foreach:
The foreach package provides a new looping construct for executing R code repeatedly. The main reason for using the foreach package is that it supports parallel execution. The foreach package can be used with a variety of different parallel computing systems, including NetWorkSpaces and snow. In addition, foreach can be used with iterators, which allow the data to be specified in a very flexible way.
BiocParallel:
Idiosyncrasy: a behavioral attribute that is distinctive and peculiar to an individual.
Idiom: a manner of speaking that is natural to native speakers of a language.
Comments taken from the open SparkR JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7264
dmapply: distributed version of mapply, with several important differences.
parts: retrieves, as a list of independent objects, pointers to each individual partition of the input. parts() is primarily used in conjunction with dmapply when functions are written to be applied over partitions of distributed objects.
collect: fetches partition(s) of a darray, dframe, or dlist from remote workers.
ddR is based on these three, so anything that works on parallel will work on ddR; parallel is the default backend for ddR. We use the useBackend() function to select a different backend.
ddR is implemented in three layers:
Top layer: the application code, such as a distributed algorithm, which makes calls to the ddR API (e.g. dmapply) and associated utility functions (e.g. colSums).
Second layer: the core ddR package, which contains the implementations of the ddR API. This layer is responsible for error checking and other tasks common across backends, and invokes the underlying backend driver to delegate tasks. It consists of about 2,500 lines of code that provide generic definitions of distributed data structures and classes that the backend driver can extend.
Third layer: the backend driver (usually implemented as a separate R package such as distributedR.ddR), which is responsible for implementing the generic distributed classes and functions for that particular backend. Typically, a backend driver implementation involves 500-1,000 lines of code.
Aggregate
---------------
The most basic uses of aggregate involve base functions such as mean and sd. It is indeed one of the most common uses of aggregate to compare the mean or other properties of sample groups.
Function broadcast:
A common programming paradigm is to apply a function to each element of a data structure. Programmers can also express that a function should be applied to each partition at a time, instead of each element at a time, by calling dmapply(FUN, parts(A)).
Data broadcast:
In some cases, programmers need to include the same data in all invocations of a function. As an example, consider the K-means clustering algorithm, which iteratively groups input data into K clusters. In each iteration, the distance of the points to the centers has to be calculated, which means the centers from the previous iteration have to be available to all invocations of the distance-calculation function.
Partition based:
The dmapply approach allows programmers to operate on any subset of partitions that contain distributed data. Here, parts 1 and 2 work on part 3's data.
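A minimal sketch of the first two patterns, assuming the default parallel backend; the centers vector is purely illustrative:

```r
library(ddR)

A <- dlist(1, 2, 3, 4, nparts = 2)

# Function broadcast: FUN runs once per partition instead of once per element
per_part <- dmapply(function(part) sum(unlist(part)), parts(A))
print(collect(per_part))                 # 3 7

# Data broadcast: the same object reaches every invocation via MoreArgs,
# as the K-means centers do in the example above
centers <- c(0, 10)
nearest <- dmapply(function(x, centers) which.min(abs(x - centers)),
                   A, MoreArgs = list(centers = centers))
print(collect(nearest))                  # 1 1 1 1
```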
Figure 6: Example computation patterns in ddR.
A darray is a collection of array partitions. In this example, the darray is partitioned into 4 blocks, and each server holds only one partition. The darray argument nparts in the figure specifies how the partitions are laid out in a grid. We could also use a dframe instead of a darray.
This simple example creates a distributed list and accesses its partitions. Line 2 declares a distributed list which holds the numbers 1 to 5. By default it is stored as five partitions, each containing one number. In line 4, p is a local R list (not a distributed list) which has five elements, each a reference to a partition of A. Line 6 executes a function on each element of p, i.e. on each partition of A, and multiplies each partition by 2. The result B is a dlist, has five partitions, and is stored across multiple nodes. Line 8 gathers the result into a single local R list and prints it.
In this simple example there are only two distributed lists, A and B, as inputs, and the function is sum. The runtime will extract the first elements of A and B and apply sum to them, then the corresponding second elements, and so on. Line 4 in Figure 5 shows the corresponding program and its results. The MoreArgs argument is a way to pass a list of objects that are available as input to each invocation of the function. For example, in line 6, the constant z is passed to every invocation of sum, and hence 1 is added to each element of the previous result C.
The above code shows how programmers can invoke ddR's distributed clustering algorithm. Line 1 imports the ddR package, while line 2 imports a distributed K-means library written using the ddR API. Line 4 determines the backend on which the functions will be dispatched; in this example the backend used is the default parallel backend, which is single-node but can use multiple cores. In line 6, the input data is generated in parallel by calling a user-written function genData using dmapply. The input is returned as a distributed array with as many partitions as the number of cores in the server. Finally, in line 8, the ddR version of the K-means algorithm is invoked to cluster the input data in parallel. The key advantage of this ddR program is that the same code will run on a different backend, such as HPE Distributed R, if line 4 is simply changed to useBackend(distributedR).
The above code shows one implementation of distributed randomforest using ddR. Randomforest is an ensemble learning method used for classification and regression. The training phase creates a number of decision trees, such as 500 trees, that are later used for classification. Since training on large data can take hours, it is common to parallelize the computationally intensive step of building trees. Line 3 uses a simple parallelization strategy of broadcasting the input to all workers by specifying it in MoreArgs. Each worker then builds 50 trees in parallel (ntree=50) by calling the existing single-threaded randomforest function. At the end of the computation, all the trees are collected at the master and combined to form a single model in line 5. In this example, the full contents of a single data structure are broadcast to all workers.
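A hedged sketch of the broadcast-and-combine strategy this note describes; the iris dataset and the ten-worker split are illustrative rather than from the slides, and randomForest::combine is the standard helper for merging forests:

```r
library(ddR)
library(randomForest)

workers <- dlist(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # one element per worker

# Broadcast the full training data to every worker via MoreArgs;
# each worker grows 50 trees with the single-threaded randomForest()
forests <- dmapply(function(i, dat) {
  randomForest(Species ~ ., data = dat, ntree = 50)
}, workers, MoreArgs = list(dat = iris))

# Collect the ten 50-tree forests on the master and merge into one model
model <- do.call(combine, collect(forests))
```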
Using ddR, we can parallelize the tree building phase by assigning each core to build a subset of the 500 trees. By using multiple cores, each of the backends parallel, SNOW, and HPE Distributed R can reduce the execution time to about 5 minutes with 12 cores.
HPE Distributed R and parallel provide the best performance in this setup, completing each K-means iteration in just 10s with 12 cores.
The performance of SNOW is worse than others because of its inefficient communication layer. SNOW incurs high overheads when moving the input data from the master to the worker processes using sockets.
Since this dataset is multi-gigabyte, SNOW takes tens of minutes to converge, of which most of the time is spent in moving data between processes. Therefore, we exclude SNOW from the figure.
Figure 10 (Topic 6.1.2)
Figure 11 (Topic 6.1.2)
The reason for the slight performance advantage is that H2O uses multi-threading instead of multi-processing (as done by parallel), which lowers the cost of sharing data across workers.