3. Large-Scale Processing Frameworks
Data-parallel frameworks – MapReduce/Dryad (2004)
– Process each record in parallel
– Use case: Computing sufficient statistics, analytics queries
Graph-centric frameworks – Pregel/GraphLab (2010)
– Process each vertex in parallel
– Use case: Graphical models
Array-based frameworks – MadLINQ (2012)
– Process blocks of array in parallel
– Use case: Linear Algebra Operations
4. PageRank using Matrices
Power method: repeatedly multiply p = M*p; p converges to the dominant eigenvector of M
M = web graph matrix
p = PageRank vector
Simplified algorithm: repeat { p = M*p }
PageRank reduces to linear algebra operations on sparse matrices
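As a hedged illustration of the power method above (not Presto itself), here is a small Python sketch using SciPy sparse matrices; the damping factor and convergence tolerance are illustrative defaults, not from the slides:

```python
import numpy as np
from scipy import sparse

def pagerank(M, d=0.85, tol=1e-8, max_iter=100):
    """Power method: repeat p = M*p on a column-stochastic web-graph matrix M.
    d (damping) and tol are illustrative choices, not from the slides."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = d * (M @ p) + (1 - d) / n   # damped power iteration step
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Tiny 3-page web graph: column j holds 1/outdegree(j) for each link j -> i.
links = {0: [1, 2], 1: [2], 2: [0]}
rows, cols, vals = [], [], []
for j, outs in links.items():
    for i in outs:
        rows.append(i); cols.append(j); vals.append(1.0 / len(outs))
M = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))
ranks = pagerank(M)
```

Because M is sparse, each iteration touches only the nonzero entries, which is exactly why the slides frame PageRank as sparse linear algebra.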
17. Challenge 2 – Data Sharing
Sharing data through pipes/network
– Time-inefficient (sending copies)
– Space-inefficient (extra copies)
(Diagram: processes on Server 1 and Server 2 each hold their own local copy of the data; copies are sent between processes and across the network.)
Sparse matrices: communication overhead
18. Extend R – make it scalable, distributed
Large-scale machine learning and graph
processing on sparse matrices
23. PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(...) {
  foreach(i, 1:len,
    calculate(m=splits(M,i),
              x=splits(P), p=splits(P,i)) {
      p <- m %*% x
    })
}
Create Distributed Array
(Diagram: M is split into N/s row blocks; P is split into partitions P1, P2, …, PN/s.)
24. PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(...) {
  foreach(i, 1:len,
    calculate(m=splits(M,i),
              x=splits(P), p=splits(P,i)) {
      p <- m %*% x
    })
}
Execute function in a cluster
Pass array partitions
(Diagram: each task i receives one row block of M and the whole vector p, and writes its partition Pi of P.)
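Outside Presto, the same block-row pattern can be sketched in plain Python; darray, foreach, and splits are Presto constructs, and the thread pool below is only an illustrative stand-in for cluster workers:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Mimic the darray layout: M (N x N) split into row blocks of s rows; p is the vector.
N, s = 8, 2
rng = np.random.default_rng(0)
M = rng.random((N, N))
M /= M.sum(axis=0)            # columns sum to 1, like a web-graph matrix
p = np.full(N, 1.0 / N)

def calculate(i, x):
    """Analogue of calculate(m=splits(M,i), x=splits(P), p=splits(P,i)):
    one row block of M times the whole vector x yields partition i of the result."""
    m = M[i * s:(i + 1) * s, :]   # splits(M, i): the i-th block of s rows
    return m @ x                  # p <- m %*% x

# Analogue of foreach(i, 1:len, ...): one task per block, run concurrently.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(lambda i: calculate(i, p), range(N // s)))
p_next = np.concatenate(parts)    # the partitions P1..PN/s reassembled
```

Each task needs only its own row block plus the shared vector, which is what makes the foreach body embarrassingly parallel.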
30. Data sharing challenges
1. Garbage collection
2. Header conflict
(Diagram: two R instances reference one R object, made up of a header and a data part.)
31. Overriding R’s allocator
Allocate process-local headers
Map data in shared memory
(Diagram: the local R object header stays in process-private memory; the shared R object data part is mapped into shared memory, aligned to page boundaries.)
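Presto's allocator trick is R-specific (process-local headers, data mapped at page boundaries), but the underlying idea of sharing the data payload while each process keeps its own object metadata can be sketched in Python; the setup below is illustrative, not Presto's mechanism:

```python
import numpy as np
from multiprocessing import shared_memory

# The "data part" lives once in shared memory.
data = np.arange(6, dtype=np.float64)
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
shared[:] = data

# A second "process-local header": another array object over the same buffer.
# Each wrapper has its own metadata, but both see the one shared payload.
view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
view[0] = 42.0            # visible through every header, no copy made

result = float(shared[0])
shm.close()
shm.unlink()
```

Because only small per-process headers are duplicated, there is no garbage-collection or header conflict on the shared payload, mirroring the two challenges on the previous slide.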
43. Conclusion
• Presto: large-scale array-based framework that extends R
• Challenges with sparse matrices
• Repartitioning, sharing versioned arrays
44. IMDb Rating: 8.5
Release Date: 27 June 2008
Director: Doug Sweetland
Studio: Pixar
Runtime: 5 min
Brief:
A stage magician’s rabbit
gets into a magical onstage
brawl against his neglectful
guardian with two magic
hats.
Editor's Notes
MapReduce excels at massively parallel processing, scalability, and fault tolerance. In terms of analytics, however, such systems have been limited primarily to aggregation processing, i.e., computation of simple aggregates such as SUM, COUNT, and AVERAGE, after using filtering, joining, and grouping operations to prepare the data for the aggregation step. Although most DMSs provide hooks for user-defined functions and procedures, they do not deliver the rich analytic functionality found in statistical packages.
Virtually all prior work attempts to get along with only one type of system, either adding large-scale data management capability to statistical packages or adding statistical functionality to DMSs. This approach leads to solutions that are often cumbersome, unfriendly to analysts, or wasteful, in that a great deal of well-established technology is needlessly re-invented or re-implemented.
Convert matrix operations to MapReduce functions.
R sends aggregation-processing queries to Hadoop (written in the high-level Jaql query language), and Hadoop sends the aggregated data back to R for advanced statistical processing or visualization.
R has serious limitations when applied to very large datasets: limited support for distributed processing, no strategy for load balancing, no fault tolerance, and a working set constrained by a single server's DRAM capacity.
Large-scale machine learning and graph processing on sparse matrices
Distributed array (darray) provides a shared, in-memory view of multi-dimensional data stored across multiple servers.
Repartitioning can be used to subdivide an array into a specified number of parts. Repartitioning is an optional performance optimization which helps when there is load imbalance in the system.
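To make the repartitioning idea concrete, here is a hedged Python sketch (the function name is illustrative, not Presto's API): an overloaded block is subdivided so its pieces can be scheduled on different workers.

```python
import numpy as np

def repartition(block, parts):
    """Subdivide one array block row-wise into `parts` roughly equal pieces,
    so a straggler block can be spread across workers (illustrative only)."""
    return np.array_split(block, parts, axis=0)

# One 7x3 block dominates the schedule; split it into 3 pieces.
block = np.arange(21.0).reshape(7, 3)
pieces = repartition(block, 3)
rebuilt = np.vstack(pieces)      # the pieces still compose to the original block
```

As the note says, this is purely a performance optimization: the split pieces reassemble to the original block, so program semantics are unchanged.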
Note that for programs with general data structures (e.g., trees), writing invariants is difficult. However, for matrix computation, arrays are the only data structure, and the relevant invariant is compatibility of array sizes.
Qian’s comment: the same concept as snapshot isolation.