The document discusses accelerating machine learning algorithms by integrating GPUs into MapReduce clusters. It proposes modifying the MapReduce runtime to meet the needs of machine learning workloads and to integrate massively parallel processors such as GPUs, noting that many machine learning algorithms can be expressed directly in MapReduce primitives. The implementation would allow multithreaded MapReduce tasks, interleaving of parallel BLAS operations over static and variable data, and support for iterative algorithms through stateful nodes. Together, these changes could accelerate machine learning on big data by exploiting GPU parallelism within the MapReduce framework.
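Concretely, an algorithm is MapReduce-friendly when it can be written in "summation form": each mapper emits partial sufficient statistics for its data shard, and the reducer only has to sum them. A minimal Python sketch for least-squares regression (plain `map`/`reduce` over in-memory shards, not the paper's modified runtime):

```python
import numpy as np
from functools import reduce

# Toy illustration of "summation form": each mapper computes partial
# sufficient statistics (X^T X, X^T y) for its shard; the reducer sums
# them; a single small solve finishes the job.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true

shards = np.array_split(np.arange(1000), 4)   # pretend: 4 mapper inputs

def mapper(idx):
    Xs, ys = X[idx], y[idx]
    return Xs.T @ Xs, Xs.T @ ys

def reducer(a, b):
    return a[0] + b[0], a[1] + b[1]

XtX, Xty = reduce(reducer, map(mapper, shards))
w_hat = np.linalg.solve(XtX, Xty)             # recovers w_true (noiseless data)
```

The same pattern covers k-means, naive Bayes, EM, and many other algorithms, which is what makes the MapReduce framing attractive.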
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more detail or to submit your article, please visit www.ijera.com
The document presents a new block cipher that blends concepts from the modified Feistel cipher and advanced Hill cipher. The cipher uses an involutory key matrix K to encrypt plaintext matrices P and Q through iterative applications of mixing, permutation, and XOR operations per equations 1.1 and 1.2. Cryptanalysis shows the cipher is strong as the encryption equations are nonlinear and functions like Shift() and Mix() cause diffusion in each round. The encryption and decryption processes are illustrated through flowcharts and algorithms.
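The involutory-key property at the heart of the Hill-cipher half can be sketched on its own: because K·K ≡ I (mod 26), the same matrix both encrypts and decrypts a block. The key below is a hypothetical example constructed to be involutory, not a key from the paper, and the sketch omits the Feistel rounds, Shift() and Mix():

```python
import numpy as np

# Hypothetical involutory key: trace = 0 and det = -1 (mod 26) makes a
# 2x2 matrix its own inverse mod 26.
K = np.array([[3, 2],
              [9, 23]])
assert (K @ K % 26 == np.eye(2, dtype=int)).all()   # involutory check

def hill(block, key):
    """Encrypt or decrypt one column of letter codes (mod 26)."""
    return key @ block % 26

plain = np.array([7, 4])                  # "HE"
cipher = hill(plain, K)
assert (hill(cipher, K) == plain).all()   # K is its own inverse
```

Involutory keys matter for this construction because decryption needs no matrix inversion, only a second application of the same key.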
This document proposes a low complexity algorithm for jointly estimating the reflection coefficient, spatial location, and Doppler shift of a target for MIMO radar systems. It splits the estimation problem into two parts. The first part estimates the reflection coefficient in closed form. The second part jointly estimates the spatial location and Doppler shift using a 2D FFT approach. This allows significantly lower computational complexity compared to maximum likelihood estimation. Simulation results show the proposed estimator achieves the Cramér-Rao lower bound, providing optimal performance with low complexity.
SchNet: A continuous-filter convolutional neural network for modeling quantum... (Kazuki Fujikawa)
The document summarizes a paper about modeling quantum interactions using a continuous-filter convolutional neural network called SchNet. Some key points:
1) SchNet performs convolution using distances between nodes in 3D space rather than graph connectivity, allowing it to model interactions between arbitrarily positioned nodes.
2) This is useful for cases where graphs have different configurations that impact properties, or where graph and physical distances differ.
3) The paper proposes a continuous-filter convolutional layer and interaction block to incorporate distance information into graph convolutions performed by the SchNet model.
A Unified PDE model for image multi-phase segmentation and grey-scale inpainting (vijayakrishna rowthu, phd-kanpur)
The Cahn-Hilliard equation and histogram information are the key elements of this research work; a convexity-splitting scheme combined with a Fourier-spectral method solves the model numerically.
DLT stands for Direct Linear Transformation. It is an algorithm that estimates the camera matrix P by minimizing the algebraic error between measured image points xi and projected 3D points PXi. Specifically, DLT finds P by solving the equation Ap=0, where A is constructed from point correspondences and p contains the entries of P. This minimizes the sum of squared algebraic distances between the points. For affine cameras, the algebraic and geometric distances are equivalent. DLT provides an initial estimate of P that can be refined using nonlinear optimization techniques.
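The DLT step can be sketched in a few lines: each correspondence contributes two rows to A, and the p minimizing |Ap| subject to |p| = 1 is the right singular vector with the smallest singular value. The sketch below assumes already-conditioned coordinates and uses a synthetic camera (the specific P_true and point cloud are made up for the check):

```python
import numpy as np

def dlt(points3d, points2d):
    rows = []
    for X, x in zip(points3d, points2d):
        Xh = np.append(X, 1.0)                  # homogeneous 3D point
        u, v = x
        rows.append(np.concatenate([np.zeros(4), -Xh, v * Xh]))
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)                 # null-space vector -> P

# Synthetic check: project points with a known camera, then recover it.
P_true = np.array([[800.0, 0, 320, 10],
                   [0, 800, 240, 20],
                   [0, 0, 1, 1]])
rng = np.random.default_rng(1)
pts3d = rng.uniform(-1, 1, size=(8, 3))
pts3d[:, 2] += 3.0                              # keep points in front of the camera
homog = np.c_[pts3d, np.ones(8)] @ P_true.T
pts2d = homog[:, :2] / homog[:, 2:3]

P_est = dlt(pts3d, pts2d)
P_est /= P_est[-1, -1]                          # fix the projective scale
assert np.allclose(P_est, P_true, atol=1e-4)
```

In practice the measured points are noisy, so this P_est is the algebraic initializer that a geometric (reprojection-error) refinement then improves.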
Computing Inner Eigenvalues of Matrices in Tensor Train Matrix Format (Thomas Mach)
Talk given at ENUMATH 2011 in Leicester and at the GAMM ANLA Workshop 2011 in Bremen. A preprint is available at http://www.mpi-magdeburg.mpg.de/preprints/index.php
PERFORMANCE EVALUATIONS OF GRIGORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT... (cscpconf)
A large family of signal processing techniques consists of Fourier-transforming a signal, manipulating the transformed data in a simple way, and reversing the transformation. Fourier frequency analysis is widely used in equalization of audio recordings, X-ray crystallography, artefact removal in neurological signal and image processing, voice activity detection in brainstem speech-evoked potentials, and speech processing, where spectrograms are used to identify phonetic sounds. The Discrete Fourier Transform (DFT) is a principal mathematical method for frequency analysis, and the way the DFT is split gives rise to various fast algorithms. In this paper, we present implementations of two fast DFT algorithms and evaluate their performance: the popular radix-2 Cooley-Tukey fast Fourier transform (FFT) [1], and the Grigoryan FFT based on splitting by the paired transform [2]. We evaluate the performance of these algorithms by implementing them on the Xilinx Virtex-II Pro [3] and Virtex-5 [4] FPGAs, developing our own FFT processor architectures. Finally, we show that the Grigoryan FFT runs faster than the Cooley-Tukey FFT and is consequently useful for higher sampling rates, which remain a challenge in DSP applications.
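For reference, the radix-2 Cooley-Tukey decomposition in its simplest recursive software form (the paper's contribution is the pipelined FPGA architecture, not this recursion):

```python
import numpy as np

# Radix-2 Cooley-Tukey: split the length-n DFT into DFTs of the even-
# and odd-indexed samples, then combine with twiddle factors.
# Input length must be a power of two.
def fft_radix2(x):
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    even = fft_radix2(x[0::2])                          # DFT of even samples
    odd = fft_radix2(x[1::2])                           # DFT of odd samples
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n)    # twiddle factors
    return np.concatenate([even + tw * odd, even - tw * odd])

x = np.random.default_rng(2).normal(size=16)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```

Each level halves the problem, giving the familiar O(n log n) cost; the paired-transform (Grigoryan) splitting reorganizes this recursion to reduce the operation count further.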
Optimization of distributed generation of renewable energy sources by intelligent techniques (Beniamino Murgante)
Marcello Pucci – Institute for Studies on Intelligent Systems for Automation (I.S.S.I.A), National Research Council, Palermo (Italy)
Intelligent Analysis of Environmental Data (S4 ENVISA Workshop 2009)
The document provides an overview and review of topics related to tracking and filtering fundamentals, including:
- Linear algebra and linear systems, probability, hypothesis testing, and state estimation.
- Linear and non-linear filtering, multiple model filtering, track maintenance, data association techniques, and activity control.
- Mathematics topics like linear algebra, probability, estimation, vector/matrix properties, and state-space representations are reviewed for continuous and discrete time systems. Concepts include the Jacobian, gradient, Dirac delta function, and observability criteria.
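As a concrete instance of the state-estimation material, here is a one-dimensional Kalman filter tracking a constant value from noisy measurements. This is a toy sketch, not a system from the review; the static model means predict carries the state over unchanged, and the update blends in each measurement via the Kalman gain:

```python
import numpy as np

rng = np.random.default_rng(6)
truth = 5.0
zs = truth + rng.normal(scale=1.0, size=200)   # noisy measurements
R = 1.0                                        # measurement variance

x, P = 0.0, 100.0                              # initial state and variance
for z in zs:
    # predict: static model with no process noise, so x and P carry over
    K = P / (P + R)                            # Kalman gain
    x = x + K * (z - x)                        # blend measurement into state
    P = (1 - K) * P                            # shrink state uncertainty

# After 200 measurements the estimate sits near the truth and P ~ R/200.
assert abs(x - truth) < 0.3
assert P < 0.02
```

The non-linear filters in the review replace the scalar gain computation with the Jacobian-based linearization reviewed above.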
This document describes the POTFIT algorithm for approximating multi-dimensional arrays as products of lower-dimensional matrices. It uses POTFIT to approximate a photo (represented as a 3D tensor of pixel color values) using single particle potentials. Approximating a dark photo requires fewer SPPs than a colorful photo, as errors are more obvious in colorful areas. The document shows approximations using different numbers of SPPs and the resulting file sizes.
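The underlying mechanism, approximating an array with fewer basis terms per mode, can be illustrated in its simplest two-dimensional form with a truncated SVD. This is plain low-rank approximation, not the full POTFIT algorithm, and the random "image" is made up for the demonstration:

```python
import numpy as np

# Low-rank approximation in 2-D: keep only the r largest singular
# values and watch the error fall as r (the "number of SPPs") grows.
rng = np.random.default_rng(3)
img = rng.normal(size=(64, 64)) @ rng.normal(size=(64, 64)) / 64

U, s, Vt = np.linalg.svd(img, full_matrices=False)

def approx(r):
    return (U[:, :r] * s[:r]) @ Vt[:r]         # rank-r reconstruction

errs = [np.linalg.norm(img - approx(r)) for r in (1, 4, 16, 64)]
assert errs == sorted(errs, reverse=True)      # error shrinks with rank
assert np.allclose(approx(64), img)            # full rank is exact
```

POTFIT does the analogous truncation per tensor mode, which is why a "dark" (low-information) photo needs fewer terms than a colourful one.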
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi... (Taiji Suzuki)
The document discusses the aggregated estimator technique for sparse estimation. The aggregated estimator averages over multiple models, each weighted by their risk. This allows fast learning rates without strong assumptions on the design matrix. The technique is applied to sparse regression problems using an exponential screening estimator. The risk bound of this estimator is compared to other estimators like BIC and Lasso, showing it provides a tighter bound.
Solving Unit Commitment Problem Using Chemo-tactic PSO–DE Optimization Algori... (IDES Editor)
This paper presents the Chemo-tactic PSO-DE (CPSO-DE) optimization algorithm combined with the Lagrange Relaxation (LR) method for solving the Unit Commitment (UC) problem. The proposed approach employs CPSO-DE, a hybrid heuristic algorithm based on Bacterial Foraging Optimization (BFO), Particle Swarm Optimization (PSO), and Differential Evolution (DE), to find optimal settings of the Lagrange multipliers; it provides high-quality performance and reaches a global solution. The feasibility of the proposed method is demonstrated on 10-unit, 20-unit, and 40-unit systems. The test results are compared, in terms of solution quality, with those obtained by Lagrangian relaxation (LR), genetic algorithm (GA), evolutionary programming (EP), genetic algorithm based on unit characteristic classification (GAUC), enhanced adaptive Lagrangian relaxation (ELR), integer-coded genetic algorithm (ICGA), and hybrid particle swarm optimization (HPSO). Simulation results show that the proposed method can provide a better solution.
This presentation begins by explaining the basic algorithms of machine learning and, using the same concepts, discusses in detail two supervised/deep learning algorithms: artificial neural networks (ANNs) and convolutional neural networks (CNNs). The relationship between artificial neural networks and basic machine learning algorithms such as logistic regression and softmax is also explored. For hands-on practice, the implementation of ANNs and CNNs on the MNIST dataset is also explained.
Skiena algorithm 2007 lecture18 application of dynamic programming (zukun)
The document summarizes a lecture on applications of dynamic programming. It provides examples of how to use dynamic programming to solve problems involving string breaking, high density bar code encoding, dividing work evenly among workers, and the traveling salesman problem. Dynamic programming can be applied when problems exhibit the principle of optimality and the problem space can be broken down into overlapping subproblems that are stored in a table to avoid recomputing solutions.
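The "dividing work evenly among workers" example corresponds to the linear partition problem: split a sequence of jobs into k contiguous ranges so the largest range sum is minimized. A compact DP sketch (table-based, matching the overlapping-subproblem structure the lecture describes):

```python
# M[i][j] = smallest possible maximum range-sum when the first i jobs
# are split into j contiguous ranges.
def linear_partition(jobs, k):
    n = len(jobs)
    prefix = [0]
    for x in jobs:
        prefix.append(prefix[-1] + x)          # prefix sums for O(1) range sums
    INF = float("inf")
    M = [[INF] * (k + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            for split in range(j - 1, i):      # last range is jobs[split:i]
                cost = max(M[split][j - 1], prefix[i] - prefix[split])
                M[i][j] = min(M[i][j], cost)
    return M[n][k]

# Classic example: split 1..9 into 3 ranges -> {1..5}, {6,7}, {8,9}, max 17.
print(linear_partition([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))  # 17
```

The principle of optimality holds because the best partition of the first i jobs into j ranges extends a best partition of some prefix into j-1 ranges.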
1. The document discusses various image transforms, including the discrete cosine transform (DCT), discrete wavelet transform (DWT), and contourlet transform.
2. The DCT transforms an image into the frequency domain and organizes values by their importance to the human visual system. The DWT analyzes images using wavelets of different scales and positions.
3. The contourlet transform is derived directly in the discrete domain to capture smooth contours and edges at any orientation, decoupling the multiscale and directional decompositions. It represents images more efficiently than the DWT.
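The DCT's frequency ordering, low-index coefficients carrying the content the human visual system weights most, can be seen directly from the orthonormal DCT-II matrix. A from-scratch sketch (libraries such as SciPy provide the same transform ready-made):

```python
import numpy as np

# Orthonormal 1-D DCT-II as a matrix C: row k samples a cosine of
# frequency k, so applying C sorts image content from coarse to fine.
def dct_matrix(n):
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)                       # DC row scaling
    return C

C = dct_matrix(8)
assert np.allclose(C @ C.T, np.eye(8))         # orthonormal (unitary)

row = np.linspace(0.0, 1.0, 8)                 # smooth "image row"
coeffs = C @ row
# For smooth input, energy concentrates in the first few coefficients,
# which is exactly what quantization in compression exploits.
assert np.sum(coeffs[:2] ** 2) > 0.98 * np.sum(coeffs ** 2)
```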
This document provides an outline and summaries for a two-day MATLAB workshop presented by Bhavesh Shah on 27-28 September 2012. The workshop will cover introductory topics such as what MATLAB is, the MATLAB screen interface, variables, arrays, matrices, built-in math functions, control structures, and toolboxes. It will also discuss more advanced topics like writing user-defined functions, neural networks, GUIs, and image processing. The goal is to introduce participants to the basics of using MATLAB for technical computing, modeling, simulation, and data analysis.
The document proposes a new method for encrypting two images into a single encrypted image using generalized weighted fractional Fourier transform (GWFRFT) with double random phase encoding. The encryption process involves applying pixel scrambling, phase encoding, and two rounds of GWFRFT with random phase masks on the combined image signal. This technique is shown to provide comparable security to the Advanced Encryption Standard (AES) with a 232-bit key size through a high number of possible permutations in the GWFRFT parameters and orders.
The document discusses various image transforms. It begins by explaining why transforms are used, such as for fast computation and obtaining conceptual insights. It then introduces image transforms as unitary matrices that represent images using a discrete set of basis images. It proceeds to describe one-dimensional orthogonal and unitary transforms using matrices. It also discusses separable two-dimensional transforms and provides properties of unitary transforms such as energy conservation. Specific transforms discussed in more detail include the discrete Fourier transform, discrete cosine transform, discrete sine transform, and Hadamard transform.
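The energy-conservation property mentioned above is Parseval's relation: for a unitary transform U, the norm of Ux equals the norm of x. With the 1/√n-scaled (unitary) DFT matrix, it can be checked in a few lines:

```python
import numpy as np

# Build the unitary DFT matrix: F[j, k] = exp(-2*pi*i*j*k/n) / sqrt(n).
# The 1/sqrt(n) scaling is what makes F unitary rather than just invertible.
n = 16
j, k = np.meshgrid(np.arange(n), np.arange(n))
F = np.exp(-2j * np.pi * j * k / n) / np.sqrt(n)

assert np.allclose(F @ F.conj().T, np.eye(n))      # unitarity: F F* = I

x = np.random.default_rng(4).normal(size=n)
assert np.isclose(np.linalg.norm(F @ x), np.linalg.norm(x))  # Parseval
```

The same check works for any of the transforms the document lists (DCT, DST, Hadamard) once they are written with their orthonormal scaling.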
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) is an open access international journal that provides rapid publication (within a month) of articles in all areas of electronics and communication engineering and its applications. The journal welcomes publication of high quality papers on theoretical developments and practical applications in electronics and communication engineering. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
Iaetsd implementation of power efficient iterative logarithmic multiplier usi... (Iaetsd Iaetsd)
This document describes the design and implementation of a power efficient iterative logarithmic multiplier using Mitchell's algorithm and reversible logic. It involves converting multiplication to addition using logarithmic numbers. The proposed design implements a basic block consisting of leading one detectors, encoders, barrel shifters and a decoder to calculate an approximate product. Error correction circuits are then cascaded with the basic blocks to improve accuracy. The 4x4 reversible logarithmic multiplier is designed and simulated using Xilinx tools, demonstrating lower power consumption through the use of reversible logic.
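The core of Mitchell's algorithm, approximating each log from the leading-one position plus the remaining fraction, fits in a few lines of Python. This is a behavioral sketch of the approximation only, not the reversible-logic hardware or the paper's error-correction stages:

```python
# Mitchell's approximation: write N = 2^k (1 + f) with 0 <= f < 1,
# approximate log2(N) ~ k + f, add the two logs, take the antilog.
# In hardware, k comes from a leading-one detector and the shifts
# from barrel shifters.
def mitchell_mul(a, b):
    ka, kb = a.bit_length() - 1, b.bit_length() - 1   # leading-one positions
    fa = a / (1 << ka) - 1                            # fractional parts
    fb = b / (1 << kb) - 1
    s = fa + fb
    if s < 1:                                         # antilog, two cases
        return (1 << (ka + kb)) * (1 + s)
    return (1 << (ka + kb + 1)) * s

approx = mitchell_mul(100, 200)
exact = 100 * 200
assert approx <= exact                  # Mitchell always underestimates
assert (exact - approx) / exact < 0.12  # worst-case error is about 11%
```

The cascaded error-correction blocks in the paper iterate this scheme on the residue to shrink that worst-case error.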
This document discusses the design of a parallel BCH encoder for satellite transmitters. Key points:
1) It proposes a new parallel algorithm for BCH encoding to increase throughput while meeting ASIC requirements for space systems.
2) The algorithm models BCH encoding as a linear system and exploits regularities in the state transition matrix to parallelize encoding.
3) A prototype parallel BCH encoder was designed and integrated with an LDPC encoder. Lab tests showed the modulator achieved low error vector magnitude at transmission rates up to 30 MBaud.
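The linear-system view in point 2 can be sketched in miniature: a shift-register encoder updates as s' = A·s + B·u over GF(2), so w input bits can be absorbed per step using precomputed lifted matrices A^w and [A^(w-1)B ... AB B]. The tap polynomial below is a hypothetical stand-in, not the paper's BCH generator:

```python
import numpy as np

def lfsr_matrices(taps, r):
    # Companion-style state-transition matrix A and input matrix B.
    A = np.zeros((r, r), dtype=int)
    A[0, :] = taps                        # feedback row
    A[1:, :-1] = np.eye(r - 1, dtype=int) # shift the register contents
    B = np.zeros((r, 1), dtype=int)
    B[0, 0] = 1                           # input bit enters via feedback
    return A, B

def run_serial(A, B, bits):
    s = np.zeros((A.shape[0], 1), dtype=int)
    for u in bits:                        # one bit per "clock"
        s = (A @ s + B * u) % 2
    return s

def run_blocked(A, B, bits, w):
    r = A.shape[0]
    Aw = np.eye(r, dtype=int)
    cols = []
    for _ in range(w):                    # build A^w and [A^(w-1)B ... B]
        cols.insert(0, Aw @ B % 2)
        Aw = Aw @ A % 2
    Bw = np.hstack(cols)
    s = np.zeros((r, 1), dtype=int)
    for i in range(0, len(bits), w):      # absorb w bits per "clock"
        u = np.array(bits[i:i + w]).reshape(-1, 1)
        s = (Aw @ s + Bw @ u) % 2
    return s

A, B = lfsr_matrices([1, 0, 0, 1], 4)     # hypothetical tap pattern
bits = [1, 0, 1, 1, 0, 1, 0, 0]
assert (run_serial(A, B, bits) == run_blocked(A, B, bits, 4)).all()
```

The blocked version reaches the same final state while clocking w times less often, which is the throughput gain the paper engineers into silicon.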
An approach to incentive based reputation for communities of web services (Babak Khosravifar)
This document presents an approach for incentive-based reputation modeling for communities of web services. It proposes a reputation model that uses and combines metrics like responsiveness, demand, and satisfaction. It also describes a logging mechanism that addresses fake positive and negative ratings through adjustments. Experimental results show the community's quality of service improves over successive runs as the model provides incentives to report reputation accurately. The contributions include a reputation assessment protocol for web service communities and an analysis of incentives. Future work involves further comparison of communities against single services and additional investigation of incentives.
All Pair Shortest Path Algorithm – Parallel Implementation and Analysis (Inderjeet Singh)
This project report discusses the parallel implementation of the all pair shortest path algorithm using MPI and OpenMP. The algorithm was implemented by decomposing the adjacency matrix row-wise across processes. Results show that the parallel algorithm achieves speedup over the sequential version, especially for large graph sizes. The MPI implementation performed better than the OpenMP version. Graphs in the report compare execution time, speedup, and efficiency for different problem sizes and numbers of processes/threads.
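The usual sequential baseline for all-pairs shortest paths is Floyd-Warshall, whose row-wise access pattern is what makes the row-wise decomposition natural: at step k, each process needs only its own rows plus row k. A minimal sequential sketch (the report's exact algorithm choice is assumed here):

```python
# Floyd-Warshall: after iteration k, dist[i][j] is the shortest i -> j
# path using only intermediate vertices 0..k. In the MPI version, each
# process owns a block of rows and row k is broadcast at step k.
INF = float("inf")

def floyd_warshall(adj):
    n = len(adj)
    dist = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist

g = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
print(floyd_warshall(g)[0][2])  # shortest 0 -> 2 path: 3 + 2 = 5
```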
Alpine Data Labs presents a deep dive into its implementation of multinomial logistic regression with Apache Spark. Machine Learning Engineer DB Tsai takes us through the technical implementation details step by step. First, he explains how state-of-the-art machine learning on Hadoop is not fulfilling the promise of Big Data. Next, he explains how Spark is a perfect match for machine learning through its in-memory caching capability, demonstrating a 100x performance improvement. Third, he walks through each aspect of multinomial logistic regression and how it is developed with Spark APIs. Fourth, he demonstrates an extension of MLOR and its training parameters. Fifth, he benchmarks MLOR with 11M rows, 123 features, and 11% non-zero elements on a 5-node Hadoop cluster. Finally, he shows Alpine's unique visual environment with Spark and verifies the performance with the job tracker. In conclusion, Alpine supports the state-of-the-art Cloudera and Pivotal Hadoop clusters and performs at a level that far exceeds its next-nearest competitor.
Multinomial Logistic Regression with Apache Spark (DB Tsai)
Logistic regression can be used to model not only binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will cover the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training examples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications, from document classification to computational linguistics, are of this type. He will talk about how to address this problem with an L-BFGS optimizer instead of a Newton optimizer.
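The binary-to-multinomial extension can be sketched with plain batch gradient descent on synthetic data. This is NumPy only, not the Spark/MLlib implementation, and it uses a simple fixed step rather than L-BFGS; the point is that replacing the sigmoid with a softmax keeps the gradient in the familiar (prediction - label) form, which is what makes each iteration a sum over data partitions:

```python
import numpy as np

# Synthetic 3-class data: Gaussian clusters along the diagonal.
rng = np.random.default_rng(5)
n, d, K = 300, 2, 3
labels = rng.integers(0, K, size=n)
X = rng.normal(size=(n, d)) + 4.0 * labels[:, None]
Xb = np.c_[X, np.ones(n)]                        # append bias column
Y = np.eye(K)[labels]                            # one-hot targets

W = np.zeros((d + 1, K))
for _ in range(1000):
    Z = Xb @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True)) # stabilized softmax
    P /= P.sum(axis=1, keepdims=True)
    W -= 0.05 / n * (Xb.T @ (P - Y))             # gradient step

acc = (np.argmax(Xb @ W, axis=1) == labels).mean()
assert acc > 0.9
```

In a distributed setting, the `Xb.T @ (P - Y)` term is exactly what each partition computes locally before a sum-reduce, whether the outer loop is gradient descent, Newton, or L-BFGS.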
Bio:
DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...cscpconf
A large family of signal processing techniques consist of Fourier-transforming a signal,manipulating the Fourier-transformed data in a simple way, and reversing the transformation.We widely use Fourier frequency analysis in equalization of audio recordings, X-ray crystallography, artefact removal in Neurological signal and image processing, Voice Activity Detection in Brain stem speech evoked potentials, speech processing spectrograms are used to identify phonetic sounds and so on. Discrete Fourier Transform (DFT) is a principal mathematical method for the frequency analysis. The way of splitting the DFT gives out various fast algorithms. In this paper, we present the implementation of two fast algorithms for the DFT for evaluating their performance. One of them is the popular radix-2 Cooley-Tukey fast Fourier transform algorithm (FFT) [1] and the other one is the Grigoryan FFT based on the splitting by the paired transform [2]. We evaluate the performance of these algorithms by implementing
them on the Xilinx Virtex-II pro [3] and Virtex-5 [4] FPGAs, by developing our own FFT processor architectures. Finally we show that the Grigoryan FFT is working fatser than
Cooley-Tukey FFT, consequently it is useful for higher sampling rates. Operating at higher
sampling rates is a challenge in DSP applications
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...csandit
A large family of signal processing techniques consist of Fourier-transforming a signal,
manipulating the Fourier-transformed data in a simple way, and reversing the transformation.
We widely use Fourier frequency analysis in equalization of audio recordings, X-ray
crystallography, artefact removal in Neurological signal and image processing, Voice Activity
Detection in Brain stem speech evoked potentials, speech processing spectrograms are used to
identify phonetic sounds and so on. Discrete Fourier Transform (DFT) is a principal
mathematical method for the frequency analysis. The way of splitting the DFT gives out various
fast algorithms. In this paper, we present the implementation of two fast algorithms for the DFT
for evaluating their performance. One of them is the popular radix-2 Cooley-Tukey fast Fourier
transform algorithm (FFT) [1] and the other one is the Grigoryan FFT based on the splitting by
the paired transform [2]. We evaluate the performance of these algorithms by implementing
them on the Xilinx Virtex-II pro [3] and Virtex-5 [4] FPGAs, by developing our own FFT
processor architectures. Finally we show that the Grigoryan FFT is working fatser than
Cooley-Tukey FFT, consequently it is useful for higher sampling rates. Operating at higher
sampling rates is a challenge in DSP applications.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Optimization of distributed generation of renewable energy sources by intelli...Beniamino Murgante
Optimization of distributed generation of renewable energy sources by intelligent techniques
Marcello Pucci – Institute for Studies on Intelligent Systems for Automation (I.S.S.I.A), National Research Council, Palermo (Italy)
Intelligent Analysis of Environmental Data (S4 ENVISA Workshop 2009)
The document provides an overview and review of topics related to tracking and filtering fundamentals, including:
- Linear algebra and linear systems, probability, hypothesis testing, and state estimation.
- Linear and non-linear filtering, multiple model filtering, track maintenance, data association techniques, and activity control.
- Mathematics topics like linear algebra, probability, estimation, vector/matrix properties, and state-space representations are reviewed for continuous and discrete time systems. Concepts include the Jacobian, gradient, Dirac delta function, and observability criteria.
This document describes the POTFIT algorithm for approximating multi-dimensional arrays as products of lower-dimensional matrices. It uses POTFIT to approximate a photo (represented as a 3D tensor of pixel color values) using single particle potentials. Approximating a dark photo requires fewer SPPs than a colorful photo, as errors are more obvious in colorful areas. The document shows approximations using different numbers of SPPs and the resulting file sizes.
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additi...Taiji Suzuki
The document discusses the aggregated estimator technique for sparse estimation. The aggregated estimator averages over multiple models, each weighted by their risk. This allows fast learning rates without strong assumptions on the design matrix. The technique is applied to sparse regression problems using an exponential screening estimator. The risk bound of this estimator is compared to other estimators like BIC and Lasso, showing it provides a tighter bound.
Solving Unit Commitment Problem Using Chemo-tactic PSO–DE Optimization Algori...IDES Editor
This paper presents Chemo-tactic PSO-DE
(CPSO-DE) optimization algorithm combined with
Lagrange Relaxation method (LR) for solving Unit
Commitment (UC) problem. The proposed approach
employs Chemo-tactic PSO-DE algorithm for optimal
settings of Lagrange multipliers. It provides high-quality
performance and reaches global solution and is a hybrid
heuristic algorithm based on Bacterial Foraging
Optimization (BFO), Particle Swarm Optimization (PSO)
and Differential Evolution (DE). The feasibility of the
proposed method is demonstrated for 10-unit, 20-unit,
and 40-unit systems respectively. The test results are
compared with those obtained by Lagrangian relaxation
(LR), genetic algorithm (GA), evolutionary programming
(EP), and genetic algorithm based on unit characteristic
classification (GAUC), enhanced adaptive Lagrangian
relaxation (ELR), integer-coded genetic algorithm
(ICGA) and hybrid particle swarm optimization (HPSO)
in terms of solution quality. Simulation results show that
the proposed method can provide a better solution.
This presentation begins with explaining the basic algorithms of machine learning and using the same concepts, discusses in detail 2 supervised learning/deep learning algorithms - Artificial neural nets and Convolutional Neural Nets. The relationship between Artificial neural nets and basic machine learning algorithms such as logistic regression and soft max is also explored. For hands on the implementation of ANN's and CNN's on MNIST dataset is also explained.
Skiena algorithm 2007 lecture18 application of dynamic programmingzukun
The document summarizes a lecture on applications of dynamic programming. It provides examples of how to use dynamic programming to solve problems involving string breaking, high density bar code encoding, dividing work evenly among workers, and the traveling salesman problem. Dynamic programming can be applied when problems exhibit the principle of optimality and the problem space can be broken down into overlapping subproblems that are stored in a table to avoid recomputing solutions.
1. The document discusses various image transforms including discrete cosine transform (DCT), discrete wavelet transform (DWT), and contourlet transform.
2. DCT transforms an image into frequency domain and organizes values based on human visual system importance. DWT analyzes images using wavelets of different scales and positions.
3. Contourlet transform is derived directly from discrete domain to capture smooth contours and edges at any orientation, decoupling multiscale and directional decompositions. It provides better efficiency than DWT for representing images.
This document provides an outline and summaries for a two-day MATLAB workshop presented by Bhavesh Shah from 27-28 September 2012. The workshop will cover introductory topics such as what MATLAB is, the MATLAB screen interface, variables, arrays, matrices, built-in math functions, control structures, and toolboxes. It will also discuss more advanced topics like writing user-defined functions, neural networks, GUIs, and image processing. The goal is to introduce participants to the basics of using MATLAB for technical computing, modeling, simulation, and data analysis.
This summary provides the key details from the document in 3 sentences:
The document proposes a new method for encrypting two images into a single encrypted image using generalized weighted fractional Fourier transform (GWFRFT) with double random phase encoding. The encryption process involves applying pixel scrambling, phase encoding, and two rounds of GWFRFT with random phase masks on the combined image signal. This technique is shown to provide comparable security to the Advanced Encryption Standard (AES) with a 232-bit key size through a high number of possible permutations in the GWFRFT parameters and orders.
The document discusses various image transforms. It begins by explaining why transforms are used, such as for fast computation and obtaining conceptual insights. It then introduces image transforms as unitary matrices that represent images using a discrete set of basis images. It proceeds to describe one-dimensional orthogonal and unitary transforms using matrices. It also discusses separable two-dimensional transforms and provides properties of unitary transforms such as energy conservation. Specific transforms discussed in more detail include the discrete Fourier transform, discrete cosine transform, discrete sine transform, and Hadamard transform.
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) is an open access international journal that provides rapid publication (within a month) of articles in all areas of electronics and communication engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in electronics and communication engineering. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
Iaetsd implementation of power efficient iterative logarithmic multiplier usi... (by Iaetsd Iaetsd)
This document describes the design and implementation of a power efficient iterative logarithmic multiplier using Mitchell's algorithm and reversible logic. It involves converting multiplication to addition using logarithmic numbers. The proposed design implements a basic block consisting of leading one detectors, encoders, barrel shifters and a decoder to calculate an approximate product. Error correction circuits are then cascaded with the basic blocks to improve accuracy. The 4x4 reversible logarithmic multiplier is designed and simulated using Xilinx tools, demonstrating lower power consumption through the use of reversible logic.
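The paper's hardware blocks have a simple software analogue. As a rough illustration only (function names are ours, not from the paper), the following Python sketch shows Mitchell's log-based approximate multiplication: the leading-one detector corresponds to `bit_length`, and the final antilog scaling corresponds to the barrel shifter.

```python
def mitchell_log2(n):
    # Leading-one detector: write n = 2**k * (1 + m) with 0 <= m < 1.
    k = n.bit_length() - 1
    m = (n - (1 << k)) / (1 << k)
    return k + m  # Mitchell's approximation: log2(n) ~ k + m

def mitchell_multiply(a, b):
    # Multiplication becomes addition in the log domain; an antilog
    # (a shift plus the fractional part) recovers the approximate product.
    s = mitchell_log2(a) + mitchell_log2(b)
    k, m = int(s), s - int(s)
    return (1 << k) * (1 + m)
```

The approximation is exact for powers of two and underestimates by up to about 11% otherwise, which is why the design cascades error-correction circuits after the basic blocks.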
This document discusses the design of a parallel BCH encoder for satellite transmitters. Key points:
1) It proposes a new parallel algorithm for BCH encoding to increase throughput while meeting ASIC requirements for space systems.
2) The algorithm models BCH encoding as a linear system and exploits regularities in the state transition matrix to parallelize encoding.
3) A prototype parallel BCH encoder was designed and integrated with an LDPC encoder. Lab tests showed the modulator achieved low error vector magnitude at transmission rates up to 30 MBaud.
An approach to incentive based reputation for communities of web services (by Babak Khosravifar)
This document presents an approach for incentive-based reputation modeling for communities of web services. It proposes a reputation model that uses metrics like responsiveness, demand, and satisfaction, and combines them. It also describes a logging mechanism to address fake positive and negative ratings through adjustments. Experimental results show the community's quality of service improves over runs as the model provides incentives to accurately report reputation. The contributions include a reputation assessment protocol for web service communities and an analysis of incentives. Future work involves further comparing communities to single services and additional incentive investigations.
All Pair Shortest Path Algorithm – Parallel Implementation and Analysis (by Inderjeet Singh)
This project report discusses the parallel implementation of the all pair shortest path algorithm using MPI and OpenMP. The algorithm was implemented by decomposing the adjacency matrix row-wise across processes. Results show that the parallel algorithm achieves speedup over the sequential version, especially for large graph sizes. The MPI implementation performed better than the OpenMP version. Graphs in the report compare execution time, speedup, and efficiency for different problem sizes and numbers of processes/threads.
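The recurrence behind the report's algorithm is the classic Floyd–Warshall update. A minimal sequential Python sketch (ours, not from the report) shows why the row-wise decomposition parallelizes well: for a fixed pivot k, every row i can be updated independently once row k is available.

```python
INF = float("inf")

def floyd_warshall(dist):
    # dist: n x n matrix of edge weights (INF where no edge),
    # updated in place to all-pairs shortest path lengths.
    n = len(dist)
    for k in range(n):            # pivot loop: inherently sequential
        row_k = dist[k]
        for i in range(n):        # rows are independent -> split across processes
            d_ik = dist[i][k]
            if d_ik == INF:
                continue
            row_i = dist[i]
            for j in range(n):
                if d_ik + row_k[j] < row_i[j]:
                    row_i[j] = d_ik + row_k[j]
    return dist
```

In an MPI version each process owns a contiguous block of rows and the owner of row k broadcasts it at the start of iteration k, which matches the row-wise adjacency-matrix decomposition the report describes.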
Alpine Data Labs presents a deep dive into our implementation of Multinomial Logistic Regression with Apache Spark. Machine Learning Engineer DB Tsai takes us through the technical implementation details step by step. First, he explains how the current state of machine learning on Hadoop is not fulfilling the promise of Big Data. Next, he explains how Spark is a perfect match for machine learning through its in-memory caching capability, demonstrating a 100x performance improvement. Third, he takes us through each aspect of a multinomial logistic regression and how it is developed with the Spark APIs. Fourth, he demonstrates an extension of MLOR and its training parameters. Fifth, he benchmarks MLOR with 11M rows, 123 features, and 11% non-zero elements on a 5-node Hadoop cluster. Finally, he shows Alpine's unique visual environment with Spark and verifies the performance with the job tracker. In conclusion, Alpine supports the state-of-the-art Cloudera and Pivotal Hadoop clusters and performs at a level that far exceeds its nearest competitor.
Multinomial Logistic Regression with Apache Spark (by DB Tsai)
Logistic regression can be used not only for modeling binary outcomes but also multinomial outcomes, with some extension. In this talk, DB will explain the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training examples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications, from document classification to computational linguistics, are of this type. He will discuss how to address this problem by using the L-BFGS optimizer instead of the Newton optimizer.
Bio:
DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
Predicting organic reaction outcomes with weisfeiler lehman network (by Kazuki Fujikawa)
This document discusses neural message passing networks for modeling quantum chemistry. It defines message passing networks as having message functions that compute messages from neighboring node states, vertex update functions that update node states based on accumulated messages, and a readout function that produces an output for the full graph. It provides examples of specific message, update, and readout functions used in existing message passing models like interaction networks and molecular graph convolutions.
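The three-function decomposition just described can be made concrete in a few lines. This is a toy scalar-state sketch of one message passing round (our illustration, not code from the paper); real models use learned neural networks for `message` and `update` and vector-valued states.

```python
def message_passing_step(adj, h, message, update):
    # adj: node -> list of neighbor nodes; h: node -> hidden state.
    # One round: each node aggregates messages from its neighbors,
    # then applies the vertex update function to its own state.
    msgs = {v: sum(message(h[v], h[w]) for w in adj[v]) for v in adj}
    return {v: update(h[v], msgs[v]) for v in adj}

# toy instance on a path graph 0-1-2: messages are neighbor states,
# the update adds the aggregate to the current state
h2 = message_passing_step(
    {0: [1], 1: [0, 2], 2: [1]},
    {0: 1, 1: 2, 2: 3},
    message=lambda hv, hw: hw,
    update=lambda hv, m: hv + m,
)
```

A readout function would then pool `h2` over all nodes (e.g. a sum) to produce the graph-level output.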
Processor Allocation and Task Scheduling of Matrix Chain Products
on Parallel Systems(Parallelizing matrix chain products)
Heejo Lee, Jong Kim, Sungje Hong, Sunggu Lee
Dept of Computer Science and Engineering Pohang University of
Science and Technology, Korea
This document discusses 2-D plotting in MATLAB. It introduces 2-D plotting and its uses, including data analysis and visualization. It provides example code for plotting two functions simultaneously; the output shows the variation of the quantities over time. Applications of 2-D plotting include building custom interfaces, improving code quality, and integrating algorithms with other languages and applications like Excel.
This document provides an introduction and overview of MATLAB (Matrix Laboratory). It outlines key topics that will be covered, including what MATLAB is, the MATLAB screen and workspace, variables and arrays, built-in math functions, flow control, and toolboxes. The document also lists some common commands and functions in MATLAB. It is intended to familiarize readers with the basics of MATLAB through examples and explanations of its main capabilities and features.
The document proposes the Layered Spiral Algorithm (LSA) for memory-aware application mapping and scheduling onto Network-on-Chip (NoC) architectures. LSA extends the existing spiral mapping algorithm to consider memory constraints and task scheduling. It models applications as Memory-Aware Communication Task Graphs (MACTG) and platforms as Platform Architecture Graphs (PAG). LSA aims to minimize energy consumption during mapping and scheduling while maintaining high parallelism. It compares results to optimal solutions from a Mixed Integer Linear Programming (MILP) formulation to evaluate performance.
This document presents a new approach to analyzing the robustness of the relative gain array (RGA) for uncertain systems. It derives bounds on the RGA elements for a 2x2 uncertain system and provides sufficient conditions to determine if the plant remains non-singular over the uncertainty set. An example is provided to illustrate the bounds on the magnitude and phase of the RGA in the frequency domain for an uncertain system. The analysis of the RGA's robustness to uncertainties can help assess decisions made based on the nominal plant model.
This document provides an overview of various scientific programming models for distributed computing. It introduces reference parallel programming models like MPI and OpenMP, and discusses their strengths and weaknesses. Novel programming models are also covered, such as Microsoft Dryad, MapReduce, and COMP Superscalar (COMPSs). The document concludes that while scientific problems are complex, reference models are often unsuitable, leading to new flexible models that aim to simplify programming workflows for distributed systems.
Some Engg. Applications of Matrices and Partial Derivatives (by SanjaySingh011996)
This document contains a submission by three students to Dr. Sona Raj Mam regarding partial differentiation, matrices and determinants, and eigenvectors and eigenvalues. It provides examples of how these mathematical concepts are applied in fields like engineering. Partial differentiation is used in economics to analyze demand and in image processing for edge detection. Matrices and determinants allow representing linear transformations in graphics software. Eigenvalues and eigenvectors have applications in areas like computer science, smartphone apps, and modeling structures in civil engineering. The document also provides real-world examples and references textbooks and websites for further information.
This document discusses dynamic programming techniques. It covers matrix chain multiplication and all pairs shortest paths problems. Dynamic programming involves breaking down problems into overlapping subproblems and storing the results of already solved subproblems to avoid recomputing them. It has four main steps - defining a mathematical notation for subproblems, proving optimal substructure, deriving a recurrence relation, and developing an algorithm using the relation.
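For matrix chain multiplication, the four steps just listed play out as follows: let m[i][j] denote the minimum number of scalar multiplications needed for matrices i..j; optimal substructure yields the recurrence m[i][j] = min over k of m[i][k] + m[k+1][j] + d(i-1)·d(k)·d(j). A short Python sketch of the resulting algorithm (illustrative, not taken from the document):

```python
def matrix_chain_order(dims):
    # Matrix i has shape dims[i-1] x dims[i]; returns the minimum
    # number of scalar multiplications for the whole chain.
    n = len(dims) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]  # m[i][j], 1-indexed
    for length in range(2, n + 1):             # subproblem size
        for i in range(1, n - length + 2):
            j = i + length - 1
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                for k in range(i, j)           # split point
            )
    return m[1][n]
```

For dims = [10, 30, 5, 60], parenthesizing as (A1·A2)·A3 costs 10·30·5 + 10·5·60 = 4500, while A1·(A2·A3) costs 27000; the table recovers the cheaper order.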
Designing Architecture-aware Library using Boost.Proto (by Joel Falcou)
This document discusses designing architecture-aware libraries using Boost.Proto. It describes how the NT2 scientific computing library was redesigned using Boost.Proto to make it more extensible and able to better support new hardware architectures. The redesign segmented the evaluation of expressions into phases. Boost.Proto transforms are used in each phase to advance code generation. Hardware specifications influence function overloads through generalized tag dispatching, allowing the best function implementation to be selected for a given hardware architecture. This makes it possible to more easily add support for new optimization schemes and hardware targets to the library.
Parallel Evaluation of Multi-Semi-Joins (by Jonny Daenen)
Presentation given at VLDB 2016: the 42nd International Conference on Very Large Data Bases.
Paper: http://dx.doi.org/10.14778/2977797.2977800
ArXiv: https://arxiv.org/abs/1605.05219
Poster: https://zenodo.org/record/61653 (doi 10.5281/zenodo.61653)
Gumbo Software: https://github.com/JonnyDaenen/Gumbo
Abstract
While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans while not always significantly improving the net running time (aka wall-clock time) of queries. In this work, we provide algorithms for parallel evaluation of SGF queries in MapReduce that optimize total time, while retaining low net time. Not only can SGF queries specify all semi-join reducers, but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outvalue sequential plans w.r.t. net time and provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the latter optimizations are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation Gumbo on top of Hadoop, confirm the usefulness of parallel query plans, and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
We present Graph Convolutional Networks that, unlike classic DL models, allow supervised learning by exploiting both each node's features and its relationships with the other nodes in the network.
This document describes a proposed modular multiplication algorithm that divides the computation into two steps:
1) A multiplication step that uses Toom-Cook multiplication to split the inputs into five parts
2) A modular multiplication step that uses Barrett and Montgomery modular multiplication algorithms in parallel to compute the results of the five parts from the first step.
The algorithm is designed to minimize the number of single-precision multiplications and enable more than three-way parallel computation, improving efficiency over other modular multiplication methods.
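The document's five-way Toom-Cook split is not reproduced here, but the Montgomery half of its parallel modular step can be made concrete. The following Python REDC routine is the textbook Montgomery reduction, shown only as a sketch of what "Montgomery modular multiplication" computes; the variable names are ours.

```python
def montgomery_reduce(t, n, n_prime, r_bits):
    # REDC: returns t * R^(-1) mod n for R = 2**r_bits, assuming
    # 0 <= t < n*R, n odd, and n_prime = -n^(-1) mod R.
    r_mask = (1 << r_bits) - 1
    m = ((t & r_mask) * n_prime) & r_mask   # make t + m*n divisible by R
    u = (t + m * n) >> r_bits               # exact shift instead of division
    return u - n if u >= n else u

# One modular multiplication a*b mod n via the Montgomery domain
n, r_bits = 97, 8
R = 1 << r_bits
n_prime = (-pow(n, -1, R)) % R              # requires Python 3.8+ modular inverse
a, b = 5, 6
a_mont, b_mont = a * R % n, b * R % n       # enter the Montgomery domain
prod = montgomery_reduce(a_mont * b_mont, n, n_prime, r_bits)   # = a*b*R mod n
result = montgomery_reduce(prod, n, n_prime, r_bits)            # leave the domain
```

The appeal for the proposed design is that REDC replaces the trial division of a modular reduction with shifts, masks, and single-precision multiplications, which is what makes it attractive to run in parallel with Barrett reduction.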
This document describes research using a 6-node supercomputer made of Raspberry Pi boards to calculate Dedekind numbers in parallel. The researchers implemented a parallel version of an existing algorithm to compute Dedekind numbers by dividing the workload across the 6 nodes. They present results showing the parallel implementation provides significant speedup over running the algorithm on a single node, though the Raspberry Pi hardware is less powerful than desktop computers.
MATLAB is an interactive development environment and programming language used by engineers and scientists for technical computing, data analysis, and algorithm development. It allows users to access data from files, web services, applications, hardware, and databases, and perform data analysis and visualization. MATLAB can be used for applications in areas like control systems, signal processing, communications, and more.
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is... (by Preferred Networks)
This presentation explains basic ideas of graph neural networks (GNNs) and their common applications. Primary target audiences are students, engineers and researchers who are new to GNNs but interested in using GNNs for their projects. This is a modified version of the course material for a special lecture on Data Science at Nara Institute of Science and Technology (NAIST), given by Preferred Networks researcher Katsuhiko Ishiguro, PhD.
Similar to Accelerating Machine Learning Algorithms by integrating GPUs into MapReduce Clusters (20)
Removing Uninteresting Bytes in Software Fuzzing (by Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU (by panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new type of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also practices that can lead to unnecessary expenses, for example using a person document instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and the know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices you can apply immediately
Things to Consider When Choosing a Website Developer for your Website | FODUU (by FODUU)
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
Infrastructure Challenges in Scaling RAG with Custom AI models (by Zilliz)
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (by akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf (by Techgropse Pvt.Ltd.)
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
Best 20 SEO Techniques To Improve Website Visibility In SERP (by Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Fueling AI with Great Data with Airbyte Webinar (by Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (by Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Climate Impact of Software Testing at Nordic Testing Days (by Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (by shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Generating privacy-protected synthetic data using Secludy and Milvus (by Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
OpenID AuthZEN Interop Read Out - Authorization (by David Brossard)
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
GraphRAG for Life Science to increase LLM accuracy (by Tomaz Bratanic)
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (by SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Accelerating Machine Learning Algorithms by integrating GPUs into MapReduce Clusters
1. ACCELERATING MACHINE LEARNING ALGORITHMS BY INTEGRATING
GPUS INTO MAPREDUCE CLUSTERS
Sergio Herrero-Lopez
Intelligent Engineering Systems Laboratory (IESL)
November 30, 2011
1 Accelerating ML algorithms by integrating GPUs in MR Clusters
2. INTRODUCTION
ABOUT ME:
Ph.D (December 2011) at Massachusetts Institute of Technology (USA)
M.Sc (2007) and B.Sc (2005) in Electrical Engineering at University of Navarra (Spain)
Microsoft Research (Redmond WA, 2008), Tampere University of Technology (Finland, 2005) and IKUSI (Spain, 2003)
ABOUT PROF. WILLIAMS' RESEARCH GROUP (ENGINEERING SYSTEMS DIVISION):
High Performance Price Analytics for the Smart Grid (2008-2009)
Large-Scale Simulator for Global Data Infrastructure Optimization (2009-2011)
Music Event Detection from Tweets in New York (2010-2011)
Accelerating Machine Learning Algorithms by integrating GPUs into MapReduce Clusters
3. AGENDA
o PROBLEM STATEMENT: Big Data & Need for scale and/or speed
o PROPOSITION: Modify MapReduce runtime to
o Satisfy the particular requirements of ML algorithms
o Integrate Massively Parallel Processors in the system
o PREVIOUS WORK: MapReduce for ML on Multicore / Single-GPU / Multi-GPU / GPU-Cluster / FPGA
o IMPLEMENTATION of new MR runtime using Port abstractions
o PERFORMANCE results running SVMs on the proposed system
o CONCLUSIONS: Contributions and Limitations. Lessons learned
o FUTURE WORK
4. MACHINE LEARNING PARALLELIZATION
Data: { xi, yi }, i = 1…n, with xi ∈ R^d and yi ∈ Y = {1…k}
  n (representative sample) → 1. Does not fit in resources
  d (feature selection) → 2. Takes too long
  k (consolidate classes) → 3. Accuracy was sacrificed

Levels of parallelism:
  L1 Independent Runs: copies of Algorithm 1 on Workers X and Y (Cluster)
  L2 Summation Form: Algorithm 1 split across Workers X and Y (MapReduce)
  L3 Structural Parallelism: inside Algorithm 1 (MPPs)

Machine Learning Algorithms decomposable into MR primitives:
  Naïve Bayes, K-means, Expectation Maximization, Neural Network,
  Support Vector Machine Classification, Principal Component Analysis,
  Hidden Markov Models
5. MAPREDUCE PRIMITIVES & RUNTIME
Primitives:
  M : [k1, v1] → [k2, v2]
  R : [k2, { v2,i : k2,i = k2 }] → v3

Runtime (M map workers, N reduce workers):
  Input → Split → Map (Workers 1…M) → Sort → Reduce (Workers 1…N) → Merge → Output
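The two primitives can be mimicked in a few lines of Python. This is a toy, in-process sketch of the runtime pipeline (real frameworks add input splitting, a distributed shuffle between workers, and fault tolerance); the grouping step plays the role of Sort, and the job shown simply sums values per key.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (k1, v1) record emits a list of (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)   # Sort/group: collect values by k2
    # Reduce phase: each key's value list collapses to a single v3.
    return {k2: reduce_fn(k2, vs) for k2, vs in intermediate.items()}

# toy job: sum the values observed for each key
out = run_mapreduce(
    [("a", 1), ("b", 2), ("a", 3)],
    map_fn=lambda k, v: [(k, v)],
    reduce_fn=lambda k, vs: sum(vs),
)
```

The ML representations on the following slides all fit this shape; only `map_fn` and `reduce_fn` change.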
6. MAPREDUCE REPRESENTATION OF K-MEANS
Map (assign each point to the nearest mean):
  M : [ki^t, xi] → [ki'^t, xi]
  ki'^t = { xj : ||xj - mi^t|| ≤ ||xj - mi'^t|| ∀ i' = 1…k }

Reduce (recompute each mean from its assigned points):
  R : [k'^t, { xi : ki'^t = k'^t }] → mk'^{t+1}
  mk'^{t+1} = ( 1 / |{ xi : ki'^t = k'^t }| ) · Σ_{x ∈ { xi : ki'^t = k'^t }} x
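A minimal Python sketch of one such K-means iteration (our illustration only; a real job would run the Map over partitions of the static data and one Reduce per cluster key):

```python
def kmeans_iteration(points, means):
    # Map: emit (nearest-mean key k_i', x_i) for every point x_i.
    assignments = {}
    for x in points:
        k = min(range(len(means)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))
        assignments.setdefault(k, []).append(x)
    # Reduce: per key, average the assigned points into m_k^{t+1}.
    new_means = list(means)            # clusters with no points keep their mean
    for k, xs in assignments.items():
        dim = len(xs[0])
        new_means[k] = tuple(sum(x[c] for x in xs) / len(xs) for c in range(dim))
    return new_means
```

Note that the points are the static data and the means are the small variable data, exactly the split motivating the wishlist on slide 9.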
7. MAPREDUCE REPRESENTATION OF EM FOR MIXTURE OF GAUSSIANS
Map (compute responsibilities):
  M : [(i, k), xi] → [(i, k), p_{i,k}]
  p_{i,k} = α_k^t f(xi | μ_k^t, Σ_k^t) / Σ_{k'=1…K} α_{k'}^t f(xi | μ_{k'}^t, Σ_{k'}^t)

Reduce (update the mixture weights):
  R : [k, { p_{i,k'} : k' = k }] → α_k^{t+1}
  α_k^{t+1} = ( Σ_{i=1…n} p_{i,k} ) / n

Reduce (update the means):
  R : [k, { xi, p_{i,k'} : k' = k }] → μ_k^{t+1}
  μ_k^{t+1} = ( Σ_{i=1…n} xi · p_{i,k} ) / ( n · α_k^{t+1} )

Reduce (update the covariances):
  R : [k, { xi, p_{i,k'} : k' = k }] → Σ_k^{t+1}
  Σ_k^{t+1} = ( Σ_{i=1…n} p_{i,k} (xi - μ_k^{t+1})(xi - μ_k^{t+1})^T ) / ( n · α_k^{t+1} )
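The same Map/Reduce structure can be sketched in Python for a 1-D Gaussian mixture (a deliberate simplification of the slide's multivariate case; the code is ours, shown only to make the three keyed Reduces concrete):

```python
import math

def em_iteration(xs, alphas, mus, sigmas):
    # Map: responsibilities p_{i,k} for every (point, component) pair.
    def pdf(x, mu, s2):
        return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    n, K = len(xs), len(alphas)
    p = [[alphas[k] * pdf(x, mus[k], sigmas[k]) for k in range(K)] for x in xs]
    p = [[row[k] / sum(row) for k in range(K)] for row in p]
    # Reduces keyed by component k: weight, mean, and variance updates.
    new_alphas = [sum(p[i][k] for i in range(n)) / n for k in range(K)]
    new_mus = [sum(xs[i] * p[i][k] for i in range(n)) / (n * new_alphas[k])
               for k in range(K)]
    new_sigmas = [sum(p[i][k] * (xs[i] - new_mus[k]) ** 2 for i in range(n))
                  / (n * new_alphas[k]) for k in range(K)]
    return new_alphas, new_mus, new_sigmas
```

As in K-means, the points are static across iterations while (α, μ, Σ) are the small variable data carried between them.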
8. MAPREDUCE REPRESENTATION OF SVM (SMO)
Map (update the optimality indicators):
  M : [i, fi] → [i, fi']
  fi' = fi + Δα_{Iup} y_{Iup} k(x_{Iup}, xi) + Δα_{Ilow} y_{Ilow} k(x_{Ilow}, xi)

Map (classify each index into the up/low sets):
  M : [i, αi] → [i, ki]
  I0 = { i : yi ∈ {1, -1}, 0 < αi < C }
  I1 = { i : yi = 1, αi = 0 } ∪ { i : yi = -1, αi = C }
  I2 = { i : yi = 1, αi = C } ∪ { i : yi = -1, αi = 0 }
  kup = { i ∈ I0 ∪ I1 }, klow = { i ∈ I0 ∪ I2 }, ki ∈ {kup, klow}

Reduce (select the most violating pair):
  R : [k, { fi : ki = k }] → (b, I)
  bup = min{ fi : ki = kup },  Iup = argmin_{ki = kup} fi
  blow = max{ fi : ki = klow },  Ilow = argmax_{ki = klow} fi

Map (update the two multipliers):
  M : [i, αi] → [i, αi']
  α'_{Iup} = α_{Iup} - y_{Iup}(f_{Ilow} - f_{Iup}) / ( 2k(x_{Ilow}, x_{Iup}) - k(x_{Ilow}, x_{Ilow}) - k(x_{Iup}, x_{Iup}) )
  α'_{Ilow} = α_{Ilow} + y_{Ilow} y_{Iup} (α_{Iup} - α'_{Iup})
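The Reduce in this decomposition is just two extrema searches keyed by kup/klow. A tiny Python sketch of that step (names are ours):

```python
def svm_select_pair(f, k):
    # f: index -> f_i; k: index -> "up" or "low" (output of the second Map).
    # Returns (b_up, I_up) and (b_low, I_low) for the next SMO update.
    up = [i for i in f if k[i] == "up"]
    low = [i for i in f if k[i] == "low"]
    I_up = min(up, key=lambda i: f[i])     # argmin over the up set
    I_low = max(low, key=lambda i: f[i])   # argmax over the low set
    return (f[I_up], I_up), (f[I_low], I_low)
```

SMO stops when b_low no longer exceeds b_up by more than the tolerance; otherwise (I_up, I_low) is the most violating pair fed to the multiplier-update Map.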
9. MAPREDUCE FOR ML WISHLIST
o Static vs Variable data
  Static: largest, fixed, used in every iteration (xi, or (xi, yi))
  Variable: results of each iteration, consumed in the next iteration
  (mk for K-means; α_k^{t+1}, μ_k^{t+1}, Σ_k^{t+1} for EM; fi, αi for SVM)
o Iterate until convergence
  Avoid reloading static data between iterations
  Utilize the memory hierarchy (MEM) as opposed to DFS or LFS
o Massively Threaded MapReduce Tasks
  Map is embarrassingly parallel
  Reduce is highly parallelizable (CPU + MPP)
o Dimensionality & Algebra
  Map Tasks may encapsulate high-dimensional matrix-vector or matrix-matrix
  operations, e.g. the RBF kernel k(xi, xj) = e^{-b ||xi - xj||²},
  i = 1…n, j ∈ {Iup, Ilow}
  Interleave multithreaded BLAS operations using static data
  Sparse data structures
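The kernel rows mentioned above can indeed be phrased as one dense matrix product over the static data. A NumPy sketch of our own (NumPy dispatches the `@` product to a multithreaded BLAS GEMM, which is the kind of operation the slide proposes interleaving inside Map tasks):

```python
import numpy as np

def rbf_kernel_rows(X, idx, beta):
    # k(x_i, x_j) = exp(-beta * ||x_i - x_j||^2) for all i and all j in idx.
    # Squared distances come from one matrix-matrix product on the static X.
    Xj = X[idx]                                   # (m, d) selected rows
    sq = ((X ** 2).sum(1)[:, None]
          + (Xj ** 2).sum(1)[None, :]
          - 2.0 * (X @ Xj.T))                     # BLAS GEMM does the heavy lifting
    return np.exp(-beta * np.maximum(sq, 0.0))    # clamp tiny negative round-off
```

For SMO, `idx` would be the two working-set indices {Iup, Ilow}, so each iteration costs two kernel columns rather than the full Gram matrix.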
10. COMPUTING ECOSYSTEM
COMMODITY COMPUTING: relational DB, column DB, BigTable, Cassandra, Dynamo,
  Dryad, Hadoop, 1/10 Gb Ethernet
HIGH PERFORMANCE / SUPERCOMPUTING: InfiniBand, OpenMPI, GPUs, FPGAs
DATA APPLIANCE / WAREHOUSE COMPUTING: Hadoop, 20 Gb InfiniBand, SSDs
11. MAPREDUCE CLUSTER: ARCHITECTURE
A client submits a file and a job; the NameNode and JobTracker coordinate DataNodes, each of which stores replicated blocks and runs an MRF TaskTracker executing tasks.

1) Distributed File System (DFS)
- Unstructured data
- Scales to thousands of nodes
- High reliability through replication

2) MapReduce Framework (MRF) runtime
- Batch processing system
- Load balancing

[Diagram: NameNode/JobTracker over DataNodes 1–3, each holding DFS blocks and running tasks under a TaskTracker.]
12. MAPREDUCE CLUSTER: LIMITATIONS
- One (or two) tasks per node: one task per data block
- One core, one thread per Map/Reduce task
- Synchronization by materialization of intermediate results (Map output is written to local disk blocks before Reduce consumes it)
- No support for iterative jobs

[Diagram: two DataNodes; each Map task reads a DFS block, materializes output to an HD block, and each Reduce task writes results back to a DFS block.]
13. MASSIVELY PARALLEL PROCESSORS: NVIDIA TESLA ARCHITECTURE
Device: N stream multiprocessors; each has M scalar processors (SPs), registers, shared memory, an instruction unit, a constant cache, and a texture cache.

Memory access costs:
- Registers: 0 cycles
- Shared memory: 1 cycle coalesced, ~10 cycles uncoalesced
- Constant cache / texture cache: ~10 cycles on a cache hit
- Constant memory, texture memory, device memory: ~400 cycles, 102 GB/s
- Host memory ↔ device memory: PCI-E 16x (8 GB/s)
14. NVIDIA TESLA: REPRESENTATIONS
Logical → physical mapping:
- Thread → Processor
- Block → Multiprocessor (registers, shared memory, constant and texture caches); maximum block dimensions (512, 512, 64), but at most 512 threads per block
- Grid → Device; maximum grid dimensions (65535, 65535)
15. PROPOSED RUNTIME: MR + GPU
Per node (DFS blocks + MRF TaskTracker), each iteration:
1. Split DFS blocks into host memory (HMem) alongside the host state (HState)
2. H→D transfers into device memory and device state (DMem, DState)
3. Pre-Map BLAS → GPU Map → Post-Map → D→H transfers
4. Cross-node Sort (through HState/HMem)
5. H→D transfers → Pre-Reduce BLAS → local GPU Reduce → D→H transfers → Post-Reduce
6. Cross-node global Reduce
7. State snapshot to the DFS every x iterations
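The control flow of this stateful, iterative runtime can be sketched as a small driver. This is a hypothetical skeleton, not the actual runtime API: static data is loaded once, variable state flows between iterations, and snapshots stand in for the DFS checkpoint.

```python
def run_iterative_job(static_data, state, map_fn, reduce_fn,
                      converged, snapshot_every=5, snapshots=None):
    snapshots = snapshots if snapshots is not None else {}
    it = 0
    while not converged(state):
        # Map over the static data with the current variable state.
        pairs = [kv for rec in static_data for kv in map_fn(rec, state)]
        # "Sort": group intermediate pairs by key, in memory rather than on DFS.
        groups = {}
        for k, v in sorted(pairs):
            groups.setdefault(k, []).append(v)
        # Global reduce produces the next iteration's variable state.
        state = reduce_fn(groups, state)
        it += 1
        if it % snapshot_every == 0:
            snapshots[it] = state        # stand-in for a DFS state snapshot
    return state, it, snapshots

# Example: an iterative "mean" job whose state is (estimate, last delta);
# it stabilizes after two passes over the static data.
data = [1.0, 2.0, 3.0, 6.0]
final, iters, snaps = run_iterative_job(
    data,
    state=(0.0, float("inf")),
    map_fn=lambda rec, st: [("sum", rec)],
    reduce_fn=lambda g, st: (sum(g["sum"]) / len(g["sum"]),
                             abs(sum(g["sum"]) / len(g["sum"]) - st[0])),
    converged=lambda st: st[1] < 1e-9)
```

The point of the structure is that `static_data` never leaves memory across iterations, which is exactly what stock Hadoop (reloading blocks per job) cannot express.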
17. PREVIOUS WORK
MAPREDUCE ON SINGLE GPU / SINGLE FPGA
•Mars (He et al. PACT 2008)
•NVIDIA (Catanzaro et al. STMCS 2008)
•Cell (de Kruijf and Sankaralingam, IBM Journal R&D 2009)
MAPREDUCE ON MULTICORE (SHARED MEMORY)
•Phoenix (Ranger et al. HPCA 2007)
•Phoenix 2 (Yoo et al. IISWC 2009)
•Phoenix++ (Talbot et al. MAPREDUCE 2011)
MAPREDUCE ON MULTI-GPU / GPU CLUSTERS
•CellMR (Rafique et al. IPDPS 2009)
•GPMR (Stuart and Owens IPDPS 2011)
MAPREDUCE FOR MACHINE LEARNING
•Mahout (Apache)
•Multicore (Chu et al. NIPS 2006)
•FPGA (Xu NIPS 2009)
•Twister (Ekanayake et al. MAPREDUCE 2010)
•SystemML (Ghoting et al. ICDE 2011)
Recurring ideas: interleaved multithreaded BLAS; massively multithreaded MR tasks; fault-tolerance relaxation; intermediate data in memory; local/global reduction; long-running (iterative) tasks; static vs. variable data.
18. PORT-BASED PROGRAMMING: ABSTRACTION
Elements of the abstraction: a Message is posted to a Port; an Arbiter matches queued messages against receivers (single-item, multiple-item, join, choice); a Dispatcher with a dispatcher queue schedules the matched Handler tasks.
Handler scheduling modes: Concurrent, Exclusive. Coordination patterns: Scatter, Gather, Teardown. Handlers may carry State.
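A toy version of the port abstraction, using the standard library queue; the arbiter and receiver names are illustrative, and real port-based runtimes add join/choice receivers and concurrent/exclusive scheduling groups on top of this:

```python
import queue

class Port:
    def __init__(self):
        self._q = queue.Queue()
        self._handlers = []     # (n_items, handler): single/multi-item receivers

    def post(self, msg):
        # Messages posted to a port are queued until a receiver can fire.
        self._q.put(msg)

    def receive(self, handler, n_items=1):
        self._handlers.append((n_items, handler))

    def dispatch_once(self):
        # Arbiter: fire the first receiver whose item count is satisfied.
        for n_items, handler in self._handlers:
            if self._q.qsize() >= n_items:
                batch = [self._q.get() for _ in range(n_items)]
                handler(batch if n_items > 1 else batch[0])
                return True
        return False

results = []
port = Port()
port.receive(lambda pair: results.append(sum(pair)), n_items=2)  # multi-item
port.post(1)
port.post(2)
port.dispatch_once()
```

In a full runtime, `dispatch_once` would run on dispatcher threads rather than being called inline, which is what decouples message producers from handler execution.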
21. BINARY SVM
Binary classification: given l samples (x_1, y_1), …, (x_l, y_l) with x_i ∈ R^n, y_i ∈ Y ∀i, and Y = {−1, 1}, a binary classifier predicts the label y ∈ Y of an unseen sample x ∈ R^n.

[Figure: two classes separated by the learned decision function f*.]

RBF kernel: k(x_i, x_j) = e^(−β‖x_i − x_j‖²)
22. PRIMAL & DUAL FORM OF THE SVM
Find the function f that solves the following regularization problem:

    min_{f∈H} C Σ_{i=1}^l |1 − y_i f(x_i)|_+ + (1/2)‖f‖²,   where |k|_+ = max(k, 0) and C > 0

Then slack variables ξ_i are introduced to classify non-separable data:

Primal form:
    min_{f∈H} C Σ_{i=1}^l ξ_i + (1/2)‖f‖²
    subject to: y_i f(x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, l

Dual form:
    max_{α∈R^l} Σ_{i=1}^l α_i − (1/2) α^T K α
    subject to: Σ_{i=1}^l y_i α_i = 0,  0 ≤ α_i ≤ C,  i = 1, …, l
    where K_ij = y_i y_j k(x_i, x_j) and k is the kernel function

Solving the dual yields f(x) = Σ_{i=1}^l y_i α_i k(x, x_i) + b, where b is an unregularized bias term.
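The dual solution translates directly into code: evaluating f(x) only touches samples with α_i > 0 (the support vectors). A small NumPy sketch with illustrative data, multipliers, and bias (not taken from the experiments):

```python
import numpy as np

def rbf(a, b, beta):
    # k(x_i, x_j) = exp(-beta * ||x_i - x_j||^2)
    return np.exp(-beta * np.sum((a - b) ** 2))

def decision(x, X, y, alpha, b, beta):
    # f(x) = sum_i y_i * alpha_i * k(x, x_i) + b
    return sum(y[i] * alpha[i] * rbf(x, X[i], beta)
               for i in range(len(X))) + b

X = np.array([[0.0, 0.0], [2.0, 2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])    # only support vectors have alpha_i > 0
label = np.sign(decision(np.array([0.1, 0.0]), X, y, alpha, b=0.0, beta=1.0))
```

A query near the first support vector gets its label; in practice the sum is batched over all support vectors as one kernel-matrix-vector product.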
23. MULTICLASS CLASSIFICATION
Multiclass classification: given l samples (x_1, y_1), …, (x_l, y_l) with x_i ∈ R^n, y_i ∈ Y ∀i, and Y = {1, …, M}, a multiclass classifier predicts the label y ∈ Y of an unseen sample x ∈ R^n.

Multiclass SVM: a combination of N independent binary classification tasks. The binary tasks are defined by an output code matrix R of size M×N with R_ij ∈ {−1, 0, 1}. For M = 3:

    All vs All (AVA), N = M(M−1)/2:   R = [  1  1  0 ; −1  0  1 ;  0 −1 −1 ]
    One vs All (OVA), N = M:          R = [  1 −1 −1 ; −1  1 −1 ; −1 −1  1 ]
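The two code matrices generalize mechanically to any M. A sketch with hypothetical helper names (rows index classes, columns index binary tasks, 0 meaning the class is ignored by that task):

```python
from itertools import combinations

def ova_matrix(M):
    # One vs All: N = M tasks, task j is "class j vs the rest".
    return [[1 if i == j else -1 for j in range(M)] for i in range(M)]

def ava_matrix(M):
    # All vs All: N = M*(M-1)/2 tasks, one per unordered class pair (a, b);
    # class a gets +1, class b gets -1, every other class gets 0.
    pairs = list(combinations(range(M), 2))
    return [[1 if i == a else (-1 if i == b else 0) for (a, b) in pairs]
            for i in range(M)]
```

For M = 3 these reproduce the two matrices shown above; AVA grows quadratically in M (6 tasks at M = 4), which is the usual trade-off against OVA's M tasks.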
24. BINARY SVM AS MAP REDUCE PRIMITIVES IN A SINGLE-GPU
On a single GPU (processors 1…P), one SMO iteration becomes:
- MAP: f_i → f_i′, and the index classification (α_i, k_i)
- LOCAL REDUCE: (k_i, f_i′) per processor
- GLOBAL REDUCE: (b_up, I_up), (b_low, I_low)
- Pre-MAP: compute the kernel rows k(x_i, x_j) = e^(−β‖x_i − x_j‖²), i = 1…n, j ∈ {I_up, I_low}
- MAP: α′_up, α′_low

Device state — static: (x_i, y_i); variable: (f_i, α_i, k_i, b, I, K), with kernel rows K held in an LRU cache.
26. EXPERIMENTS AND HARDWARE
Host: Ubuntu 8.10 64-bit; dual-socket Intel Xeon E5520; core frequency: 2.26 GHz; 145 GFlops; memory: 32 GB DDR3; memory bandwidth: 25.6 GB/s.
Device: 4x Tesla C1060; stream processors: 240 each; processor frequency: 1.3 GHz; 933 GFlops; memory: 4 GB GDDR3; memory bandwidth: 102 GB/s.
Host <-> Device: PCIe x16 (8 GB/s).

Runtimes compared:
• LIBSVM: single-threaded, double precision, sparse
• Hadoop: 4 VMs with one datanode each, Pegasos SVM, double precision, sparse
• Multicore: 8 worker threads in H-Dispatch, 1 block – 1 thread, double precision, dense
• Single GPU: 1 worker thread, 1 GPU, single precision, dense and sparse
• Multi-GPU: 4 worker threads, 4 GPUs, single precision, dense and sparse
27. PERFORMANCE RESULTS: DATASETS
SVM experiment setup:
- Same kernel type (RBF)
- Same regularization parameter C
- Same stopping criterion: 0.001
- SMO-based (except the Hadoop version)
- One vs All in multiclass problems
- 1 GB kernel cache

Dataset  | # Training | # Testing | (Features, Classes) | (C, β)
WEB      | 49749      | 14951     | (300, 2)            | (64, 7.8125)
MNIST    | 60000      | 10000     | (780, 10)           | (10, 0.125)
RCV1     | 518571     | 15564     | (47236, 53)         | (1, 0.1)
PROTEIN  | 17766      | 6621      | (357, 3)            | (10, 0.05)
SENSIT   | 78823      | 19705     | (100, 3)            | (1, 0.7)
28. PERFORMANCE RESULT COMPARISON
Dataset (Non-Zero %) | Metric       | LIBSVM   | Hadoop  | Multicore | Single GPU (Dense) | Multi GPU (Dense)
WEB (3%)             | Time (s)     | 2364.2   | 1698.7  | 912.81    | 154.3              | 73.6
                     | Gain (x)     | 1.00     | 1.39    | 2.59      | 15.32              | 32.12
                     | Accuracy (%) | 82.69    | 82.69   | 82.69     | 82.69              | 82.69
MNIST (19%)          | Time (s)     | 118943.5 | 66753.5 | 22873.75  | 2010.3             | 726.9
                     | Gain (x)     | 1.00     | 1.78    | 5.20      | 59.17              | 163.63
                     | Accuracy (%) | 95.76    | 95.76   | 95.76     | 95.76              | 95.76
RCV1 (0.1%)          | Time (s)     | 710664   | 231486  | N/A       | N/A                | N/A
                     | Gain (x)     | 1.00     | 3.07    | N/A       | N/A                | N/A
                     | Accuracy (%) | 94.67    | 94.67   | 94.67     | 94.67              | 94.67
PROTEIN (29%)        | Time (s)     | 861      | 717.5   | 260.12    | 32.93              | 16.06
                     | Gain (x)     | 1.00     | 1.20    | 3.31      | 26.15              | 53.61
                     | Accuracy (%) | 70.03    | 70.03   | 70.03     | 70.03              | 70.03
SENSIT (100%)        | Time (s)     | 8162     | 4295.78 | 2005.4    | 134.67             | 58.29
                     | Gain (x)     | 1.00     | 1.90    | 4.07      | 60.61              | 140.02
                     | Accuracy (%) | 83.46    | 83.46   | 83.46     | 83.46              | 83.46
29. ELLPACK-R (Vazquez et al. IEEE CIT 2010)
Dataset (Non-Zero %) | Metric       | Single GPU (Sparse) | Multi GPU (Sparse)
WEB (3%)             | Time (s)     | 107.35              | 57.3
                     | Gain (x)     | 22.02 (1.43)        | 41.26 (1.26)
                     | Accuracy (%) | 82.69               | 82.69
RCV1 (0.1%)          | Time (s)     | N/A                 | 3686
                     | Gain (x)     | N/A                 | 192.80
                     | Accuracy (%) | 94.67               | 94.67

RCV1 training time drops from ~8.2 days to ~1 hour.
30. CONCLUSIONS
CONCLUSIONS:
Constructed a MR runtime that satisfies the requirements of many ML algorithms and integrates GPUs.
Iterative stateful jobs
Multithreaded BLAS to prepare Map or Reduce Tasks
Static/Variable data
Tested the runtime on popular classification problems.
Delivered up to two orders of magnitude of acceleration using 4 GPUs
Compared different runtimes
LIMITATIONS:
H-Dispatch (pull model) depends on H→D state transfers
The relaxation of fault tolerance must be acceptable
When d >> n, MapReduce offers little benefit
31. FUTURE WORK
FUTURE:
GPU Technology:
- Concurrent Kernel Execution → maximize utilization
- GPUDirect → facilitate the Sort operation
- Distributed memory → intermediate results
- Shared CPU-GPU memory space
Communication:
- Cross-node performance
- GPU port abstraction
- In-node: cross-thread pointer exchange
- Out-of-node: MVAPICH2 and MVAPICH2-GPU
Algorithms:
- Requirements for incremental classification and clustering
32. CONCURRENT KERNEL EXECUTION
[Diagram: two CPU threads post tasks through a port/queue to the GPU.]
• CUDA Compute Capability 2.0 allows up to sixteen concurrent kernels.
• Concurrent kernels need to run in the same context.
33. INTEGRATING THE MPP IN THE MR CLUSTER ARCHITECTURE
Same per-node pipeline as the proposed runtime (DFS blocks → HMem/HState → DMem/DState → GPU Map/Reduce → cross-node exchange → state snapshot to the DFS every x iterations), extended with GPUDirect:
• GPU-to-GPU memory copy
• Communication with network devices
• Minimal communication through HState
34. PIPELINING/MEMCACHED
[Diagram: two DataNodes; Map tasks read DFS blocks and write intermediate results to in-memory stores (MEM) on memcached nodes rather than materializing them in the DFS; Reduce tasks consume them and write results back to DFS blocks.]
35. QUESTIONS
36. APPLICATION I: EVENT DETECTION USING TWEETS
Sakaki et al.: detect Tweet outbreaks about large-scale and infrequent events. Natural disasters: earthquakes, floods. Accidents: fires, road accidents.
37. APPLICATION I: EVENT DETECTION USING TWEETS
Goal: detect popular events at locations with a high volume of tweets.

Example tweets:
- "Listening to the New York Philharmonic, amazing performance"
- "Lots of people trying to enter the MSG for the Alice in Chains concert. I wish I had tickets."
- "Nassau County Museum of Art is looking for volunteers to greet, work in gift shop or perform clerical support."
38. APPLICATION I: FEATURE VECTOR
It/PRP is/VBZ a/DT good/JJ day/NN when/WRB the/DT CEO/NN
of/IN a/DT multinational/JJ ,/, multi-million/JJ
dollar/NN company/NN tells/VBZ you/PRP you/PRP 're/VBP
a/DT genius/NN ./.:/: D/NNP
Lots/NNS of/IN people/NNS trying/VBG to/TO enter/VB
the/DT MSG/NNP for/IN the/DT Alice/NNP in/IN
Chains/NNP concert/NN ./.I/PRP wish/VBP I/PRP
had/VBD tickets/NNS ./.
Feature vectors:

    h_i(x, y) = 1 if (x, y) contains ___, 0 otherwise

Example features: has unigram with POS; has bigram with POSs; has trigram with POSs; X1 is subject of X2; ….
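These indicator features can be sketched concretely. The helper names and the tiny feature list below are made up for illustration; a real extractor would hash a much larger n-gram vocabulary:

```python
def tokenize(tagged):
    # "Lots/NNS of/IN" -> [("Lots", "NNS"), ("of", "IN")]
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]

def feature_vector(tagged, unigram_feats, bigram_feats):
    toks = tokenize(tagged)
    unigrams = set(toks)
    bigrams = set(zip(toks, toks[1:]))
    # Each h_i fires iff the tagged tweet contains that unigram/bigram.
    h = [1 if u in unigrams else 0 for u in unigram_feats]
    h += [1 if b in bigrams else 0 for b in bigram_feats]
    return h

tweet = "Lots/NNS of/IN people/NNS trying/VBG to/TO enter/VB"
v = feature_vector(
    tweet,
    unigram_feats=[("people", "NNS"), ("concert", "NN")],
    bigram_feats=[(("Lots", "NNS"), ("of", "IN"))])
```

The resulting sparse binary vectors are exactly the ~400-dimensional inputs fed to the RBF-kernel SVM in the experiment that follows.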
39. APPLICATION I: EXPERIMENT
Used the NYC.com event calendar (Oct 9–11, 2009). Extracted ~400 features.

Title: Alice in Chains. Location: Madison Square Garden, 2 Penn Plaza, New York, NY, 10001. Description: "Alice in Chains has sold more than twenty million albums in the United States (and an estimated 40 million worldwide), released two number-one albums and 19 top-40 singles, and has received six Grammy nominations…"

EXPERIMENT 1:
• 2000 tweets from the same weekend (160 (8%) "Concert", 1840 (92%) "Background")
• RBF kernel (C=10, gamma=1.0). Testing on 20% → accuracy of 97%
• "False positives"

EXPERIMENT 2:
• 2000 tweets from the next weekend (160 (8%) "Concert", 1840 (92%) "Background")
• RBF kernel (C=10, gamma=1.0). Testing on 100% → accuracy of 93%
• "False positives" + "false negatives"
• After using NYC.com again → accuracy of 96%
40. APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD
30 × 96 = 2880 values
41. APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD