Scaling up Machine Learning Algorithms for Classification

Department of Mathematical Informatics
The University of Tokyo
Shin Matsushima

How can we scale up machine learning to massive datasets?
• Exploit hardware traits
– Disk I/O is the bottleneck
– Dual Cached Loops
– Run disk I/O and computation simultaneously
• Distributed asynchronous optimization (ongoing)
– Current work using multiple machines

LINEAR SUPPORT VECTOR MACHINES VIA DUAL CACHED LOOPS

• Intuition of linear SVM
– xi: i-th datapoint
– yi: i-th label. +1 or -1
– yi w·xi: larger is better, smaller is worse
(Figure: scatter plot of datapoints from the two classes, yi = +1 and yi = -1)

• Formulation of Linear SVM
– n: number of data points
– d: number of features
– Convex non-smooth optimization

• Formulation of Linear SVM
– Primal
– Dual
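For reference, a standard way to write the two problems, assuming the L1-loss (hinge) formulation of Hsieh et al. 2008 with a regularization constant C (the same C that is varied in the experiments); the symbol α for the dual variables is an assumption:

```latex
\begin{align*}
\text{Primal:}\quad & \min_{w \in \mathbb{R}^d}\; \frac{1}{2}\|w\|^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w \cdot x_i\bigr) \\
\text{Dual:}\quad & \min_{\alpha \in \mathbb{R}^n}\; D(\alpha) =
  \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i \cdot x_j
  - \sum_{i=1}^{n} \alpha_i
  \quad \text{s.t. } 0 \le \alpha_i \le C
\end{align*}
```

Under this formulation the primal solution is recovered from the dual one as w = Σi αi yi xi.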

• Coordinate Descent Method
– Each update solves a one-variable optimization problem in the coordinate being updated, with all other variables held fixed.

• Applying Coordinate Descent for Dual
formulation of SVM
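Assuming the L1-loss dual with box constraint 0 ≤ αi ≤ C and the vector w = Σj αj yj xj kept up to date, the one-variable problem for coordinate i has a closed-form solution. The notation (Gi for the partial derivative of the dual objective) is an assumption, following Hsieh et al. 2008:

```latex
\begin{align*}
G_i &= y_i\, w \cdot x_i - 1, \\
\alpha_i^{\text{new}} &= \min\Bigl(\max\Bigl(\alpha_i - \frac{G_i}{x_i \cdot x_i},\; 0\Bigr),\; C\Bigr), \\
w &\leftarrow w + \bigl(\alpha_i^{\text{new}} - \alpha_i\bigr)\, y_i\, x_i .
\end{align*}
```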

Dual Coordinate Descent [Hsieh et al. 2008]
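A minimal sketch of the update loop, assuming the L1-loss dual with box constraint 0 ≤ αi ≤ C and dense Python lists. The function name and its arguments are illustrative and are not taken from the authors' implementation.

```python
import random


def dual_coordinate_descent(X, y, C=1.0, epochs=10, seed=0):
    """Train an L1-loss linear SVM by dual coordinate descent (toy version)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n                      # dual variables, one per datapoint
    w = [0.0] * d                          # maintained as sum_i alpha_i * y_i * x_i
    q_diag = [sum(v * v for v in x) for x in X]
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)                 # visit datapoints in random order
        for i in order:
            if q_diag[i] == 0.0:
                continue                   # an all-zero datapoint cannot change w
            # Partial derivative of the dual objective with respect to alpha_i.
            grad = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            # Closed-form one-variable minimizer, clipped to [0, C].
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j in range(d):
                    w[j] += delta * y[i] * X[i][j]
                alpha[i] = new_alpha
    return w, alpha
```

Each inner step touches a single datapoint (xi, yi) plus the shared vector w, which is the property the later slides rely on.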

Attractive properties
• Suitable for large-scale learning
– Each update needs only one datapoint.
• Theoretical guarantees
– Linear convergence (cf. SGD)
• Shrinking [Joachims 1999]
– We can eliminate “uninformative” data

Shrinking [Joachims 1999]
• Intuition: a datapoint far from the current decision boundary is unlikely to become a support vector
(Figure: datapoints of the two classes on either side of the decision boundary)

Shrinking [Joachims 1999]
• Condition
• Available only in the dual problem
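One common form of such a test, stated here as an assumption since several variants exist: with the box constraint 0 ≤ αi ≤ C, a dual objective D, and a hypothetical threshold ε ≥ 0, datapoint i is treated as uninformative when its dual variable sits at a bound and the gradient pushes it further outward.

```latex
\[
\bigl(\alpha_i = 0 \ \text{and}\ \nabla_i D(\alpha) > \epsilon\bigr)
\quad\text{or}\quad
\bigl(\alpha_i = C \ \text{and}\ \nabla_i D(\alpha) < -\epsilon\bigr),
\qquad\text{where}\quad \nabla_i D(\alpha) = y_i\, w \cdot x_i - 1 .
\]
```

The test refers to αi, which exists only in the dual problem.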

Problem in scaling up to massive data
• With small-scale data, we first copy the entire dataset into main memory
• With large-scale data, the dataset does not fit in main memory, so we cannot copy it all at once
(Figure: data is read from disk into memory)

• Schemes when data cannot fit in memory
1. Block Minimization [Yu et al. 2010]
– Split the entire dataset into blocks so that each block fits in memory
– Read one block from disk into RAM, train on it, then move on to the next block

Block Minimization [Yu et al. 2010]
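A rough sketch of the read-then-train alternation, assuming dual coordinate descent as the per-block solver and a hypothetical `load_block(b)` callback that stands in for reading block b from disk. It illustrates the scheme only and is not the implementation of Yu et al.

```python
def block_minimization(load_block, num_blocks, dim, C=1.0,
                       outer_iters=5, inner_epochs=3):
    """Alternate between loading one block and training on it (toy version)."""
    w = [0.0] * dim
    alphas = {}                                    # dual variables, kept per block
    for _ in range(outer_iters):
        for b in range(num_blocks):
            X, y = load_block(b)                   # "disk read": the CPU waits here
            a = alphas.setdefault(b, [0.0] * len(X))
            for _ in range(inner_epochs):          # "train": the disk waits here
                for i, (x, yi) in enumerate(zip(X, y)):
                    q = sum(v * v for v in x)
                    if q == 0.0:
                        continue
                    grad = yi * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
                    new_a = min(max(a[i] - grad / q, 0.0), C)
                    delta = new_a - a[i]
                    for j in range(dim):
                        w[j] += delta * yi * x[j]
                    a[i] = new_a
    return w
```

The two comments mark where one resource sits idle while the other works.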

2. Selective Block Minimization [Chang and Roth 2011]
– Keep “informative data” in memory
– Read a new block from disk and train on it together with the informative data kept in RAM

Selective Block Minimization [Chang and Roth 2011]

• Previous schemes alternate between CPU and disk I/O
– Training (CPU) is idle while reading
– Reading (disk I/O) is idle while training

• We want to exploit modern hardware
1. Multicore processors are commonplace
2. CPU (memory I/O) is often 10-100 times faster than hard disk I/O

Dual Cached Loops
1. Make the reader and the trainer run simultaneously and almost asynchronously.
2. The trainer updates the parameter many times faster than the reader loads new datapoints.
3. Keep informative data in main memory (i.e., evict primarily uninformative data from main memory).
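The three principles can be sketched with two Python threads sharing a small in-memory cache. This is a toy illustration under several assumptions, not the authors' implementation: `disk` is an in-memory list standing in for a file, the trainer uses the dual coordinate descent update, the eviction test with threshold `eps` is one possible choice, and the reader drops a random cached point when the cache is full. All names and parameters are hypothetical.

```python
import random
import threading


def dual_cached_loops(disk, dim, C=1.0, capacity=4, reader_passes=3,
                      final_updates=200, eps=1e-3, seed=0):
    """Toy reader/trainer pair; `disk` is a list of (x, y), capacity >= 1."""
    w = [0.0] * dim
    alpha = [0.0] * len(disk)   # dual variables are kept for every datapoint
    cache = {}                  # working set: index -> (x, y) currently in RAM
    lock = threading.Lock()
    reader_done = threading.Event()

    def reader():
        rng = random.Random(seed)
        for _ in range(reader_passes):
            for i, (x, y) in enumerate(disk):      # sequential scan of the "disk"
                with lock:
                    if i in cache:
                        continue
                    if len(cache) >= capacity:     # RAM full: drop a random point
                        del cache[rng.choice(list(cache))]
                    cache[i] = (x, y)
        reader_done.set()

    def trainer():
        rng = random.Random(seed + 1)
        remaining = final_updates   # extra steps taken after the reader finishes
        while remaining > 0:
            if reader_done.is_set():
                remaining -= 1
            with lock:
                if not cache:
                    continue
                i = rng.choice(list(cache))
                x, y = cache[i]
                q = sum(v * v for v in x)
                grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
                # Evict points at a bound whose gradient pushes further outward.
                if (alpha[i] <= 0.0 and grad > eps) or (alpha[i] >= C and grad < -eps):
                    del cache[i]
                    continue
                if q > 0.0:
                    new_alpha = min(max(alpha[i] - grad / q, 0.0), C)
                    delta = new_alpha - alpha[i]
                    for j in range(dim):
                        w[j] += delta * y * x[j]
                    alpha[i] = new_alpha

    threads = [threading.Thread(target=reader), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w, alpha
```

In CPython the global interpreter lock prevents real parallel speedup here, so the sketch shows only the structure: the reader keeps refilling the cache while the trainer keeps updating the parameter.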

(Figure: Dual Cached Loops. The reader thread reads data from disk into RAM; the trainer thread updates the parameter from the data in RAM. W: working index set.)

Which data is “uninformative”?
• A datapoint far from the current decision boundary is unlikely to become a support vector
• Ignore the datapoint for a while.
(Figure: datapoints of the two classes on either side of the decision boundary)

Which data is “uninformative”?
– Condition

• Datasets with Various Characteristics:
• 2GB of memory for storing datapoints
• Measured relative function value

• Comparison with (Selective) Block Minimization (implemented in Liblinear)
– ocr: dense, 45GB
– dna: dense, 63GB
– webspam: sparse, 20GB
– kddb: sparse, 4.7GB

• When C gets larger (dna, C=1)
• When C gets larger (dna, C=10)
• When memory gets larger (ocr, C=1)

• Expanding Features on the fly
– Expand features explicitly when the reader thread
loads an example into memory.
• Read (y,x) from the Disk
• Compute f(x) and load (y,f(x)) into RAM
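The slides do not say which feature map f is used. As a purely hypothetical stand-in, the sketch below expands a DNA string into k-mer counts at the moment it is read, so only the short raw string needs to live on disk. The names `kmer_features` and `read_and_expand` and the "label sequence" line format are assumptions.

```python
from itertools import product


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Map a string to sparse k-mer counts: {feature index: count}."""
    index = {"".join(p): j for j, p in enumerate(product(alphabet, repeat=k))}
    feats = {}
    for s in range(len(seq) - k + 1):
        j = index.get(seq[s:s + k])        # k-mers with unknown letters are skipped
        if j is not None:
            feats[j] = feats.get(j, 0.0) + 1.0
    return feats


def read_and_expand(lines, k=2):
    """Yield (y, f(x)) pairs from lines of the form '<label> <sequence>'."""
    for line in lines:
        label, seq = line.split()
        yield int(label), kmer_features(seq, k)
```

For example, `list(read_and_expand(["+1 GTCC"]))` produces a single pair whose feature dictionary has one entry each for GT, TC, and CC.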
(Figure: the reader reads x = GTCCCACCT… from disk and loads f(x) ∈ R^12495340 into RAM)

• 2TB data, 16GB memory, 10hrs
– 50M examples, 12M features, corresponding to 2TB

• Summary
– Linear SVM optimization when data cannot fit in memory
– Use the scheme of Dual Cached Loops
– Outperforms state of the art by orders of magnitude
– Can be extended to
• Logistic regression
• Support vector regression
• Multiclass classification

DISTRIBUTED ASYNCHRONOUS OPTIMIZATION (CURRENT WORK)

Future/Current Work
• Apply the same principle as Dual Cached Loops to multi-machine algorithms
– Data can be transferred efficiently without harming optimization performance
– The key is to run communication and computation simultaneously and asynchronously
– Can we handle the more sophisticated communication that arises in multi-machine optimization?

• (Selective) Block Minimization scheme for large-scale SVM
(Figure: “Move data” from the HDD/file system to one machine; “Process optimization” on one machine)

• Map-Reduce scheme for multi-machine algorithm
(Figure: “Move parameters” between the master node and the worker nodes; “Process optimization” on the worker nodes)

Stratified Stochastic Gradient Descent [Gemulla, 2011]


Asynchronous multi-machine scheme
(Figure: parameter communication and parameter updates run side by side)
• Each machine holds a subset of the data
• Machines keep communicating a portion of the parameters to each other
• At the same time, each machine updates the parameters it currently holds

• Distributed stochastic gradient descent for saddle point problems
– Another formulation of SVM (Regularized Risk Minimization in general)
– Suitable for parallelization
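One standard way to obtain such a saddle point problem, given here as an assumption about what the formulation could look like: rewrite each hinge loss as max(0, 1 − z) = max over a in [0, 1] of a(1 − z), which turns the primal SVM with regularization constant C into

```latex
\[
\min_{w \in \mathbb{R}^d}\ \max_{\alpha \in [0,1]^n}\
\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \alpha_i \bigl(1 - y_i\, w \cdot x_i\bigr)
\]
```

Every term in the sum involves a single αi, and through w·xi = Σj wj xij it splits further into terms that each touch one αi and one wj. This is a plausible reason why stochastic updates can be spread across machines.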

How can we scale up machine learning to massive datasets?
• Exploit hardware traits
– Disk I/O is the bottleneck
– Run disk I/O and computation simultaneously
• Distributed asynchronous optimization (ongoing)
– Current work using multiple machines
