Performance and scalability for machine learning

Performance and
scalability for machine
learning.
Arnaud Rachez (arnaud.rachez@gmail.com)
!
November 2nd, 2015

Outline
• Performance (7mn)
• Parallelism (7mn)
• Scalability (10mn)

Numbers everyone should know
(2015 update)
3
Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
ThrougputGB/s
0
200
400
600
800
L1 L2 L3 RAM
ThrougputGB/s
0
7,5
15
22,5
30
RAM SSD Network
~800MB ~1.25GB~30GB
source: http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/ source: http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/

Outline
• Performance (5-7mn)
• Parallelism (5-7mn)
• Scalability (7-10mn)

Optimising SGD
• Linear regression (like)
stochastic gradient descent
with d=5 features and
n=1,000,000 examples.
• Using Python (1), Numba (2),
Numpy (3) and Cython (4)
(https://gist.github.com/zermelozf/
3cd06c8b0ce28f4eeacd)
• Also compared it to pure C++
code (https://gist.github.com/
zermelozf/
4df67d14f72f04b4338a)
(1)
(2)
(3)
(4)

Runtime optimisation
6
Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd

6
Optimisation strategies (d=5 & n=1,000,000)
time(ms)
1
10
100
1000
10000
Python Numpy Cython Numba c++

6
Optimisation strategies (d=5 & n=1,000,000)
time(ms)
1
10
100
1000
10000
Python Numpy Cython Numba c++
M
em
ory
views
M
em
ory
views
pointers
pointers?
M
em
ory
views?

7
Cache optimisation (d=5 & n=1,000,000)
time(ms)
0
40
80
120
160
Numba c++ cython
random linear

7
time(ms)
0
40
80
120
160
Numba c++ cython
random linear
Cache miss
Cache miss
Cache miss

7
time(ms)
0
40
80
120
160
Numba c++ cython
random linear
Cache hit
Cache hitCache miss
Cache miss
Cache miss
Cache hit

(d>>1) Gensim word2vec case study
• Elman style RNN trained with
SGD: 15,079×200 matrix on a
1M word corpus.
• Baseline written by Tomas
Mikolov in optimised C.
• Rewritten by Radim Rehurec in
python.
• Optimised by Radim Rehurec
using Cython, BLAS…
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/

1M word corpus.
python.
Original C
Numpy
Cython
Cython + BLAS
Cython + BLAS + sigmoid table
word/sec (x1000)
0 30 60 90 120

1M word corpus.
python.
Original C
Numpy
Cython
Cython + BLAS
word/sec (x1000)
0 30 60 90 120
pointers
pointers
pointers

What’s this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
• vectorized y = alpha*x !
• replaced 3 lines of code!
• translated into a 3x speedup over Cython alone!
• please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
**On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS.
Similarly, Intel’s MKL, AMD’s AMCL, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.

Hardware trends: CPU
11
Numberofcores
0
1
2
3
4
ClockspeedMhz
0
1000
2000
3000
4000
1970 1975 1980 1985 1990 1995 2000 2005 2010 2015
Clock speed (Mhz) #Cores
Source: http://www.gotw.ca/publications/concurrency-ddj.htm

(d>>1) Gensim word2vec continued
1M word corpus.
python.
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/

1M word corpus.
python.
• … and parallelised with threads!

1M word corpus.
python.
1 thread
2 threads
3 threads
4 threads
word/sec (x1000)
0 100 200 300 400
Original C

1M word corpus.
python.
1 thread
2 threads
3 threads
4 threads
word/sec (x1000)
0 100 200 300 400
Original C
2.85x speedup

(d>>1) Hogwild!on SAG
• Fabian’s experimentation with Julia (lang).
• Running SAG in
parallel, without
a lock.

(d>>1) Hogwild!on SAG
• Fabian’s experimentation with Julia (lang).
• Running SAG in
parallel, without
a lock.
• Very nice
speed up!!!

Data and does not ﬁt in memory…
Stream data from disk…
… but you cannot read in parallel…
Producer/Consumer pattern
chunk 1 chunk 2 chunk 3 chunk 4 chunk 5 chunk 6 …
…
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
Et cetera…

job 1 …
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
Et cetera…

job 2 …
job 1 …
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
Et cetera…

job 3 …
job 1
job 2
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
Et cetera…

job 3 job 4 …
job 1
job 2
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
Et cetera…

job 3 job 4 job 5 …
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
done
done
Et cetera…

job 5 …
job 3
job 4
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
done
done
Et cetera…

job 5 …
job 4
…
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
done
done
done
Et cetera…

…
job 4
job 5 …
…thread 2"
(consumer)
thread 2"
(consumer)
thread 1"
(producer)
done
done
done
Et cetera…

How many consumers?
It depends…
!
• Gensim (R. Rehurec)
• Saw the impact up to 4 consumers earlier 
• Vowpal Wabbit (J. Langford)
• Claims no gain with more than 1 consumer!
• 2’10’’ on my macbook pro for ~10GB and 50MM lines  
(Criteo’s advertising dataset).
!
• CNNs pre-processing (S. Dieleman)
• Big impact with ?? (several) consumers!
• Useful for data augmentation/preprocessing

5.3GB (~105MM lines) word count
0
55
110
165
220
Number of consumers
1 2 3 4 5 6
Word count java benchmark
source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
• Macbook pro 15’’ 2014
• `sudo purge`

Hardware trends: HDD
Capacity(GB)
0
150
300
450
600
Timetoread(sec)
0
1 000
2 000
3 000
4 000
1979 1983 1993 1998 1999 2001 2003 2008 2011
Read full disk (sec.) Capacity (GB)
Source: https://tylermuth.wordpress.com/2011/11/02
18

Distributed computing
19
Scalability - A perspective on Big data

19
• Strong scaling: if you throw twice as many machines at
the task, you solve it in half the time. 
Usually relevant when the task is CPU bound.

19
• Weak scaling: if the dataset is twice as big, throw twice
as many machines at it to solve the task in constant time. 
Memory bound tasks… usually.

19
• Weak scaling: if the dataset is twice as big, throw twice
as many machines at it to solve the task in constant time. 
Memory bound tasks… usually.
Most “big data” problems are I/O bound. Hard to solve the task in an
acceptable time independently of the size of the data (weak scaling).

Bring computation to data
20
Map-Reduce: Statistical query model

20
the sum corresponds

to a reduce operation

20
f, the map function, is

sent to every machine
the sum corresponds


20
f, the map function, is

sent to every machine
the sum corresponds

• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning
Decision Trees. Int. J. Hybrid Intell. Syst. 2004

• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.

Spark on Criteo’s data
!
• Logistic regression trained with
minibatch SGD"
• 10GB of data (50MM lines).  
Caveat: Quite small for a benchmark
• Super linear strong
scalability.  
Not theoretically possible => small
dataset + few instances saturate.
Numberofcores
0
10
20
30
40
timeinsec.
0
325
650
975
1300
Number of AWS nodes
4 6 8 10
time (sec) #cores

Spark on Criteo’s data
!
• Logistic regression trained with
minibatch SGD"
• 10GB of data (50MM lines).  
Caveat: Quite small for a benchmark
• Super linear strong
scalability.  
Not theoretically possible => small
dataset + few instances saturate.
Numberofcores
0
10
20
30
40
timeinsec.
0
325
650
975
1300
Number of AWS nodes
4 6 8 10
time (sec) #cores
Manual setup of the cluster
was a bit painful…

Software stack for big data
22

22
Local Standalone YARNMESOS
Cluster"
manager

22
HDFS Tachyon Cassandra HBase Others…
Cluster"
manager
Storage "
layer

22
Spark"
Memory-optimised execution
engine
Flink"
Apache incubated excution
engine.
Hadoop MR 2"
Cluster"
manager
Storage "
layer
Execution "
layer

22
Spark"
engine
Flink"
engine.
Hadoop MR 2"
MLlib
GraphX
Streaming
SQL/
Datafra
me
Cluster"
manager
Storage "
layer
Execution "
layer
Libraries

22
Spark"
engine
Flink"
engine.
Hadoop MR 2"
MLlib
GraphX
Streaming
SQL/
Datafra
me
FlinkML
Gelly
(graph)
TableAPI
Batch
Cluster"
manager
Storage "
layer
Execution "
layer
Libraries

Software stack MESOS vs YARN
23

23
• Standalone mode is fastest…

23
• … but resources are requested for the entire job.

23
Cluster management frameworks

23
• Concurrent access (multiuser)

23
• Hyperparameter tuning (multijob)

23
• Hyperparameter tuning (multijob)
Mesos YARN
• Framework receive offers
• Easy install on AWS, GCE
• Lots of compatible frameworks:
Spark, MPI, Cassandra,
HDFS…
• Mesosphere’s DCOS is really,
really easy to use.
• Frameworks make offers
• Conﬁguration hell (can be
made easier with puppet/
ansible recipes
• Several compatible
frameworks: Spark, Flink,
HDFS…

Infrastructure stack
• AWS = AWeSome
• Basic instance with spot price
!
!
!
!
• Graphical Network Designer

• AWS = AWeSome
• Basic instance with spot price
!
!
!
!
• Graphical Network Designer
10 r2.2xlarge instances for
(350GB mem. & 40 cores)
0.85$/hour

VPC
Subnets
public/private

VPC
Subnets
public/private
Security rules

VPC
Subnets
public/private
Security rules
Bootstrap conﬁg
for master/slaves

VPC
Subnets
public/private
Security rules
Bootstrap conﬁg
for master/slaves
Network
entry point

Source: https://aws.amazon.com/architecture/

What’s coming in the next few years ?
BONUS

Performance and scalability for machine learning

Recommended

Recommended

More Related Content

What's hot

What's hot (10)

Viewers also liked

Viewers also liked (20)

Similar to Performance and scalability for machine learning

Similar to Performance and scalability for machine learning (20)

Recently uploaded

Recently uploaded (20)

Performance and scalability for machine learning