Performance and scalability for machine learning
Arnaud Rachez (arnaud.rachez@gmail.com)
November 2nd, 2015
Outline
• Performance (7 min)
• Parallelism (7 min)
• Scalability (10 min)
Numbers everyone should know (2015 update)
Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
[Bar charts: memory throughput in GB/s for the L1, L2 and L3 caches and RAM (scale 0-800 GB/s), and for RAM (~30 GB/s), SSD (~1.25 GB/s) and network (~800 MB/s) on a 0-30 GB/s scale]
source: http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/
source: http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/
Outline
• Performance (5-7 min)
• Parallelism (5-7 min)
• Scalability (7-10 min)
Optimising SGD
• Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.
• Using Python (1), Numba (2), Numpy (3) and Cython (4) (https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd)
• Also compared it to pure C++ code (https://gist.github.com/zermelozf/4df67d14f72f04b4338a)
[Code listings (1)-(4) shown on the slide]
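The four benchmarked implementations live in the gists linked above; as a reference point, a minimal pure-Python version of the kind of loop being benchmarked might look like this (synthetic data and hyperparameters are illustrative, not the gist's code; n is kept small here):

```python
import random

def sgd_linear(data, d, lr=0.01, epochs=5):
    """Plain stochastic gradient descent for least-squares linear regression."""
    w = [0.0] * d
    for _ in range(epochs):
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y  # gradient of 0.5*err**2 w.r.t. pred
            for i in range(d):
                w[i] -= lr * err * x[i]
    return w

# Synthetic problem with d=5 features (the slides use n=1,000,000).
random.seed(0)
true_w = [1.0, -2.0, 0.5, 3.0, -1.0]
data = []
for _ in range(2000):
    x = [random.uniform(-1, 1) for _ in range(5)]
    data.append((x, sum(wi * xi for wi, xi in zip(true_w, x))))

w = sgd_linear(data, d=5)
```

The inner loop is exactly what Numba JIT-compiles and what Cython translates to C; the pure-Python version pays interpreter overhead on every arithmetic operation, which is where the orders-of-magnitude gap on the next slide comes from.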
Runtime optimisation
Optimisation strategies (d=5 & n=1,000,000)
[Log-scale bar chart: time (ms), 1 to 10,000, for Python, Numpy, Cython, Numba and C++]
Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
(Slide annotations: memory views? pointers?)
Runtime optimisation
Cache optimisation (d=5 & n=1,000,000)
[Bar chart: time (ms), 0-160, for Numba, C++ and Cython, comparing random vs linear access order]
Random access order causes cache misses; linear access order gives cache hits.
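The random-vs-linear effect can be sketched with a toy traversal. In pure Python interpreter overhead mutes most of the gap, but in C, Cython or Numba the access pattern dominates; the point is that both orders compute the same answer while touching memory very differently (array sizes here are illustrative):

```python
import random
import time

n = 1000000
data = list(range(n))
linear_order = list(range(n))          # sequential, prefetcher-friendly
random_order = linear_order[:]
random.Random(0).shuffle(random_order)  # same indices, scattered access

def traverse(order):
    """Sum data[] in the given index order, returning (total, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start

total_lin, t_lin = traverse(linear_order)
total_rnd, t_rnd = traverse(random_order)
# Same total either way; the random order typically runs slower because
# successive loads miss the cache and defeat the hardware prefetcher.
```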
(d>>1) Gensim word2vec case study
• Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Rehurec in Python.
• Optimised by Radim Rehurec using Cython, BLAS…
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
[Bar chart: words/sec (×1000), 0-120, for Original C, Numpy, Cython, Cython + BLAS, and Cython + BLAS + sigmoid table]
What’s this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
• vectorised y = alpha*x
• replaced 3 lines of code
• translated into a 3x speedup over Cython alone
• please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
**On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s ACML, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.
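The operation in question is a BLAS level-1 routine of the saxpy family (y ← a·x + y, exposed in SciPy as scipy.linalg.blas.saxpy). A pure-Python sketch of what the single vectorised call computes, to make the "replaced 3 lines of code" concrete:

```python
def saxpy(a, x, y):
    """Pure-Python equivalent of the BLAS level-1 saxpy routine: y <- a*x + y.
    One optimised BLAS call (e.g. scipy.linalg.blas.saxpy) replaces this loop,
    running SIMD code from the vendor library (vecLib, MKL, ATLAS...)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# result == [12.0, 24.0, 36.0]
```

The speedup comes not from the loop being clever but from delegating it to hand-tuned, vectorised native code.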
Outline
• Performance (5-7 min)
• Parallelism (5-7 min)
• Scalability (7-10 min)
Hardware trends: CPU
[Chart, 1970-2015: clock speed (MHz, scale 0-4000) and number of cores (scale 0-4); clock speed plateaus around 2005 while core counts start rising]
Source: http://www.gotw.ca/publications/concurrency-ddj.htm
(d>>1) Gensim word2vec continued
• Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Rehurec in Python.
• Optimised by Radim Rehurec using Cython, BLAS…
• … and parallelised with threads!
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
[Bar chart: words/sec (×1000), 0-400, for 1-4 threads, compared against Original C and Cython + BLAS + sigmoid table]
2.85x speedup
(d>>1) Hogwild! on SAG
• Fabian’s experimentation with Julia (lang).
• Running SAG in parallel, without a lock.
• Very nice speed up!
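A minimal sketch of the lock-free idea: several threads apply SGD-style updates to shared weights with no synchronisation at all. This uses plain SGD rather than SAG, the data is synthetic, and note that CPython's GIL serialises the bytecode (Julia or Cython with nogil is where the parallel speedup actually materialises); the sketch only illustrates that the racy updates are tolerable:

```python
import threading
import random

d = 5
w = [0.0] * d  # shared weights, updated by every thread with no lock

def worker(seed, n_updates=1000, lr=0.01):
    rng = random.Random(seed)
    for _ in range(n_updates):
        x = [rng.uniform(-1, 1) for _ in range(d)]
        y = sum(x)  # true weights are all 1.0
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(d):
            w[i] -= lr * err * x[i]  # racy read-modify-write, Hogwild!-style

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite occasional lost updates, w ends up close to [1, 1, 1, 1, 1]:
# individual steps are small, so sparse conflicts rarely hurt convergence.
```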
Data does not fit in memory…
Stream data from disk…
… but you cannot read in parallel…
Producer/Consumer pattern
[Animated diagram: thread 1 (the producer) reads chunks 1-6 from disk and enqueues them as jobs; consumer threads pull jobs off the queue, process them, and mark them done. Et cetera…]
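The diagram above maps directly onto Python's queue.Queue. In this sketch the "disk read" is simulated by generating a list per chunk, and the bounded queue is what provides back-pressure (the producer blocks when consumers lag):

```python
import threading
import queue

q = queue.Queue(maxsize=2)  # bounded buffer: producer blocks when full
results = []
results_lock = threading.Lock()

def producer(n_chunks):
    """Thread 1: reads chunks sequentially (a list stands in for a disk read)."""
    for chunk_id in range(n_chunks):
        chunk = list(range(chunk_id * 10, chunk_id * 10 + 10))
        q.put(chunk)   # blocks if the queue is full
    q.put(None)        # one poison pill per consumer to signal shutdown
    q.put(None)

def consumer():
    """Consumer threads: process chunks as they become available."""
    while True:
        chunk = q.get()
        if chunk is None:          # shutdown signal
            break
        partial = sum(chunk)       # stands in for real work on the chunk
        with results_lock:
            results.append(partial)

consumers = [threading.Thread(target=consumer) for _ in range(2)]
for c in consumers:
    c.start()
threading.Thread(target=producer, args=(6,)).start()
for c in consumers:
    c.join()
total = sum(results)  # 1770: same answer as a sequential pass over all chunks
```

Only the producer touches the disk, so reads stay sequential, while the consumers overlap their processing with the next read.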
How many consumers?
It depends…
• Gensim (R. Rehurec)
• Saw the impact of up to 4 consumers earlier
• Vowpal Wabbit (J. Langford)
• Claims no gain with more than 1 consumer
• 2’10’’ on my MacBook Pro for ~10GB and 50MM lines (Criteo’s advertising dataset)
• CNNs pre-processing (S. Dieleman)
• Big impact with ?? (several) consumers
• Useful for data augmentation/preprocessing
5.3GB (~105MM lines) word count
[Bar chart: time (0-220 s) vs number of consumers (1-6)]
Word count Java benchmark
source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
• MacBook Pro 15’’ 2014
• `sudo purge`
Outline
• Performance (5-7 min)
• Parallelism (5-7 min)
• Scalability (7-10 min)
Hardware trends: HDD
[Chart, 1979-2011: disk capacity (GB, scale 0-600) vs time to read the full disk (sec, scale 0-4000); capacity has grown much faster than read speed]
Source: https://tylermuth.wordpress.com/2011/11/02
Distributed computing
Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Memory-bound tasks… usually.
Most “big data” problems are I/O bound. It is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
Bring computation to data
Map-Reduce: Statistical query model
f, the map function, is sent to every machine; the sum corresponds to a reduce operation.
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
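A toy sketch of the statistical query model as Chu et al. frame it: each machine maps its data partition to a local sufficient statistic (here, the summed gradient of a linear-regression loss), and the reduce step just adds the statistics together. The sketch runs sequentially, with a list of shards standing in for the machines, and the data is illustrative:

```python
from functools import reduce

true_w = [2.0, -1.0]
partitions = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],  # machine 1's shard of (x, y) pairs
    [([1.0, 1.0], 1.0), ([2.0, 0.0], 4.0)],   # machine 2's shard
]
w = [0.0, 0.0]

def local_gradient(shard):
    """Map step, run on every machine: summed gradient over the local shard."""
    g = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi
    return g

mapped = [local_gradient(shard) for shard in partitions]  # f sent to every machine
grad = reduce(lambda a, b: [ai + bi for ai, bi in zip(a, b)], mapped)  # the sum
w = [wi - 0.1 * gi for wi, gi in zip(w, grad)]  # one batch-gradient step
```

Because the gradient of the summed loss decomposes as a sum over examples, this distributed step is numerically identical to the single-machine one; that decomposability is what makes an algorithm fit the Map-Reduce mold.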
Spark on Criteo’s data
• Logistic regression trained with minibatch SGD
• 10GB of data (50MM lines). Caveat: quite small for a benchmark.
• Super-linear strong scalability. Not theoretically possible => small dataset + few instances saturate.
[Chart: time (sec, 0-1300) and number of cores (0-40) vs number of AWS nodes (4-10)]
Manual setup of the cluster was a bit painful…
Software stack for big data
• Cluster manager: Local, Standalone, MESOS, YARN
• Storage layer: HDFS, Tachyon, Cassandra, HBase, others…
• Execution layer: Spark (memory-optimised execution engine), Flink (Apache-incubated execution engine), Hadoop MR 2
• Libraries: on Spark, MLlib, GraphX, Streaming, SQL/DataFrame; on Flink, FlinkML, Gelly (graph), Table API, Batch
Software stack: MESOS vs YARN
• Standalone mode is fastest…
• … but resources are requested for the entire job.
Cluster management frameworks help with:
• Concurrent access (multiuser)
• Hyperparameter tuning (multijob)
Mesos
• Frameworks receive resource offers
• Easy install on AWS, GCE
• Lots of compatible frameworks: Spark, MPI, Cassandra, HDFS…
• Mesosphere’s DCOS is really, really easy to use.
YARN
• Frameworks make resource requests
• Configuration hell (can be made easier with puppet/ansible recipes)
• Several compatible frameworks: Spark, Flink, HDFS…
Infrastructure stack
• AWS = AWeSome
• Basic instance with spot price: 10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour
• Graphical Network Designer
Infrastructure stack
[Diagram: VPC with public/private subnets, security rules, bootstrap config for master/slaves, and a network entry point]
Source: https://aws.amazon.com/architecture/
- Questions -
“Thank you”
What’s coming in the next few years?
BONUS

Performance and scalability for machine learning

  • 1.
    Performance and scalability formachine learning. Arnaud Rachez (arnaud.rachez@gmail.com) ! November 2nd, 2015
  • 2.
    Outline • Performance (7mn) •Parallelism (7mn) • Scalability (10mn)
  • 3.
    Numbers everyone shouldknow (2015 update) 3 Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html ThrougputGB/s 0 200 400 600 800 L1 L2 L3 RAM ThrougputGB/s 0 7,5 15 22,5 30 RAM SSD Network ~800MB ~1.25GB~30GB source: http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/ source: http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/
  • 4.
    Outline • Performance (5-7mn) •Parallelism (5-7mn) • Scalability (7-10mn)
  • 5.
    Optimising SGD • Linearregression (like) stochastic gradient descent with d=5 features and n=1,000,000 examples. • Using Python (1), Numba (2), Numpy (3) and Cython (4) (https://gist.github.com/zermelozf/ 3cd06c8b0ce28f4eeacd) • Also compared it to pure C++ code (https://gist.github.com/ zermelozf/ 4df67d14f72f04b4338a) (1) (2) (3) (4)
  • 6.
  • 7.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 8.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 9.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 10.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 11.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 12.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 13.
    Runtime optimisation 6 Optimisation strategies(d=5 & n=1,000,000) time(ms) 1 10 100 1000 10000 Python Numpy Cython Numba c++ Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd M em ory views M em ory views pointers pointers? M em ory views?
  • 14.
  • 15.
    Runtime optimisation 7 Cache optimisation(d=5 & n=1,000,000) time(ms) 0 40 80 120 160 Numba c++ cython random linear
  • 16.
    Runtime optimisation 7 Cache optimisation(d=5 & n=1,000,000) time(ms) 0 40 80 120 160 Numba c++ cython random linear
  • 17.
    Runtime optimisation 7 Cache optimisation(d=5 & n=1,000,000) time(ms) 0 40 80 120 160 Numba c++ cython random linear Cache miss Cache miss Cache miss
  • 18.
    Runtime optimisation 7 Cache optimisation(d=5 & n=1,000,000) time(ms) 0 40 80 120 160 Numba c++ cython random linear Cache hit Cache hitCache miss Cache miss Cache miss Cache hit
  • 19.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
  • 20.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120
  • 21.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120
  • 22.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120
  • 23.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120
  • 24.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120
  • 25.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120
  • 26.
    (d>>1) Gensim word2veccase study • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ Original C Numpy Cython Cython + BLAS Cython + BLAS + sigmoid table word/sec (x1000) 0 30 60 90 120 pointers pointers pointers
  • 27.
    What’s this BLASmagic? Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx • vectorized y = alpha*x ! • replaced 3 lines of code! • translated into a 3x speedup over Cython alone! • please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ **On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s AMCL, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.
  • 28.
    Outline • Performance (5-7mn) •Parallelism (5-7mn) • Scalability (7-10mn)
  • 29.
    Hardware trends: CPU 11 Numberofcores 0 1 2 3 4 ClockspeedMhz 0 1000 2000 3000 4000 19701975 1980 1985 1990 1995 2000 2005 2010 2015 Clock speed (Mhz) #Cores Source: http://www.gotw.ca/publications/concurrency-ddj.htm
  • 30.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
  • 31.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ • … and parallelised with threads!
  • 32.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ 1 thread 2 threads 3 threads 4 threads word/sec (x1000) 0 100 200 300 400 Original C Cython + BLAS + sigmoid table • … and parallelised with threads!
  • 33.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ 1 thread 2 threads 3 threads 4 threads word/sec (x1000) 0 100 200 300 400 Original C Cython + BLAS + sigmoid table • … and parallelised with threads!
  • 34.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ 1 thread 2 threads 3 threads 4 threads word/sec (x1000) 0 100 200 300 400 Original C Cython + BLAS + sigmoid table • … and parallelised with threads!
  • 35.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ 1 thread 2 threads 3 threads 4 threads word/sec (x1000) 0 100 200 300 400 Original C Cython + BLAS + sigmoid table • … and parallelised with threads!
  • 36.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ 1 thread 2 threads 3 threads 4 threads word/sec (x1000) 0 100 200 300 400 Original C Cython + BLAS + sigmoid table • … and parallelised with threads!
  • 37.
    (d>>1) Gensim word2veccontinued • Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Rehurec in python. • Optimised by Radim Rehurec using Cython, BLAS… Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ 1 thread 2 threads 3 threads 4 threads word/sec (x1000) 0 100 200 300 400 Original C Cython + BLAS + sigmoid table • … and parallelised with threads! 2.85x speedup
  • 38.
    (d>>1) Hogwild!on SAG •Fabian’s experimentation with Julia (lang). • Running SAG in parallel, without a lock.
  • 39.
    (d>>1) Hogwild!on SAG •Fabian’s experimentation with Julia (lang). • Running SAG in parallel, without a lock. • Very nice speed up!!!
  • 40.
Data does not fit in memory…
Stream data from disk… but you cannot read in parallel…
Producer/Consumer pattern:
• Thread 1 (the producer) reads chunk 1, chunk 2, chunk 3, … sequentially from disk and turns each chunk into a job.
• Consumer threads (thread 2, thread 3, …) pick up jobs as soon as they are available and mark them done, while the producer keeps reading ahead.
• Et cetera…
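The pattern above can be sketched with Python's `queue.Queue`; all names are illustrative. One producer thread streams fixed-size chunks from disk while consumer threads process them:

```python
import queue
import threading

def stream_chunks(path, jobs, n_consumers, chunk_size=1 << 20):
    """Producer: the single thread allowed to read from disk."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            jobs.put(chunk)               # blocks when the queue is full
    for _ in range(n_consumers):
        jobs.put(None)                    # one poison pill per consumer

def process_chunks(jobs, results):
    """Consumer: processes chunks as the producer hands them over."""
    done = 0
    while (chunk := jobs.get()) is not None:
        done += len(chunk)                # stand-in for real work (parsing, SGD, …)
    results.put(done)

def run(path, n_consumers=2):
    jobs = queue.Queue(maxsize=4)         # bounded: the producer cannot run away
    results = queue.Queue()
    consumers = [threading.Thread(target=process_chunks, args=(jobs, results))
                 for _ in range(n_consumers)]
    for t in consumers:
        t.start()
    stream_chunks(path, jobs, n_consumers)  # producer runs in this thread
    for t in consumers:
        t.join()
    return sum(results.get() for _ in range(n_consumers))
```

The bounded queue is the design point: it caps how far the reader can get ahead of the workers, so memory stays constant however large the file is.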
How many consumers? It depends…
• Gensim (R. Řehůřek)
  • Saw the impact of up to 4 consumers earlier.
• Vowpal Wabbit (J. Langford)
  • Claims no gain with more than 1 consumer!
  • 2'10'' on my MacBook Pro for ~10GB and 50MM lines (Criteo's advertising dataset).
• CNN pre-processing (S. Dieleman)
  • Big impact with ?? (several) consumers!
  • Useful for data augmentation/preprocessing.
Word-count Java benchmark: 5.3GB (~105MM lines)
[Chart: word-count time against the number of consumers, from 1 to 6.]
• MacBook Pro 15'' 2014; `sudo purge` (flush the filesystem cache) run before each measurement.
Source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
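The same word-count benchmark can be mirrored in Python with a thread pool; a sketch with illustrative names, where each consumer counts the words in one chunk of lines:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def count_words(lines):
    """Consumer work: count the words in one chunk of lines."""
    return Counter(word for line in lines for word in line.split())

def parallel_word_count(path, n_consumers=4, chunk_lines=10_000):
    """One reader streams chunks of lines; a pool of consumers counts them."""
    total = Counter()
    with open(path) as f, ThreadPoolExecutor(n_consumers) as pool:
        futures = []
        while chunk := list(islice(f, chunk_lines)):
            futures.append(pool.submit(count_words, chunk))
        for fut in futures:
            total.update(fut.result())   # reduce: merge per-chunk counts
    return total
```

Unlike the bounded queue in the producer/consumer sketch, `submit` here happily queues every chunk, so for a 5.3GB file a bounded job queue is the safer choice.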
Outline
• Performance (5-7mn)
• Parallelism (5-7mn)
• Scalability (7-10mn)
Hardware trends: HDD
[Chart: drive capacity (GB, 0 to 600) and time to read the full disk (sec, 0 to 4,000) for drives released between 1979 and 2011 — capacity has grown far faster than throughput, so reading a full disk keeps taking longer.]
Source: https://tylermuth.wordpress.com/2011/11/02
Distributed computing
Scalability - a perspective on Big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.
• Most "big data" problems are I/O bound: what is hard is solving the task in an acceptable time independently of the size of the data (weak scaling).
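The two notions can be made concrete as parallel efficiencies. The formulas are standard; the function names below are mine:

```python
def strong_scaling_efficiency(t1, tp, p):
    """Strong scaling: the SAME problem on p machines.
    Ideal time is t1 / p, so efficiency = t1 / (p * tp)."""
    return t1 / (p * tp)

def weak_scaling_efficiency(t1, tp):
    """Weak scaling: the problem GROWS with p (constant work per machine).
    Ideal time stays t1, so efficiency = t1 / tp."""
    return t1 / tp

# One machine takes 1200 s; 4 machines take 360 s on the same problem:
eff = strong_scaling_efficiency(1200, 360, 4)   # 0.833…: decent strong scaling
```

An efficiency of 1.0 is the theoretical ideal; anything above it (the "super-linear" case on the Spark slide below) signals a measurement artefact, typically caching or a saturated baseline.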
Bring computation to data
Map-Reduce: statistical query model
• f, the map function, is sent to every machine.
• The sum corresponds to a reduce operation.
References:
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS'06.
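In the statistical query model the learner only needs sums of some function f over the data, so each machine computes a partial sum over its own shard (map) and the partial sums are added together (reduce). A minimal single-machine sketch, with illustrative names:

```python
from functools import reduce

import numpy as np

def statistical_query(shards, f):
    """Map: each 'machine' computes the sum of f over its shard.
    Reduce: add the partial sums together."""
    partials = [f(shard).sum(axis=0) for shard in shards]
    return reduce(np.add, partials)

# The mean of the data expressed as a statistical query: sum(x) / n.
data = np.arange(1_000_000, dtype=float)
shards = np.array_split(data, 4)                 # pretend: 4 machines
mean = statistical_query(shards, lambda x: x) / len(data)
```

The same shape covers the quantities ML algorithms need, e.g. a linear model's full-batch gradient is the sum over shards of per-example gradients.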
Spark on Criteo's data
• Logistic regression trained with minibatch SGD.
• 10GB of data (50MM lines). Caveat: quite small for a benchmark.
• Super-linear strong scalability. Not theoretically possible => the small dataset and the few instances saturate.
[Chart: time (sec, 0 to 1,300) and number of cores (0 to 40) against the number of AWS nodes, from 4 to 10.]
• Manual setup of the cluster was a bit painful…
Software stack for big data
• Cluster manager: Local, Standalone, YARN, MESOS.
• Storage layer: HDFS, Tachyon, Cassandra, HBase, others…
• Execution layer: Spark (memory-optimised execution engine), Flink (Apache-incubated execution engine), Hadoop MR 2.
• Libraries: on Spark, MLlib, GraphX, Streaming, SQL/DataFrame; on Flink, FlinkML, Gelly (graph), Table API, Batch.
Software stack: MESOS vs YARN
• Standalone mode is fastest…
• … but resources are requested for the entire job.
Cluster management frameworks allow:
• Concurrent access (multi-user)
• Hyperparameter tuning (multi-job)
Mesos:
• Frameworks receive resource offers.
• Easy install on AWS, GCE.
• Lots of compatible frameworks: Spark, MPI, Cassandra, HDFS…
• Mesosphere's DCOS is really, really easy to use.
YARN:
• Frameworks make resource requests.
• Configuration hell (can be made easier with puppet/ansible recipes).
• Several compatible frameworks: Spark, Flink, HDFS…
Infrastructure stack
• AWS = AWeSome
• Basic instance with spot price: 10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour.
• Graphical Network Designer
BONUS: What's coming in the next few years?