Performance and scalability for machine learning.
Arnaud Rachez (arnaud.rachez@gmail.com)
November 2nd, 2015
Outline
• Performance (7mn)
• Parallelism (7mn)
• Scalability (10mn)
Numbers everyone should know
(2015 update)
Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
[Figure: throughput (GB/s, axis 0 to 800) of L1, L2, L3 cache and RAM]
[Figure: throughput (GB/s, axis 0 to 30) of RAM, SSD and network; bar labels as extracted: ~800MB, ~1.25GB, ~30GB]
Sources: http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/ and http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/
Optimising SGD
• Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.
• Using Python (1), Numba (2), Numpy (3) and Cython (4) (https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd)
• Also compared it to pure C++ code (https://gist.github.com/zermelozf/4df67d14f72f04b4338a)
[Code screenshots labelled (1) to (4) on the slide]
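A minimal sketch of what the pure-Python variant (1) might look like. This is not the code from the gist: the function names, the learning rate, the noise-free synthetic data and the smaller n used here are all assumptions made for illustration.

```python
import random


def sgd_linear_regression(X, y, lr=0.05, epochs=1, seed=0):
    """Plain-Python SGD for least-squares linear regression (no intercept)."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            xi = X[i]
            # prediction error on a single example
            err = sum(w[j] * xi[j] for j in range(d)) - y[i]
            # the gradient of 0.5 * err**2 with respect to w is err * xi
            for j in range(d):
                w[j] -= lr * err * xi[j]
    return w


def make_data(n, d, seed=0):
    """Hypothetical noise-free data: y = <w_true, x>."""
    rng = random.Random(seed)
    w_true = [float(j + 1) for j in range(d)]
    X = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(wt * xj for wt, xj in zip(w_true, x)) for x in X]
    return X, y, w_true


X, y, w_true = make_data(n=10_000, d=5)
w = sgd_linear_regression(X, y)
```

The two nested loops over `j` are the part that Numba, Cython or C++ can compile down to tight machine code, which is presumably where the differences between the variants come from.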
Runtime optimisation
Optimisation strategies (d=5 & n=1,000,000)
[Figure: bar chart of run time (ms, log-scale axis from 1 to 10,000) for Python, Numpy, Cython, Numba and C++]
Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
Chart annotations: memory views, pointers.
Runtime optimisation
Cache optimisation (d=5 & n=1,000,000)
[Figure: bar chart of run time (ms, axis 0 to 160) for Numba, C++ and Cython, with random versus linear access order]
Chart annotations: "Cache miss" and "Cache hit" (three of each).
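The two access patterns in the chart can be sketched as follows: the same SGD pass over a flat row-major buffer, once in linear order and once in shuffled order. The function name `sgd_pass`, the data and the sizes are assumptions. In pure Python the interpreter overhead hides most of the cache effect, so this only shows the structure of the experiment, not the timings.

```python
import random
from array import array


def sgd_pass(data, targets, w, order, lr=0.01):
    """One SGD pass over the rows of a flat row-major buffer, in the given order."""
    d = len(w)
    for i in order:
        base = i * d  # row i occupies data[base : base + d]
        err = -targets[i]
        for j in range(d):
            err += w[j] * data[base + j]
        for j in range(d):
            w[j] -= lr * err * data[base + j]
    return w


n, d = 20_000, 5
rng = random.Random(0)
data = array("d", (rng.uniform(-1.0, 1.0) for _ in range(n * d)))
# hypothetical targets: the sum of each row, so the true weights are all 1.0
targets = array("d", (sum(data[i * d:(i + 1) * d]) for i in range(n)))

linear_order = list(range(n))   # walks memory front to back
random_order = linear_order[:]  # same rows...
rng.shuffle(random_order)       # ...visited by jumping around in memory

w_linear = sgd_pass(data, targets, [0.0] * d, linear_order)
w_random = sgd_pass(data, targets, [0.0] * d, random_order)
```

Both orders visit exactly the same rows, so any run-time gap in a compiled implementation comes from the memory access pattern rather than from the amount of arithmetic.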
(d>>1) Gensim word2vec case study
• Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
[Figure: training speed (thousand words/sec, axis 0 to 120) for Original C, Numpy, Cython, Cython + BLAS and Cython + BLAS + sigmoid table]
Chart annotation: pointers.
What’s this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
• vectorized y = alpha*x
• replaced 3 lines of code!
• translated into a 3x speedup over Cython alone!
• please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
**On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s ACML, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.
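For reference, the BLAS level-1 routine `axpy` computes y ← alpha·x + y in one call. The sketch below writes that update as an explicit loop and, only if SciPy happens to be installed, compares it with `scipy.linalg.blas.daxpy`. The helper name `axpy_loop` is hypothetical, and this is not the gensim code, which calls BLAS from Cython.

```python
def axpy_loop(alpha, x, y):
    """Reference loop: y[i] += alpha * x[i], the update that axpy performs."""
    for i in range(len(x)):
        y[i] += alpha * x[i]
    return y


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
result = axpy_loop(0.5, x, list(y))  # work on a copy of y

try:
    # Optional cross-check against a real BLAS, only if SciPy is available.
    from scipy.linalg.blas import daxpy
    blas_result = [float(v) for v in daxpy(x, list(y), a=0.5)]
except ImportError:
    blas_result = None
```

The gain reported on the slide presumably comes from handing such loops to a tuned library, which can use SIMD instructions and cache-aware blocking that a hand-written loop does not get for free.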
Hardware trends: CPU
[Figure: CPU clock speed (MHz, axis 0 to 4,000) and number of cores (axis 0 to 4) by year, 1970 to 2015]
Source: http://www.gotw.ca/publications/concurrency-ddj.htm
(d>>1) Gensim word2vec continued
• Elman style RNN trained with SGD: 15,079×200 matrix on a 1M word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
• … and parallelised with threads!
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
[Figure: training speed (thousand words/sec, axis 0 to 400) with 1, 2, 3 and 4 threads, for Original C and Cython + BLAS + sigmoid table]
2.85x speedup
(d>>1) Hogwild! on SAG
• Fabian’s experimentation with Julia (lang).
• Running SAG in parallel, without a lock.
• Very nice speed up!!!
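The lock-free idea can be sketched in a few lines. To stay short, this uses plain SGD rather than SAG, and Python threads rather than Julia; the function name `hogwild_sgd`, the data and the hyperparameters are assumptions. In CPython the global interpreter lock prevents the arithmetic from truly running in parallel, so this shows only the structure (shared weights, no lock), not the speed-up.

```python
import random
import threading


def hogwild_sgd(X, y, n_threads=4, lr=0.05, epochs=5):
    """Lock-free parallel SGD: all threads update the shared list w in place."""
    d = len(X[0])
    w = [0.0] * d  # shared weights, deliberately not protected by a lock

    def worker(indices, seed):
        rng = random.Random(seed)
        for _ in range(epochs):
            rng.shuffle(indices)
            for i in indices:
                xi = X[i]
                err = sum(w[j] * xi[j] for j in range(d)) - y[i]
                for j in range(d):
                    w[j] -= lr * err * xi[j]  # racy read-modify-write

    # each thread gets its own shard of the example indices
    shards = [list(range(t, len(X), n_threads)) for t in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(shard, t))
               for t, shard in enumerate(shards)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w


rng = random.Random(0)
w_true = [1.0, -2.0, 3.0]  # hypothetical ground truth
X = [[rng.uniform(-1.0, 1.0) for _ in w_true] for _ in range(2000)]
y = [sum(wt * xj for wt, xj in zip(w_true, x)) for x in X]
w = hogwild_sgd(X, y)
```

Some updates may be computed from slightly stale weights or overwritten by another thread; the bet behind running without a lock is that these collisions are rare or harmless enough for the method to still converge.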
Data does not fit in memory…
Stream data from disk…
… but you cannot read in parallel…
Producer/Consumer pattern
[Diagram: thread 1 (producer) reads chunk 1, chunk 2, chunk 3, … from disk in order and queues them as job 1, job 2, job 3, …; two consumer threads take jobs from the queue and process them until each is done.]
Et cetera…
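A minimal sketch of the producer/consumer pattern using only the standard library, with word counting as a stand-in task. The function names, the chunk size and the queue bound are assumptions; here the main thread plays the role of the single producer.

```python
import io
import queue
import threading
from collections import Counter

SENTINEL = None  # tells a consumer that no more jobs will arrive


def producer(stream, jobs, chunk_lines, n_consumers):
    """Single reader: cut the stream into chunks of lines and queue them."""
    chunk = []
    for line in stream:
        chunk.append(line)
        if len(chunk) == chunk_lines:
            jobs.put(chunk)  # blocks while the queue is full
            chunk = []
    if chunk:
        jobs.put(chunk)
    for _ in range(n_consumers):
        jobs.put(SENTINEL)  # one stop signal per consumer


def consumer(jobs, results):
    """Worker: count words in each chunk until the sentinel arrives."""
    counts = Counter()
    while True:
        chunk = jobs.get()
        if chunk is SENTINEL:
            break
        for line in chunk:
            counts.update(line.split())
    results.append(counts)  # list.append is thread-safe in CPython


def word_count(stream, n_consumers=2, chunk_lines=2, max_pending=4):
    jobs = queue.Queue(maxsize=max_pending)  # bounded, so memory use stays limited
    results = []
    workers = [threading.Thread(target=consumer, args=(jobs, results))
               for _ in range(n_consumers)]
    for w in workers:
        w.start()
    producer(stream, jobs, chunk_lines, n_consumers)
    for w in workers:
        w.join()
    return sum(results, Counter())


text = io.StringIO("a b a\nb c\na\nc c c\nb\n")
total = word_count(text)
```

The bounded queue is the design choice that matters when the data does not fit in memory: the producer stalls as soon as the consumers fall behind, so only a few chunks are ever held at once.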
How many consumers?
It depends…
• Gensim (R. Řehůřek)
  • Saw the impact up to 4 consumers earlier
• Vowpal Wabbit (J. Langford)
  • Claims no gain with more than 1 consumer!
  • 2 min 10 s on my MacBook Pro for ~10GB and 50MM lines (Criteo’s advertising dataset).
• CNNs pre-processing (S. Dieleman)
  • Big impact with ?? (several) consumers!
  • Useful for data augmentation/preprocessing
Word count Java benchmark: 5.3GB (~105MM lines)
[Figure: axis 0 to 220 (unit not given in the transcript) against number of consumers, 1 to 6]
source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
• MacBook Pro 15’’ 2014
• `sudo purge`
Hardware trends: HDD
[Figure: disk capacity (GB, axis 0 to 600) and time to read the full disk (sec, axis 0 to 4,000) by year, 1979 to 2011]
Source: https://tylermuth.wordpress.com/2011/11/02
Distributed computing
Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Memory bound tasks… usually.
Most “big data” problems are I/O bound. Hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
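The two notions can be turned into efficiency figures with the usual definitions. A small sketch, with hypothetical function names and made-up timings:

```python
def strong_scaling_efficiency(t_one, t_many, n_machines):
    """Fixed problem size: the ideal run time on n machines is t_one / n."""
    return t_one / (n_machines * t_many)


def weak_scaling_efficiency(t_one, t_many):
    """Problem size grows with the machines: the ideal run time stays at t_one."""
    return t_one / t_many


# Hypothetical timings in seconds, for illustration only.
strong = strong_scaling_efficiency(t_one=100.0, t_many=30.0, n_machines=4)
weak = weak_scaling_efficiency(t_one=100.0, t_many=125.0)
```

An efficiency of 1.0 is perfect scaling; a strong-scaling value above 1.0 is what is called super-linear scaling.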
Bring computation to data
Map-Reduce: Statistical query model
[Equation on the slide: a sum over the data of a function f, annotated as follows]
f, the map function, is sent to every machine
the sum corresponds to a reduce operation
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
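A sketch of the idea on a single machine: each partition evaluates the map function f on its own examples and sums the results locally, and a final reduce adds the partial sums. The choice of f (the least-squares gradient contribution of one example), the helper names and the data are assumptions for illustration.

```python
from functools import reduce


def f(w, example):
    """Map function: gradient contribution (w.x - y) * x of one example."""
    x, y = example
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [err * xj for xj in x]


def add(u, v):
    """Reduce operation: element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def local_sum(w, partition):
    """Runs where the data lives: sum f over the examples stored there."""
    return reduce(add, (f(w, ex) for ex in partition))


# Hypothetical data, split across two "machines".
partitions = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 4.0)],
]
w = [0.0, 0.0]
partial_sums = [local_sum(w, p) for p in partitions]  # map side
gradient = reduce(add, partial_sums)                   # reduce side
```

Only `w` and the small partial sums travel over the network; the examples themselves never leave their partition, which is the point of bringing the computation to the data.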
Spark on Criteo’s data
• Logistic regression trained with minibatch SGD
• 10GB of data (50MM lines). Caveat: Quite small for a benchmark
• Super linear strong scalability. Not theoretically possible => small dataset + few instances saturate.
[Figure: training time (sec, axis 0 to 1,300) and number of cores (axis 0 to 40) against number of AWS nodes (4, 6, 8, 10)]
Manual setup of the cluster was a bit painful…
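For readers unfamiliar with the algorithm being benchmarked, here is a single-machine sketch of logistic regression trained by minibatch SGD. It is not Spark code and not the benchmark itself: the function names, hyperparameters and toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def minibatch_sgd_logreg(X, y, batch_size=10, lr=0.5, epochs=20, seed=0):
    """Logistic regression (labels 0/1, no intercept) by minibatch SGD."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            grad = [0.0] * d
            for i in batch:
                p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])))
                for j in range(d):
                    grad[j] += (p - y[i]) * X[i][j]
            # step along the average gradient of the minibatch
            for j in range(d):
                w[j] -= lr * grad[j] / len(batch)
    return w


# Hypothetical toy data: the label is 1 when x0 + x1 > 0.
rng = random.Random(1)
X = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(200)]
y = [1 if x[0] + x[1] > 0 else 0 for x in X]
w = minibatch_sgd_logreg(X, y)
accuracy = sum((sigmoid(w[0] * x[0] + w[1] * x[1]) > 0.5) == (yi == 1)
               for x, yi in zip(X, y)) / len(X)
```

In a distributed setting, the inner loop over the minibatch is the part that can be spread across machines, since the per-example gradient terms are simply summed.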
Software stack for big data
• Cluster manager: Local, Standalone, YARN, MESOS
• Storage layer: HDFS, Tachyon, Cassandra, HBase, Others…
• Execution layer: Spark (memory-optimised execution engine), Flink (Apache incubated execution engine), Hadoop MR 2
• Libraries: MLlib, GraphX, Streaming, SQL/Dataframe (Spark); FlinkML, Gelly (graph), TableAPI, Batch (Flink)
Software stack MESOS vs YARN
• Standalone mode is fastest…
• … but resources are requested for the entire job.
Cluster management frameworks
• Concurrent access (multiuser)
• Hyperparameter tuning (multijob)
Mesos
• Frameworks receive offers
• Easy install on AWS, GCE
• Lots of compatible frameworks: Spark, MPI, Cassandra, HDFS…
• Mesosphere’s DCOS is really, really easy to use.
YARN
• Frameworks make requests
• Configuration hell (can be made easier with puppet/ansible recipes)
• Several compatible frameworks: Spark, Flink, HDFS…
Infrastructure stack
• AWS = AWeSome
• Basic instance with spot price
• Graphical Network Designer
10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour
[Diagram labels: VPC; public/private subnets; security rules; bootstrap config for master/slaves; network entry point]
Source: https://aws.amazon.com/architecture/
- Questions -
“Thank you”
What’s coming in the next few years ?
BONUS

More Related Content

What's hot

Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
 
Inside the ABC's new Media Transcoding system, Metro
Inside the ABC's new Media Transcoding system, MetroInside the ABC's new Media Transcoding system, Metro
Inside the ABC's new Media Transcoding system, Metro
Daphne Chong
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
Robert Evans
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
DataWorks Summit/Hadoop Summit
 
Video Transcoding at the ABC with Microservices at GOTO Chicago
Video Transcoding at the ABC with Microservices at GOTO ChicagoVideo Transcoding at the ABC with Microservices at GOTO Chicago
Video Transcoding at the ABC with Microservices at GOTO Chicago
Daphne Chong
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
John Georgiadis
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
Sonal Raj
 
R and C++
R and C++R and C++
R and C++
Romain Francois
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
P. Taylor Goetz
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
Eugene Dvorkin
 

What's hot (10)

Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
Inside the ABC's new Media Transcoding system, Metro
Inside the ABC's new Media Transcoding system, MetroInside the ABC's new Media Transcoding system, Metro
Inside the ABC's new Media Transcoding system, Metro
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Video Transcoding at the ABC with Microservices at GOTO Chicago
Video Transcoding at the ABC with Microservices at GOTO ChicagoVideo Transcoding at the ABC with Microservices at GOTO Chicago
Video Transcoding at the ABC with Microservices at GOTO Chicago
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 
R and C++
R and C++R and C++
R and C++
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 

Viewers also liked

Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
Fabian Pedregosa
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
Fabian Pedregosa
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
Fabian Pedregosa
 
CANDDi Insights
CANDDi InsightsCANDDi Insights
CANDDi Insights
Frederic Abrard
 
Mobile commerce km
Mobile commerce kmMobile commerce km
Mobile commerce km
Kartik Mehta
 
TIBCO Loyalty Lab paris event
TIBCO Loyalty Lab paris eventTIBCO Loyalty Lab paris event
TIBCO Loyalty Lab paris event
Gerald Guigui
 
Implications of 4G Deployments (MEF for MPLS World Congress Ethernet Wholesa...
Implications of 4G Deployments (MEF for MPLS World Congress  Ethernet Wholesa...Implications of 4G Deployments (MEF for MPLS World Congress  Ethernet Wholesa...
Implications of 4G Deployments (MEF for MPLS World Congress Ethernet Wholesa...
Javier Gonzalez
 
Seerus analytics or how integrate smart data in your company
Seerus analytics or how integrate smart data in your company Seerus analytics or how integrate smart data in your company
Seerus analytics or how integrate smart data in your company
Quentin Liénart
 
Introduction to C#3
Introduction to C#3Introduction to C#3
Introduction to C#3
Christian Jaensch
 
Growth hacking - Telecom bretagne - 2015-10-21
Growth hacking - Telecom bretagne - 2015-10-21Growth hacking - Telecom bretagne - 2015-10-21
Growth hacking - Telecom bretagne - 2015-10-21
Francois Pacot
 
CogLab | Imaginove | UI#02 – BCI : Usages et enjeux pour l’innovation et la c...
CogLab | Imaginove | UI#02 – BCI : Usages et enjeux pour l’innovation et la c...CogLab | Imaginove | UI#02 – BCI : Usages et enjeux pour l’innovation et la c...
CogLab | Imaginove | UI#02 – BCI : Usages et enjeux pour l’innovation et la c...
af83
 
Elasticmeetup curiosity 20141113
Elasticmeetup curiosity 20141113Elasticmeetup curiosity 20141113
Elasticmeetup curiosity 20141113
Erwan Pigneul
 
Brand Positioning, a component of INDIGITAL BRANDING MODEL©
Brand Positioning, a component of INDIGITAL BRANDING MODEL©Brand Positioning, a component of INDIGITAL BRANDING MODEL©
Brand Positioning, a component of INDIGITAL BRANDING MODEL©
Alfredo Escobar
 
Zéphir, ERP dans le Cloud
Zéphir, ERP dans le CloudZéphir, ERP dans le Cloud
Zéphir, ERP dans le Cloud
Zéphir
 
sfPot aop
sfPot aopsfPot aop
Big on Mobile, Big on Facebook. How the European super startups did it.
Big on Mobile, Big on Facebook. How the European super startups did it. Big on Mobile, Big on Facebook. How the European super startups did it.
Big on Mobile, Big on Facebook. How the European super startups did it.
Julien Lesaicherre
 
IBM MQ Overview (IBM Message Queue)
IBM MQ Overview (IBM Message Queue)IBM MQ Overview (IBM Message Queue)
IBM MQ Overview (IBM Message Queue)
Juarez Junior
 
Indian IT industry analysis of 5 slides and company ( Infosys) analysis ( FY ...
Indian IT industry analysis of 5 slides and company ( Infosys) analysis ( FY ...Indian IT industry analysis of 5 slides and company ( Infosys) analysis ( FY ...
Indian IT industry analysis of 5 slides and company ( Infosys) analysis ( FY ...
Saurabh Mittra
 
Efficient Pagination Using MySQL
Efficient Pagination Using MySQLEfficient Pagination Using MySQL
Efficient Pagination Using MySQL
Surat Singh Bhati
 
Best Bourbons
Best BourbonsBest Bourbons
Best Bourbons
Aniruddha Ray (Ani)
 

Viewers also liked (20)

Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
CANDDi Insights
CANDDi InsightsCANDDi Insights
CANDDi Insights
 
Mobile commerce km
Mobile commerce kmMobile commerce km
Mobile commerce km
 
TIBCO Loyalty Lab paris event
TIBCO Loyalty Lab paris eventTIBCO Loyalty Lab paris event
TIBCO Loyalty Lab paris event
 
Implications of 4G Deployments (MEF for MPLS World Congress Ethernet Wholesa...
Implications of 4G Deployments (MEF for MPLS World Congress  Ethernet Wholesa...Implications of 4G Deployments (MEF for MPLS World Congress  Ethernet Wholesa...
Implications of 4G Deployments (MEF for MPLS World Congress Ethernet Wholesa...
 
Seerus analytics or how integrate smart data in your company
Seerus analytics or how integrate smart data in your company Seerus analytics or how integrate smart data in your company
Performance and scalability for machine learning

  • 1. Performance and scalability for machine learning. Arnaud Rachez (arnaud.rachez@gmail.com), November 2nd, 2015
  • 2. Outline • Performance (7mn) • Parallelism (7mn) • Scalability (10mn)
  • 3. Numbers everyone should know (2015 update). Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html [Chart 1: throughput (GB/s, 0 to 800) for L1, L2, L3 and RAM. Chart 2: throughput (GB/s, 0 to 30) for RAM, SSD and network, with the annotations ~800MB, ~1.25GB and ~30GB.] Chart sources: http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/ and http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/
  • 4. Outline • Performance (5-7mn) • Parallelism (5-7mn) • Scalability (7-10mn)
  • 5. Optimising SGD • Stochastic gradient descent on a linear-regression-like problem with d=5 features and n=1,000,000 examples. • Using Python (1), Numba (2), Numpy (3) and Cython (4) (https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd). • Also compared to pure C++ code (https://gist.github.com/zermelozf/4df67d14f72f04b4338a).
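A minimal sketch of the kind of pure-Python baseline this slide describes, assuming a noise-free linear model and a squared-error loss. The helper names (`make_data`, `sgd`), the learning rate and the reduced n=2000 are illustrative assumptions, not taken from the linked gists.

```python
import random

def make_data(n, d, seed=0):
    """Generate n examples with d features from a known linear model (noise-free)."""
    rng = random.Random(seed)
    w_true = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    X = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(wj * xj for wj, xj in zip(w_true, x)) for x in X]
    return X, y, w_true

def sgd(X, y, lr=0.05, epochs=5):
    """Plain-Python SGD on the squared error, one example per update."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        for x, target in zip(X, y):
            # Prediction error for this single example.
            err = sum(wj * xj for wj, xj in zip(w, x)) - target
            # The gradient of 0.5 * err**2 with respect to w is err * x.
            for j in range(d):
                w[j] -= lr * err * x[j]
    return w

# Hypothetical small run (the slide uses n=1,000,000).
X, y, w_true = make_data(n=2000, d=5)
w = sgd(X, y)
max_abs_error = max(abs(a - b) for a, b in zip(w, w_true))
print(max_abs_error)
```

The two nested loops are the part that Numba, Cython or C++ would compile away; in pure Python every multiplication goes through the interpreter.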
  • 13. Runtime optimisation: optimisation strategies (d=5 & n=1,000,000). [Bar chart, log scale: time (ms), 1 to 10,000, for Python, Numpy, Cython, Numba and C++. Annotations on the chart: “Memory views”, “pointers”, “pointers?”, “Memory views?”.] Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
  • 18. Runtime optimisation: cache optimisation (d=5 & n=1,000,000). [Bar chart: time (ms), 0 to 160, for Numba, C++ and Cython, comparing random and linear access order. Annotations: “Cache miss” (three times) and “Cache hit” (three times).]
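The two access orders compared on this slide can be sketched as follows, assuming the examples sit in one flat row-major array. The names and sizes are illustrative, and in pure Python the interpreter overhead will likely hide most of the cache effect that shows up in the compiled versions, so no timings are claimed here.

```python
import random
import time
from array import array

def sum_rows(data, d, order):
    """Sum every feature, visiting the rows of a flat row-major array in `order`."""
    total = 0.0
    for i in order:
        base = i * d  # row i occupies data[base : base + d]
        for j in range(d):
            total += data[base + j]
    return total

n, d = 100_000, 5
data = array("d", (float(k % 7) for k in range(n * d)))

linear_order = list(range(n))       # neighbouring rows are visited back to back
random_order = linear_order[:]      # same rows, shuffled visiting order
random.Random(0).shuffle(random_order)

for name, order in (("linear", linear_order), ("random", random_order)):
    start = time.perf_counter()
    total = sum_rows(data, d, order)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{name}: total={total:.0f}, {elapsed_ms:.1f} ms")
```

Both orders touch exactly the same data and return the same total; only the memory access pattern differs, which is what separates a cache hit from a cache miss.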
  • 26. (d>>1) Gensim word2vec case study • Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Řehůřek in Python. • Optimised by Radim Řehůřek using Cython, BLAS… Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ [Bar chart: words/sec (×1000), 0 to 120, for Original C, Numpy, Cython, Cython + BLAS, and Cython + BLAS + sigmoid table. Three bars are annotated “pointers”.]
  • 27. What’s this BLAS magic? Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx • Vectorised y = alpha*x + y (the BLAS axpy update). • Replaced 3 lines of code! • Translated into a 3x speedup over Cython alone! • Please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/ • Footnote: “On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s ACML, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.”
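For readers who have not met BLAS level-1 routines, here is a reference sketch of what an axpy call computes, assuming that is the update meant on this slide. It is a plain-Python stand-in written for illustration, not the code from `word2vec_inner.pyx`; a tuned BLAS performs the same arithmetic in one vectorised call.

```python
def axpy(alpha, x, y):
    """Reference semantics of BLAS axpy: update y in place with y[i] += alpha * x[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] += alpha * x[i]
    return y

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
axpy(0.5, x, y)
print(y)  # [10.5, 21.0, 31.5]
```

The explicit loop above is the kind of code that a single BLAS call replaces.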
  • 28. Outline • Performance (5-7mn) • Parallelism (5-7mn) • Scalability (7-10mn)
  • 29. Hardware trends: CPU. [Chart: clock speed (MHz, 0 to 4000) and number of cores (0 to 4) by year, 1970 to 2015.] Source: http://www.gotw.ca/publications/concurrency-ddj.htm
  • 37. (d>>1) Gensim word2vec continued • Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus. • Baseline written by Tomas Mikolov in optimised C. • Rewritten by Radim Řehůřek in Python. • Optimised by Radim Řehůřek using Cython, BLAS… • … and parallelised with threads! Source: http://rare-technologies.com/parallelizing-word2vec-in-python/ [Bar chart: words/sec (×1000), 0 to 400, for 1, 2, 3 and 4 threads, comparing Original C with Cython + BLAS + sigmoid table. Annotation: “2.85x speedup”.]
  • 39. (d>>1) Hogwild! on SAG • Fabian’s experiments with Julia (the language). • Running SAG in parallel, without a lock. • Very nice speed-up!
  • 49. Data does not fit in memory… Stream data from disk… … but you cannot read in parallel… Producer/Consumer pattern: thread 1 (producer) reads chunk 1, chunk 2, chunk 3, … in order and queues them as jobs; the consumer threads pick up the queued jobs and process them. [Animation frame: three jobs done, job 4 and job 5 with the consumers.] Et cetera…
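The producer/consumer pattern on this slide can be sketched with the standard library as follows. This is an illustrative sketch only: in-memory lists stand in for chunks streamed from disk, `sum` stands in for the real per-chunk work, the producer runs on the main thread, and all names are hypothetical.

```python
import queue
import threading

SENTINEL = None  # tells a consumer that no more chunks are coming

def producer(chunks, jobs, n_consumers):
    """Single reader: push chunks onto the bounded queue, then one sentinel per consumer."""
    for chunk in chunks:
        jobs.put(chunk)  # blocks while the queue is full
    for _ in range(n_consumers):
        jobs.put(SENTINEL)

def consumer(jobs, results, lock):
    """Worker: take chunks off the queue until the sentinel arrives."""
    while True:
        chunk = jobs.get()
        if chunk is SENTINEL:
            break
        partial = sum(chunk)  # stand-in for the real per-chunk work
        with lock:
            results.append(partial)

def run(chunks, n_consumers=2, max_queued=4):
    """Start the consumers, feed them from a single producer, and combine the results."""
    jobs = queue.Queue(maxsize=max_queued)
    results, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(jobs, results, lock))
               for _ in range(n_consumers)]
    for worker in workers:
        worker.start()
    producer(chunks, jobs, n_consumers)
    for worker in workers:
        worker.join()
    return sum(results)

# Six in-memory "chunks" stand in for blocks streamed from disk.
chunks = [list(range(i * 10, (i + 1) * 10)) for i in range(6)]
total = run(chunks)
print(total)
```

The bounded queue is the design choice that matters: it keeps only a few chunks in memory at a time, so the single reader cannot run far ahead of the consumers.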
  • 50. How many consumers? It depends… • Gensim (R. Řehůřek): saw the impact of up to 4 consumers earlier. • Vowpal Wabbit (J. Langford): claims no gain with more than 1 consumer! 2’10’’ on my MacBook Pro for ~10GB and 50MM lines (Criteo’s advertising dataset). • CNN pre-processing (S. Dieleman): big impact with ?? (several) consumers! Useful for data augmentation/preprocessing.
  • 51. Word count Java benchmark on 5.3GB (~105MM lines). [Chart: values from 0 to 220 against the number of consumers, 1 to 6.] Source: https://gist.github.com/nicomak/1d6561e6f71d936d3178 • MacBook Pro 15’’ 2014 • `sudo purge` (flushes the macOS disk cache)
  • 52. Outline • Performance (5-7mn) • Parallelism (5-7mn) • Scalability (7-10mn)
  • 53. Hardware trends: HDD. [Chart: capacity (GB, 0 to 600) and time to read the full disk (sec, 0 to 4,000) by year, 1979 to 2011.] Source: https://tylermuth.wordpress.com/2011/11/02
  • 58. Distributed computing: Scalability - A perspective on Big data • Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound. • Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound. • Most “big data” problems are I/O bound, which makes it hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
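The two definitions can be turned into efficiency numbers. This is a small sketch using the usual textbook formulas; the function names and the timings in the usage note are hypothetical, not measurements from the talk.

```python
def strong_scaling_efficiency(t_one, t_n, n_machines):
    """Fixed problem size: the ideal time on n machines is t_one / n."""
    speedup = t_one / t_n
    return speedup / n_machines


def weak_scaling_efficiency(t_one, t_n):
    """Problem size grows with the machines: the ideal time stays at t_one."""
    return t_one / t_n
```

With made-up timings, `strong_scaling_efficiency(100.0, 50.0, 2)` gives an efficiency of 1 (perfect strong scaling), and `weak_scaling_efficiency(100.0, 125.0)` gives 0.8. A strong-scaling efficiency above 1 would indicate super-linear scaling.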
  • 63. Bring computation to data • Map-Reduce: Statistical query model • f, the map function, is sent to every machine • The sum corresponds to a reduce operation • References: D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004. Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
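A sketch of the idea on a single machine, assuming the query is a sum of a per-example function f over the dataset. The choice of f (gradient of the squared loss), the list of in-memory partitions standing in for machines, and all function names are assumptions made for illustration.

```python
from functools import reduce

import numpy as np


def f(w, x, y):
    """Per-example statistic: gradient of the squared loss (assumed choice)."""
    return (x @ w - y) * x


def map_partition(w, X_part, y_part):
    """Runs where the data lives: sum f over the local examples."""
    return sum((f(w, x, y) for x, y in zip(X_part, y_part)), np.zeros_like(w))


def statistical_query(w, partitions):
    """Reduce: add the partial sums returned by every partition."""
    partials = [map_partition(w, X, y) for X, y in partitions]  # map step
    return reduce(np.add, partials)                              # reduce step
```

Only `w` and the small partial sums travel between machines; the examples themselves stay in place. The result equals the sum computed over the whole dataset at once, whatever the partitioning.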
  • 65. Spark on Criteo’s data • Logistic regression trained with minibatch SGD • 10GB of data (50MM lines). Caveat: quite small for a benchmark. • Super-linear strong scaling observed. Not theoretically possible: the likely explanation is a small dataset on which a few instances saturate. • [Figure: time (sec., 0 to 1300) and number of cores (0 to 40) against the number of AWS nodes (4, 6, 8, 10).] • Manual setup of the cluster was a bit painful…
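For reference, a single-machine NumPy sketch of logistic regression trained with minibatch SGD. This is not the Spark job used for the benchmark; the hyperparameters (`batch_size`, `lr`, `epochs`) and function names are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=5, seed=0):
    """Logistic regression with labels in {0, 1}, trained by minibatch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean log-loss over the minibatch.
            grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
            w -= lr * grad
    return w
```

In a distributed setting, the minibatch gradient is the part that can be computed as a sum over partitions and then reduced.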
  • 71. Software stack for big data • Cluster manager: Local, Standalone, YARN, Mesos • Storage layer: HDFS, Tachyon, Cassandra, HBase, others… • Execution layer: Spark (memory-optimised execution engine), Flink (Apache incubated execution engine), Hadoop MR 2 • Libraries: MLlib, GraphX, Streaming, SQL/DataFrame (Spark); FlinkML, Gelly (graph), TableAPI, Batch (Flink)
  • 80. Software stack: Mesos vs YARN • Standalone mode is fastest… but resources are requested for the entire job. • Cluster management frameworks allow concurrent access (multiuser) and hyperparameter tuning (multijob). • Mesos: frameworks receive offers; easy install on AWS, GCE; lots of compatible frameworks (Spark, MPI, Cassandra, HDFS…); Mesosphere’s DCOS is really, really easy to use. • YARN: frameworks make offers; configuration hell (can be made easier with puppet/ansible recipes); several compatible frameworks (Spark, Flink, HDFS…).
  • 83. Infrastructure stack • AWS = AWeSome • Basic instance with spot price: 10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour • Graphical Network Designer
  • 90. Infrastructure stack • VPC • Subnets public/private • Security rules • Bootstrap config for master/slaves • Network entry point
  • 93. BONUS: What’s coming in the next few years?