Building a Cutting-Edge Data Processing Environment on a Budget, by Gaël Varoquaux
1. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
2. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
3. Growing up as a penniless academic
I did a PhD in
quantum physics
4. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
5. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
6. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
Shaped my vision
of computing as a
means to an end
7. Growing up as a penniless academic
2011
Tenured researcher
in computer science
8. Growing up as a penniless academic
2011
Tenured researcher
in computer science
Today
Growing team with
data science
rock stars
9. 1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
12. 1 Encoding models of stimuli
Predicting neural response
⇒ a window into brain representations of stimuli
“feature engineering” a description of the world
14. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
ābrain readingā
15. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“If it’s not open and verifiable by others, it’s not
science, or engineering...” Stodden, 2010
16. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
17. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
18. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
Software development challenge
19. 1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
20. 1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations
on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions
about the results of future observations.”
Stephen Hawking, A Brief History of Time.
21. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
22. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
23. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
Our mission is to revolutionize brain data processing
on a tight budget
25. 2 The data processing workflow: agile
Interaction...
→ script...
→ module...
→ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
26. 2 From statistics to statistical learning
Paradigm shift as the
dimensionality of data
grows
# features,
not only # samples
From parameter
inference to prediction
Statistical learning is
spreading everywhere
27. 3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
28. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
29. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”
Break down projects by expertise
31. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
32. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature engineering, domain-specific cases...
Python is a programming language. Use it.
Cover 80% of the use cases in one package
33. 3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
34. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy inside out
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler / memory-profiler
scipy-lectures.github.io
Cython, not C/C++
35. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy inside out
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler / memory-profiler
scipy-lectures.github.io
Cython, not C/C++
Hierarchical clustering, PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need a sparse, growable structure for neighborhoods
skip-list in Cython!
O(log n) insert, remove, access
bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
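The heap-queue idea can be sketched in a few lines of plain Python. This toy version (illustrative only; the actual PR #2199 code is Cython with skip-lists and C++ maps) shows the cheap-minimum pop and the skipping of stale entries:

import heapq
import itertools

def toy_agglomerate(points, n_clusters):
    # squared Euclidean distance between two coordinate sequences
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    clusters = {i: [p] for i, p in enumerate(points)}
    centroids = {i: list(p) for i, p in enumerate(points)}
    heap = [(dist(p, q), i, j)
            for (i, p), (j, q) in itertools.combinations(enumerate(points), 2)]
    heapq.heapify(heap)
    next_id = len(points)
    while len(clusters) > n_clusters and heap:
        d, i, j = heapq.heappop(heap)        # cheap minimum: O(log n)
        if i not in clusters or j not in clusters:
            continue                         # stale entry: already merged
        pts = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = pts
        centroids[next_id] = [sum(c) / len(pts) for c in zip(*pts)]
        for k in clusters:                   # new distances to survivors
            if k != next_id:
                heapq.heappush(heap, (dist(centroids[next_id],
                                           centroids[k]), next_id, k))
        next_id += 1
    return list(clusters.values())

print(toy_agglomerate([(0., 0.), (0., 1.), (5., 5.), (5., 6.)], 2))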
38. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data, but with all the parameters
Objects: pipeline, merging, etc.
39. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data, but with all the parameters
Objects: pipeline, merging, etc.
configuration/run pattern (traits, pyre)
curry in functional programming (functools.partial)
Ideas from the MVC pattern
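A minimal illustration of this API with scikit-learn itself: all parameters at construction time, data only at fit time, and pipelining as an object:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.normal(size=(100, 5)), rng.randint(2, size=100)

model = Pipeline([
    ('scale', StandardScaler()),         # no data at construction time
    ('clf', LogisticRegression(C=1.0)),  # all parameters up front
])
model.fit(X, y)                          # data only enters here
print(model.predict(X[:3]))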
41. 4 Big data on small hardware
Biggish
smallish
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
42. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
43. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
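A minimal sketch of that running mean: one sample at a time, in constant memory no matter how many numbers stream by:

def running_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update, no stored history
    return mean

print(running_mean(range(1, 101)))   # 50.5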
44. 4 On-line algorithms
Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)
11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)
0.62 s
45. 4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
46. 4 On-the-fly data reduction
Dropping data
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel (sketched below)
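A sketch of this subsample-and-aggregate loop, parallelized with joblib; fit_on_fraction is a hypothetical stand-in for whatever algorithm runs on each random fraction:

import numpy as np
from joblib import Parallel, delayed

def fit_on_fraction(X, fraction=0.1, seed=0):
    # hypothetical stand-in: subsample, then "fit" (here, just a mean)
    rng = np.random.RandomState(seed)
    sample = X[rng.rand(len(X)) < fraction]
    return sample.mean(axis=0)

X = np.random.normal(size=(100000, 20))
results = Parallel(n_jobs=2)(
    delayed(fit_on_fraction)(X, seed=s) for s in range(10))
aggregated = np.mean(results, axis=0)    # aggregate across sub-samplings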
47. 4 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing, when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
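A short usage sketch: the hashing vectorizer is stateless, so there is no vocabulary to fit and independent chunks of documents can be vectorized in parallel:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 18)  # fixed output width
X = vectorizer.transform(['big data on small hardware',
                          'less data also means less work'])
print(X.shape)   # (2, 262144), sparse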
48. 4 On-the-fly data reduction
Example: randomized SVD (a random-projection method)
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
49. 4 Biggish iron
Our new box: 15 k€
48 cores
384 GB RAM
70 TB storage
(SSD cache on RAID controller)
Gets our work done faster than our 800-CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
51. 5 Parallel processing: the big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage
Scale by the relevant cache pool
52. 5 Parallel processing: joblib
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
53. 5 Parallel processing: joblib
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs → memory-friendly
Threads or processes backend
55. 5 Parallel processing: queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in the caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ back-and-forth communication
Door open to race conditions
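A usage sketch of that throttling: joblib's pre_dispatch bounds how many tasks are queued ahead of the workers, so a generator of large arguments is consumed “slowly” ('2*n_jobs' is the default value):

from joblib import Parallel, delayed

def work(i):
    return i ** 2

out = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(work)(i) for i in range(100))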
56. 5 Parallel processing: what happens where
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand Central Dispatch design: the dispatch queue has
a process of its own
Benefit: resource management in nested for loops
57. 5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
58. 5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a)
c = g(a)    # retrieves the result from the store
59. 5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a)
c = g(a)    # retrieves the result from the store
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
60. 5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a)
c = g(a)    # retrieves the result from the store
Lego bricks for out-of-core algorithms, coming soon
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir='...', func='g...', argument_hash='...')
>>> c = result.get()
61. 5 Efficient input argument hashing: joblib.hash
Compute the md5¹
of input arguments
Trade-off between features and cost
Black boxy
Robust and completely generic
62. 5 Efficient input argument hashing: joblib.hash
Compute the md5¹
of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= a state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass the data pointer to the md5 algorithm
(the “update” method)
- the rest: pickle
4. Update the md5 with the pickle
¹ md5 is in the Python standard library
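A simplified sketch of steps 1-4 (illustrative only, not joblib's actual implementation): the pickler walks the object graph, ndarray buffers feed the md5 directly, and the pickle of everything else is folded in at the end:

import hashlib
import io
import pickle
import numpy as np

def toy_hash(obj):
    md5 = hashlib.md5()
    buffer = io.BytesIO()

    class NumpyPickler(pickle.Pickler):
        # intercept ndarrays while walking the object graph
        def persistent_id(self, item):
            if isinstance(item, np.ndarray):
                md5.update(np.ascontiguousarray(item).data)  # raw buffer
                return ('ndarray', item.dtype.str, item.shape)
            return None   # everything else: pickle as usual

    NumpyPickler(buffer).dump(obj)
    md5.update(buffer.getvalue())    # fold in the pickle of "the rest"
    return md5.hexdigest()

print(toy_hash({'coef': np.ones(3), 'alpha': 0.1}))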
63. 5 Fast, disk-based, concurrent store: joblib.dump
Persisting arbitrary objects
Once again, subclass the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (clusters)
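A sketch of the atomic-operations strategy (illustrative, not joblib's code): write into a private temporary directory, then publish it with one atomic rename, and let try/except resolve races:

import os
import shutil
import tempfile
import numpy as np

def atomic_store(cachedir, key, array):
    os.makedirs(cachedir, exist_ok=True)
    tmp = tempfile.mkdtemp(dir=cachedir)             # private while writing
    np.save(os.path.join(tmp, 'output.npy'), array)
    try:
        os.rename(tmp, os.path.join(cachedir, key))  # atomic publish
    except OSError:
        shutil.rmtree(tmp)                           # a concurrent writer won

atomic_store('./store', 'result-abc123', np.zeros(10))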
64. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
65. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of the raw buffer
+ meta-information (strides, class...)
66. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of the raw buffer
+ meta-information (strides, class...)
Single-file dump coming soon
File opening is slow on clusters
Challenge: streaming the above to limit memory usage
67. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of the raw buffer
+ meta-information (strides, class...)
Single-file dump coming soon
File opening is slow on clusters
Challenge: streaming the above to limit memory usage
What matters on large systems
Number of bytes stored
brings the network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file accesses
brings shared storage down
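A minimal sketch of such on-line, in-memory compression with the standard library (an illustration of the idea, not joblib's code): zlib.compressobj streams buffer-sized chunks, bypassing the file-oriented gzip module:

import zlib
import numpy as np

def compress_chunks(array, chunk_size=16 * 1024 ** 2):
    compressor = zlib.compressobj()
    # flat byte view of a C-contiguous buffer: no copy
    buffer = memoryview(np.ascontiguousarray(array)).cast('B')
    for start in range(0, len(buffer), chunk_size):
        yield compressor.compress(buffer[start:start + chunk_size])
    yield compressor.flush()

payload = b''.join(compress_chunks(np.zeros(10 ** 6)))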
68. 5 Benchmarking against np.save and pytables
[Figure: compression/persistence benchmarks on NeuroImaging data (MNI atlas);
y-axis scale: 1 is np.save]
69. 6 The bigger picture: building
an ecosystem
Helping your future self
70. 6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
~ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
71. 6 The economics of open source
Code maintenance is too expensive to shoulder alone
scikit-learn: ~ 300 emails/month    nipy: ~ 45 emails/month
joblib: ~ 45 emails/month    mayavi: ~ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
72. 6 The economics of open source
Code maintenance is too expensive to shoulder alone
scikit-learn: ~ 300 emails/month    nipy: ~ 45 emails/month
joblib: ~ 45 emails/month    mayavi: ~ 30 emails/month
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
73. 6 Many eyes make code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
74. 6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
75. 6 Core project contributors
[Figure: normalized number of commits since 2009-06, per individual committer.
Credit: Fernando Perez, Gist 5843625]
76. 6 The tragedy of the commons
“Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the whole
group’s long-term best interests by depleting some
common resource.”
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) are taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
77. @GaelVaroquaux
Solving problems that matter
The 80/20 rule
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ... I hope
79. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
80. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
With your users (documentation)
With your contributors
81. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
A perfect
design?