Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux


  1. 1. Building a cutting-edge data processing environment on a budget. Gaël Varoquaux. This talk is not about rocket science!
  2. 2. Building a cutting-edge data processing environment on a budget. Gaël Varoquaux. Disclaimer: this talk is as much about people and projects as it is about code and algorithms.
  3. 3. Growing up as a penniless academic: I did a PhD in quantum physics.
  4. 4. Growing up as a penniless academic: I did a PhD in quantum physics. Vacuum (leaks), electronics (shorts), lasers (misalignment). Best training ever for agile project management.
  5. 5. Growing up as a penniless academic: I did a PhD in quantum physics. Vacuum (leaks), electronics (shorts), lasers (misalignment). Computers were only one of the many moving parts: Matlab, instrument control.
  6. 6. Growing up as a penniless academic: I did a PhD in quantum physics. Vacuum (leaks), electronics (shorts), lasers (misalignment). Computers were only one of the many moving parts: Matlab, instrument control. Shaped my vision of computing as a means to an end.
  7. 7. Growing up as a penniless academic. 2011: tenured researcher in computer science.
  8. 8. Growing up as a penniless academic. 2011: tenured researcher in computer science. Today: growing team with data science rock stars.
  9. 9. 1 Using machine learning to understand brain function. Link neural activity to thoughts and cognition.
  10. 10. 1 Functional MRI. Recordings of brain activity (figure axis: t).
  11. 11. 1 Cognitive NeuroImaging. Learn a bilateral link between brain activity and cognitive function.
  12. 12. 1 Encoding models of stimuli. Predicting neural response → a window into brain representations of stimuli. "Feature engineering": a description of the world.
  13. 13. 1 Decoding brain activity: "brain reading".
  14. 14. 1 Data processing feats. Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. "Brain reading".
  15. 15. 1 Data processing feats. Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. "If it's not open and verifiable by others, it's not science, or engineering..." Stodden, 2010.
  16. 16. 1 Data processing feats. Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. Make it work, make it right, make it boring.
  17. 17. 1 Data processing feats. Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. Make it work, make it right, make it boring. http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html Code, data, ... just works™. http://nilearn.github.io
  18. 18. 1 Data processing feats. Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. Make it work, make it right, make it boring. http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html Code, data, ... just works™. http://nilearn.github.io Software development challenge.
  19. 19. 1 Data accumulation. When data processing is routine... "big data" for rich models of brain function. Accumulation of scientific knowledge and learning formal representations.
  20. 20. 1 Data accumulation. When data processing is routine... "big data" for rich models of brain function. Accumulation of scientific knowledge and learning formal representations. "A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." Stephen Hawking, A Brief History of Time.
  21. 21. 1 Petty day-to-day technicalities. Buggy code. Slow code. Lead data scientist leaves. New intern to train. I don't understand the code I wrote a year ago.
  22. 22. 1 Petty day-to-day technicalities. Buggy code. Slow code. Lead data scientist leaves. New intern to train. I don't understand the code I wrote a year ago. A lab is no different from a startup. Difficulties: recruitment, limited resources (people & hardware). Risks: bus factor, technical debt.
  23. 23. 1 Petty day-to-day technicalities. Buggy code. Slow code. Lead data scientist leaves. New intern to train. I don't understand the code I wrote a year ago. A lab is no different from a startup. Difficulties: recruitment, limited resources (people & hardware). Risks: bus factor, technical debt. Our mission is to revolutionize brain data processing on a tight budget.
  24. 24. 2 Patterns in data processing
  25. 25. 2 The data processing workflow: agile. Interaction... → script... → module... ↻ interaction again... Consolidation, progressively. Low tech and short turn-around times.
  26. 26. 2 From statistics to statistical learning. Paradigm shift as the dimensionality of data grows: # features, not only # samples. From parameter inference to prediction. Statistical learning is spreading everywhere.
  27. 27. 3 Let's just make software to solve all these problems. © Theodore W. Gray
  28. 28. 3 Design philosophy. 1. Don't solve hard problems: the original problem can be bent. 2. Easy setup, works out of the box: installing software sucks; convention over configuration. 3. Fail gracefully: robust to errors, easy to debug. 4. Quality, quality, quality: what's not excellent won't be used.
  29. 29. 3 Design philosophy. 1. Don't solve hard problems: the original problem can be bent. 2. Easy setup, works out of the box: installing software sucks; convention over configuration. 3. Fail gracefully: robust to errors, easy to debug. 4. Quality, quality, quality: what's not excellent won't be used. Not "one software to rule them all": break down projects by expertise.
  31. 31. Vision. Machine learning without learning the machinery. Black box that can be opened. Right trade-off between "just works" and versatility (think Apple vs Linux).
  32. 32. Vision. Machine learning without learning the machinery. Black box that can be opened. Right trade-off between "just works" and versatility (think Apple vs Linux). We're not going to solve all the problems for you: I don't solve hard problems (feature engineering, domain-specific cases...). Python is a programming language. Use it. Cover all the 80% use cases in one package.
  33. 33. 3 Performance in high-level programming. High-level programming is what keeps us alive and kicking.
  34. 34. 3 Performance in high-level programming. The secret sauce: optimize algorithms, not for loops. Know NumPy and SciPy perfectly: significant data should be arrays/memoryviews; avoid memory copies, rely on BLAS/LAPACK. Tools: line-profiler, memory-profiler. scipy-lectures.github.io. Cython, not C/C++.
  35. 35. 3 Performance in high-level programming. The secret sauce: optimize algorithms, not for loops. Know NumPy and SciPy perfectly: significant data should be arrays/memoryviews; avoid memory copies, rely on BLAS/LAPACK. Tools: line-profiler, memory-profiler. scipy-lectures.github.io. Cython, not C/C++. Hierarchical clustering (PR #2199): 1. Take the 2 closest clusters. 2. Merge them. 3. Update the distance matrix. ... Faster with constraints: sparse distance matrix. Keep a heap queue of distances: cheap minimum. Need a sparse growable structure for neighborhoods: skip-list in Cython! O(log n) insert, remove, access; bind C++ map[int, float] with Cython. Fast traversal, possibly in Cython, for step 3.
  37. 37. 3 Architecture of a data-manipulation toolkit. Separate data from operations, but keep an imperative-like language. [Figure: a grid of numbers representing the data, shared between bokeh, chaco, hadoop, Mayavi, CPUs]
  38. 38. 3 Architecture of a data-manipulation toolkit. Separate data from operations, but keep an imperative-like language. Object API exposes a data-processing language: fit, predict, transform, score, partial_fit. Instantiated without data but with all the parameters. Objects: pipeline, merging, etc...
  39. 39. 3 Architecture of a data-manipulation toolkit. Separate data from operations, but keep an imperative-like language. Object API exposes a data-processing language: fit, predict, transform, score, partial_fit. Instantiated without data but with all the parameters. Objects: pipeline, merging, etc... Related ideas: the configuration/run pattern (traits, pyre); currying in functional programming (functools.partial); ideas from the MVC pattern.
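The "instantiated without data but with all the parameters" idea can be illustrated with a toy object. This is only a sketch: the `Scaler` class, its `with_mean` parameter, and the `apply_scaling` helper are hypothetical names invented for illustration, not part of scikit-learn. It also shows the `functools.partial` parallel mentioned on the slide.

```python
from functools import partial


class Scaler:
    """Toy transformer (hypothetical): configuration at construction, data later."""

    def __init__(self, with_mean=True):
        # Only parameters are stored here; no data is seen yet.
        self.with_mean = with_mean

    def fit(self, X):
        # State learned from data is kept separate from the configuration.
        self.mean_ = sum(X) / len(X) if self.with_mean else 0.0
        self.scale_ = max(abs(x - self.mean_) for x in X) or 1.0
        return self

    def transform(self, X):
        return [(x - self.mean_) / self.scale_ for x in X]


# Configure first, then run on data.
scaler = Scaler(with_mean=True)
out = scaler.fit([1.0, 2.0, 3.0]).transform([2.0, 4.0])


# The same configuration/run split, written with currying.
def apply_scaling(X, mean, scale):
    return [(x - mean) / scale for x in X]


configured = partial(apply_scaling, mean=2.0, scale=1.0)
out_partial = configured([2.0, 4.0])
```

Because the object carries its parameters but no data, it can be copied, placed in a pipeline, or refitted on new data without being rebuilt.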
  40. 40. 4 Big data on small hardware
  41. 41. 4 Big data on small hardware (biggish data, smallish hardware). "Big data": petabytes..., distributed storage, computing cluster. Mere mortals: gigabytes..., Python programming, off-the-shelf computers.
  42. 42. 4 On-line algorithms. Process the data one sample at a time. Compute the mean of a gazillion numbers. Hard?
  43. 43. 4 On-line algorithms. Process the data one sample at a time. Compute the mean of a gazillion numbers. Hard? No: just do a running mean.
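A running mean can be sketched in a few lines. The function name `running_mean` is an illustrative choice, and the generator below stands in for a data source too large to load at once.

```python
def running_mean(stream):
    """Return the mean of an iterable while holding one value in memory at a time."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        # Incremental update: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
        mean += (x - mean) / n
    return mean  # 0.0 for an empty stream, by convention of this sketch


# The values are produced lazily and never stored as a list.
result = running_mean(float(i) for i in range(1, 1_000_001))
```

The update only needs the previous mean and a counter, so memory use is constant regardless of how many numbers are seen.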
  44. 44. 4 On-line algorithms. Converges to expectations. Mini-batch = bunch of observations, for vectorization. Example: K-Means clustering
      X = np.random.normal(size=(10000, 200))
      scipy.cluster.vq.kmeans(X, 10, iter=2): 11.33 s
      sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X): 0.62 s
  45. 45. 4 On-the-fly data reduction. Big data is often I/O bound. Layered memory access: CPU caches, RAM, local disks, distant storage. Less data also means less work.
  46. 46. 4 On-the-fly data reduction. Dropping data: 1. loop: take a random fraction of the data; 2. run algorithm on that fraction; 3. aggregate results across sub-samplings. Looks like bagging: bootstrap aggregation. Exploits redundancy across observations. Run the loop in parallel.
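The three-step loop above could look roughly like this. It is a sketch under assumptions: the function name, the `fraction` and `n_rounds` parameters, and the use of a plain average as the aggregation step are all illustrative, and the subsets are drawn without replacement (true bootstrap would sample with replacement).

```python
import random
import statistics


def subsample_and_aggregate(data, estimator, fraction=0.1, n_rounds=20, seed=0):
    """Run `estimator` on random fractions of `data` and average the results."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    estimates = []
    for _ in range(n_rounds):
        subset = rng.sample(data, k)         # 1. take a random fraction
        estimates.append(estimator(subset))  # 2. run the algorithm on it
    return statistics.fmean(estimates)       # 3. aggregate across sub-samplings


# Synthetic data: the estimator here is simply the mean.
data_rng = random.Random(42)
data = [data_rng.gauss(5.0, 2.0) for _ in range(10_000)]
approx = subsample_and_aggregate(data, statistics.fmean)
exact = statistics.fmean(data)
```

Each round is independent of the others, which is what makes the loop easy to run in parallel.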
  47. 47. 4 On-the-fly data reduction. Random projections (will average features): sklearn.random_projection, random linear combinations of the features. Fast clustering of features: sklearn.cluster.WardAgglomeration; on images: super-pixel strategy. Hashing when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless: can be used in parallel.
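To make the hashing idea concrete, here is a minimal sketch of a stateless hashing vectorizer. It is not the scikit-learn implementation: the function name `hash_vectorize`, the `n_features` default, and the use of md5 as the hash (chosen only because it ships with the standard library) are assumptions for illustration.

```python
import hashlib


def hash_vectorize(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-size count vector."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # The bucket depends only on the token, so no vocabulary is stored.
        index = int.from_bytes(digest[:4], "little") % n_features
        vec[index] += 1
    return vec


short = hash_vectorize("the cat sat on the mat".split())
longer = hash_vectorize("a much longer sentence about the cat and the dog".split())
```

Since no state is learned or shared, different workers can vectorize different documents independently and still agree on the feature layout. Distinct tokens may collide in the same bucket; that is the price paid for the fixed size.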
  48. 48. 4 On-the-fly data reduction. Example: randomized SVD (random projection), sklearn.utils.extmath.randomized_svd
      X = np.random.normal(size=(50000, 200))
      %timeit lapack = linalg.svd(X, full_matrices=False)    # 1 loops, best of 3: 6.09 s per loop
      %timeit arpack = splinalg.svds(X, 10)                  # 1 loops, best of 3: 2.49 s per loop
      %timeit randomized = randomized_svd(X, 10)             # 1 loops, best of 3: 303 ms per loop
      linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000      # 0.0022360679774997738
      linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000  # 0.0022121161221386925
  49. 49. 4 Biggish iron. Our new box: 15 k€, 48 cores, 384G RAM, 70T storage (SSD cache on RAID controller). Gets our work done faster than our 800 CPU cluster. It's the access patterns! "Nobody ever got fired for using Hadoop on a cluster", A. Rowstron et al., HotCDP '12.
  50. 50. 5 Avoiding the framework: joblib
  51. 51. 5 Parallel processing: big picture. Focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks. Workers compete for data access: the memory bus is a bottleneck. The right grain of parallelism: too fine → overhead; too coarse → memory shortage. Scale by the relevant cache pool.
  52. 52. 5 Parallel processing: joblib. Focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks.
      >>> from math import sqrt
      >>> from joblib import Parallel, delayed
      >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(8))
      [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
  53. 53. 5 Parallel processing: joblib. IPython, multiprocessing, celery, MPI? joblib is higher-level. No dependencies, works everywhere. Better traceback reporting. Memmapping arrays to share memory (O. Grisel). On-the-fly dispatch of jobs: memory-friendly. Threads or processes backend.
  55. 55. 5 Parallel processing: queues. Queues: high-performance, concurrent-friendly. Difficulty: callback on result arrival → multiple threads in caller + risk of deadlocks. Dispatch queue should fill up "slowly" → pre_dispatch in joblib → back-and-forth communication. Door open to race conditions.
  56. 56. 5 Parallel processing: what happens where. joblib design: caller, dispatch queue, and collect queue in the same process. Benefit: robustness. Grand-central dispatch design: dispatch queue has a process of its own. Benefit: resource management in nested for loops.
  57. 57. 5 Caching. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization.
  58. 58. 5 Caching: the joblib approach. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization. Memoize pattern:
      mem = joblib.Memory(cachedir='.')
      g = mem.cache(f)
      b = g(a)  # computes b using f
      c = g(a)  # retrieves results from store
  59. 59. 5 Caching: the joblib approach. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization. Memoize pattern:
      mem = joblib.Memory(cachedir='.')
      g = mem.cache(f)
      b = g(a)  # computes b using f
      c = g(a)  # retrieves results from store
      Challenges in the context of big data: a & b are big. Design goals: a & b arbitrary Python objects; no dependencies; drop-in, framework-less code.
  60. 60. 5 Caching: the joblib approach. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization. Memoize pattern:
      mem = joblib.Memory(cachedir='.')
      g = mem.cache(f)
      b = g(a)  # computes b using f
      c = g(a)  # retrieves results from store
      Lego bricks for out-of-core algorithms (coming soon):
      >>> result = g.call_and_shelve(a)
      >>> result
      MemorizedResult(cachedir="...", func="g...", argument_hash="...")
      >>> c = result.get()
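The memoize-to-disk pattern can be approximated with the standard library alone. This is a deliberately simplified sketch, not how joblib.Memory is implemented: the `disk_cache` decorator, the one-pickle-file-per-call layout, and the key built from the function name plus pickled arguments are all assumptions made for illustration.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cache(cachedir):
    """Return a decorator that stores results on disk, keyed by the arguments."""
    def decorator(func):
        def wrapper(*args):
            # Assumption: arguments are picklable and pickle deterministically.
            key = hashlib.md5(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(cachedir, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as fh:
                    return pickle.load(fh)   # retrieve the stored result
            result = func(*args)             # first call: actually compute
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator


calls = []  # records how many times f really runs


def f(a):
    calls.append(a)
    return [x * x for x in a]


cachedir = tempfile.mkdtemp()
g = disk_cache(cachedir)(f)
b = g([1, 2, 3])  # computes b using f
c = g([1, 2, 3])  # retrieves the result from the store
```

Unlike an in-memory memoizer, the cache survives across interpreter sessions, which is what lets it replace manually chained scripts.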
  61. 61. 5 Efficient input argument hashing: joblib.hash. Compute md5* of input arguments. Trade-off between features and cost. Black boxy. Robust and completely generic.
  62. 62. 5 Efficient input argument hashing: joblib.hash. Compute md5* of input arguments. Implementation: 1. Create an md5 hash object. 2. Subclass the standard-library pickler = state machine that walks the object graph. 3. Walk the object graph: ndarrays: pass data pointer to md5 algorithm ("update" method); the rest: pickle. 4. Update the md5 with the pickle. (* md5 is in the Python standard library.)
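The four implementation steps can be sketched with the standard library. This is not joblib's code: the `Hasher` and `_HashWriter` names are illustrative, `array.array` stands in for NumPy arrays so the example has no dependencies, and the documented `persistent_id` hook is used here as one convenient way to intercept buffers while the pickler walks the object graph.

```python
import array
import hashlib
import pickle


class _HashWriter:
    """File-like object that feeds everything written to it into an md5."""

    def __init__(self, md5):
        self._md5 = md5

    def write(self, data):
        self._md5.update(data)
        return len(data)


class Hasher(pickle.Pickler):
    """Pickler subclass whose output stream is an md5 hash (steps 1 and 2)."""

    def __init__(self):
        self._md5 = hashlib.md5()
        super().__init__(_HashWriter(self._md5), protocol=4)

    def persistent_id(self, obj):
        # Step 3: large buffers are hashed in place, without pickling a copy.
        if isinstance(obj, array.array):
            self._md5.update(memoryview(obj).cast("B"))
            return ("array", obj.typecode, len(obj))
        return None  # everything else is pickled as usual (step 4)

    def hash(self, obj):
        self.dump(obj)
        return self._md5.hexdigest()


h_dict_1 = Hasher().hash({"a": [1, 2, 3]})
h_dict_2 = Hasher().hash({"a": [1, 2, 3]})
h_arr_1 = Hasher().hash(array.array("d", [1.0, 2.0]))
h_arr_2 = Hasher().hash(array.array("d", [1.0, 3.0]))
```

The point of the design is that the pickler does the generic traversal, so arbitrary Python objects are supported, while the expensive part (raw numerical data) bypasses serialization.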
  63. 63. 5 Fast, disk-based, concurrent store: joblib.dump. Persisting arbitrary objects: once again subclass the pickler. Use .npy for large numpy arrays (np.save), pickle for the rest → multiple files. Store concurrency issues. Strategy: atomic operations + try/except. Renaming a directory is atomic. Directory layout consistent with remove operations. Good performance, usable on shared disks (cluster).
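The "atomic operations + try/except" strategy can be illustrated as follows. This is a sketch under assumptions: the `store_result` and `load_result` helpers and the `output.pkl` file name are hypothetical, and the atomicity of `os.rename` on directories is relied on as it behaves on local POSIX file systems (worth verifying on a given shared or network file system).

```python
import os
import pickle
import shutil
import tempfile


def store_result(cache_root, key, value):
    """Write `value` into a private directory, then publish it with one rename."""
    final_dir = os.path.join(cache_root, key)
    tmp_dir = tempfile.mkdtemp(dir=cache_root, prefix=key + ".tmp-")
    with open(os.path.join(tmp_dir, "output.pkl"), "wb") as fh:
        pickle.dump(value, fh)
    try:
        # Readers see either no entry or a complete one, never a partial write.
        os.rename(tmp_dir, final_dir)
    except OSError:
        # Another worker published the same key first: drop our copy.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return final_dir


def load_result(cache_root, key):
    with open(os.path.join(cache_root, key, "output.pkl"), "rb") as fh:
        return pickle.load(fh)


root = tempfile.mkdtemp()
store_result(root, "abc", [1, 2])
store_result(root, "abc", [9, 9])  # loses the race; the first entry is kept
loaded = load_result(root, "abc")
```

No lock is held at any point, which is what keeps the store usable by many workers at once.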
  64. 64. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass the gzip module to work online + in-memory).
  65. 65. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass the gzip module to work online + in-memory). Avoiding copies: zlib.compress needs C-contiguous buffers. Copyless storage of raw buffer + meta-information (strides, class...).
  66. 66. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass the gzip module to work online + in-memory). Avoiding copies: zlib.compress needs C-contiguous buffers. Copyless storage of raw buffer + meta-information (strides, class...). Single file dump (coming soon): file opening is slow on cluster. Challenge: streaming the above for memory usage.
  67. 67. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass the gzip module to work online + in-memory). Avoiding copies: zlib.compress needs C-contiguous buffers. Copyless storage of raw buffer + meta-information (strides, class...). Single file dump (coming soon): file opening is slow on cluster. Challenge: streaming the above for memory usage. What matters on large systems: number of bytes stored brings the network/SATA bus down; memory usage brings compute nodes down; number of atomic file accesses brings shared storage down.
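Compressing a raw buffer chunk by chunk, without first copying it into a bytes object, might look like this. It is a sketch: `compress_buffer`, the chunk size, and the compression level are illustrative choices, and `array.array` stands in for a C-contiguous NumPy array.

```python
import array
import zlib


def compress_buffer(buf, chunk_size=1 << 16, level=3):
    """Compress a contiguous buffer in chunks, reading it through a memoryview."""
    view = memoryview(buf).cast("B")  # byte-level view, no copy of the data
    comp = zlib.compressobj(level)
    parts = []
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview does not copy either.
        parts.append(comp.compress(view[start:start + chunk_size]))
    parts.append(comp.flush())
    return b"".join(parts)


data = array.array("d", [float(i % 100) for i in range(100_000)])
blob = compress_buffer(data)

# Round trip: the raw bytes are enough to rebuild the array once the
# meta-information (here, the type code "d") is known.
restored = array.array("d")
restored.frombytes(zlib.decompress(blob))
```

In a real store, each compressed chunk would be written to the file as it is produced rather than joined in memory, which is the streaming challenge the slide refers to.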
  68. 68. 5 Benchmarking against np.save and pytables. [Figure: y-axis scale: 1 is np.save. NeuroImaging data (MNI atlas)]
  69. 69. 6 The bigger picture: building an ecosystem. Helping your future self.
  70. 70. 6 Community-based development in scikit-learn. Huge feature set: benefits of a large team. Project growth: more than 200 contributors, ~12 core contributors, 1 full-time INRIA programmer from the start. Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn).
  71. 71. 6 The economics of open source. Code maintenance too expensive to be alone: scikit-learn ~300 emails/month, nipy ~45 emails/month, joblib ~45 emails/month, mayavi ~30 emails/month. "Hey Gael, I take it you're too busy. That's okay, I spent a day trying to install XXX and I think I'll succeed myself. Next time though please don't ignore my emails, I really don't like it. You can say, 'sorry, I have no time to help you.' Just don't ignore."
  72. 72. 6 The economics of open source. Code maintenance too expensive to be alone: scikit-learn ~300 emails/month, nipy ~45 emails/month, joblib ~45 emails/month, mayavi ~30 emails/month. Your "benefits" come from a fraction of the code. Data loading? Maybe? Standard algorithms? Nah. Share the common code... ...to avoid dying under code. Code becomes less precious with time. And somebody might contribute features.
  73. 73. 6 Many eyes make code fast. Bench WiseRF anybody? L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer.
  74. 74. 6 6 steps to a community-driven project. 1. Focus on quality. 2. Build great docs and examples. 3. Use github. 4. Limit the technicality of your codebase. 5. Releasing and packaging matter. 6. Focus on your contributors, give them credit, decision power. http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
  75. 75. 6 Core project contributors. [Figure: normalized number of commits since 2009-06; axes: number of commits, individual committer.] Credit: Fernando Perez, Gist 5843625.
  76. 76. 6 The tragedy of the commons. "Individuals, acting independently and rationally according to each one's self-interest, behave contrary to the whole group's long-term best interests by depleting some common resource." Wikipedia. Make it work, make it right, make it boring. Core projects (boring) taken for granted → hard to fund, less excitement. They need citation, in papers & on corporate web pages.
  77. 77. @GaelVaroquaux Solving problems that matter. The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code. scikit-learn, joblib, nilearn, ... I hope.
  78. 78. @GaelVaroquaux Cutting-edge ... environment ... on a budget. 1. Set the goals right: don't solve hard problems. What's your original problem?
  79. 79. @GaelVaroquaux Cutting-edge ... environment ... on a budget. 1. Set the goals right. 2. Use the simplest technological solutions possible: be very technically sophisticated; don't use that sophistication.
  80. 80. @GaelVaroquaux Cutting-edge ... environment ... on a budget. 1. Set the goals right. 2. Use the simplest technological solutions possible. 3. Don't forget the human factors: with your users (documentation), with your contributors.
  81. 81. @GaelVaroquaux Cutting-edge ... environment ... on a budget. 1. Set the goals right. 2. Use the simplest technological solutions possible. 3. Don't forget the human factors. A perfect design?
