Building a Cutting-Edge Data Processing Environment on a Budget, by Gaël Varoquaux
1. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
2. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
3. Growing up as a penniless academic
I did a PhD in
quantum physics
4. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
5. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
6. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
Shaped my vision
of computing as a
means to an end
7. Growing up as a penniless academic
2011
Tenured researcher
in computer science
8. Growing up as a penniless academic
2011
Tenured researcher
in computer science
Today
Growing team with
data science
rock stars
9. 1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
12. 1 Encoding models of stimuli
Predicting neural response
⇒ a window into brain representations of stimuli
“feature engineering” a description of the world
14. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
ābrain readingā
15. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“If it’s not open and verifiable by others, it’s not
science, or engineering...” Stodden, 2010
16. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
17. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
18. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
Software development challenge
19. 1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
20. 1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations
on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions
about the results of future observations.”
Stephen Hawking, A Brief History of Time.
21. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
22. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
23. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
Our mission is to revolutionize brain data processing
on a tight budget
25. 2 The data processing workflow: agile
Interaction...
→ script...
→ module...
→ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
26. 2 From statistics to statistical learning
Paradigm shift as the
dimensionality of data
grows
# features,
not only # samples
From parameter
inference to prediction
Statistical learning is
spreading everywhere
27. 3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
28. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
29. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”
Break down projects by expertise
31. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
32. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature engineering, domain-specific cases...
Python is a programming language. Use it.
Cover 80% of the use cases in one package
33. 3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
34. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy inside out
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler / memory-profiler
scipy-lectures.github.io
Cython, not C/C++
35. 3 Performance in high-level programming
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy inside out
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler / memory-profiler
scipy-lectures.github.io
Cython, not C/C++
Hierarchical clustering, PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need a sparse, growable structure for neighborhoods
skip-list in Cython!
O(log n) insert, remove, access
bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
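The heap-queue idea can be sketched in a few lines of plain Python. This toy version (illustrative only; the actual PR #2199 code is Cython with skip-lists and C++ maps) shows the cheap-minimum pop and the skipping of stale entries:

import heapq
import itertools

def toy_agglomerate(points, n_clusters):
    # squared Euclidean distance between two coordinate sequences
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    clusters = {i: [p] for i, p in enumerate(points)}
    centroids = {i: list(p) for i, p in enumerate(points)}
    heap = [(dist(p, q), i, j)
            for (i, p), (j, q) in itertools.combinations(enumerate(points), 2)]
    heapq.heapify(heap)
    next_id = len(points)
    while len(clusters) > n_clusters and heap:
        d, i, j = heapq.heappop(heap)        # cheap minimum: O(log n)
        if i not in clusters or j not in clusters:
            continue                         # stale entry: already merged
        pts = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = pts
        centroids[next_id] = [sum(c) / len(pts) for c in zip(*pts)]
        for k in clusters:                   # new distances to survivors
            if k != next_id:
                heapq.heappush(heap, (dist(centroids[next_id],
                                           centroids[k]), next_id, k))
        next_id += 1
    return list(clusters.values())

print(toy_agglomerate([(0., 0.), (0., 1.), (5., 5.), (5., 6.)], 2))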
38. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data, but with all the parameters
Objects: pipeline, merging, etc.
39. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data, but with all the parameters
Objects: pipeline, merging, etc.
configuration/run pattern (traits, pyre)
curry in functional programming (functools.partial)
Ideas from the MVC pattern
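A minimal illustration of this API with scikit-learn itself: all parameters at construction time, data only at fit time, and pipelining as an object:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.normal(size=(100, 5)), rng.randint(2, size=100)

model = Pipeline([
    ('scale', StandardScaler()),         # no data at construction time
    ('clf', LogisticRegression(C=1.0)),  # all parameters up front
])
model.fit(X, y)                          # data only enters here
print(model.predict(X[:3]))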
41. 4 Big data on small hardware
Biggish
smallish
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
42. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
43. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
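A minimal sketch of that running mean: one sample at a time, in constant memory no matter how many numbers stream by:

def running_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update, no stored history
    return mean

print(running_mean(range(1, 101)))   # 50.5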
44. 4 On-line algorithms
Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)
11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)
0.62 s
45. 4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
46. 4 On-the-fly data reduction
Dropping data
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel (sketched below)
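A sketch of this subsample-and-aggregate loop, parallelized with joblib; fit_on_fraction is a hypothetical stand-in for whatever algorithm runs on each random fraction:

import numpy as np
from joblib import Parallel, delayed

def fit_on_fraction(X, fraction=0.1, seed=0):
    # hypothetical stand-in: subsample, then "fit" (here, just a mean)
    rng = np.random.RandomState(seed)
    sample = X[rng.rand(len(X)) < fraction]
    return sample.mean(axis=0)

X = np.random.normal(size=(100000, 20))
results = Parallel(n_jobs=2)(
    delayed(fit_on_fraction)(X, seed=s) for s in range(10))
aggregated = np.mean(results, axis=0)    # aggregate across sub-samplings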
47. 4 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing, when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
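A short usage sketch: the hashing vectorizer is stateless, so there is no vocabulary to fit and independent chunks of documents can be vectorized in parallel:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 18)  # fixed output width
X = vectorizer.transform(['big data on small hardware',
                          'less data also means less work'])
print(X.shape)   # (2, 262144), sparse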
48. 4 On-the-fly data reduction
Example: randomized SVD (a random-projection method)
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
49. 4 Biggish iron
Our new box: 15 k€
48 cores
384 GB RAM
70 TB storage
(SSD cache on RAID controller)
Gets our work done faster than our 800-CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
51. 5 Parallel processing: the big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage
Scale by the relevant cache pool
52. 5 Parallel processing: joblib
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
53. 5 Parallel processing: joblib
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs → memory-friendly
Threads or processes backend
55. 5 Parallel processing: queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in the caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ back-and-forth communication
Door open to race conditions
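A usage sketch of that throttling: joblib's pre_dispatch bounds how many tasks are queued ahead of the workers, so a generator of large arguments is consumed “slowly” ('2*n_jobs' is the default value):

from joblib import Parallel, delayed

def work(i):
    return i ** 2

out = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(work)(i) for i in range(100))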
56. 5 Parallel processing: what happens where
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand Central Dispatch design: the dispatch queue has
a process of its own
Benefit: resource management in nested for loops
57. 5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
58. 5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a)
c = g(a)    # retrieves the result from the store
59. 5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a)
c = g(a)    # retrieves the result from the store
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
60. 5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes f(a)
c = g(a)    # retrieves the result from the store
Lego bricks for out-of-core algorithms, coming soon
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir='...', func='g...', argument_hash='...')
>>> c = result.get()
61. 5 Efficient input argument hashing: joblib.hash
Compute the md5¹
of input arguments
Trade-off between features and cost
Black boxy
Robust and completely generic
62. 5 Efficient input argument hashing: joblib.hash
Compute the md5¹
of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= a state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass the data pointer to the md5 algorithm
(the “update” method)
- the rest: pickle
4. Update the md5 with the pickle
¹ md5 is in the Python standard library
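A simplified sketch of steps 1-4 (illustrative only, not joblib's actual implementation): the pickler walks the object graph, ndarray buffers feed the md5 directly, and the pickle of everything else is folded in at the end:

import hashlib
import io
import pickle
import numpy as np

def toy_hash(obj):
    md5 = hashlib.md5()
    buffer = io.BytesIO()

    class NumpyPickler(pickle.Pickler):
        # intercept ndarrays while walking the object graph
        def persistent_id(self, item):
            if isinstance(item, np.ndarray):
                md5.update(np.ascontiguousarray(item).data)  # raw buffer
                return ('ndarray', item.dtype.str, item.shape)
            return None   # everything else: pickle as usual

    NumpyPickler(buffer).dump(obj)
    md5.update(buffer.getvalue())    # fold in the pickle of "the rest"
    return md5.hexdigest()

print(toy_hash({'coef': np.ones(3), 'alpha': 0.1}))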
63. 5 Fast, disk-based, concurrent store: joblib.dump
Persisting arbitrary objects
Once again, subclass the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (clusters)
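A sketch of the atomic-operations strategy (illustrative, not joblib's code): write into a private temporary directory, then publish it with one atomic rename, and let try/except resolve races:

import os
import shutil
import tempfile
import numpy as np

def atomic_store(cachedir, key, array):
    os.makedirs(cachedir, exist_ok=True)
    tmp = tempfile.mkdtemp(dir=cachedir)             # private while writing
    np.save(os.path.join(tmp, 'output.npy'), array)
    try:
        os.rename(tmp, os.path.join(cachedir, key))  # atomic publish
    except OSError:
        shutil.rmtree(tmp)                           # a concurrent writer won

atomic_store('./store', 'result-abc123', np.zeros(10))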
64. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
65. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of the raw buffer
+ meta-information (strides, class...)
66. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of the raw buffer
+ meta-information (strides, class...)
Single-file dump coming soon
File opening is slow on clusters
Challenge: streaming the above to limit memory usage
67. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of the raw buffer
+ meta-information (strides, class...)
Single-file dump coming soon
File opening is slow on clusters
Challenge: streaming the above to limit memory usage
What matters on large systems
Number of bytes stored
brings the network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file accesses
brings shared storage down
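A minimal sketch of such on-line, in-memory compression with the standard library (an illustration of the idea, not joblib's code): zlib.compressobj streams buffer-sized chunks, bypassing the file-oriented gzip module:

import zlib
import numpy as np

def compress_chunks(array, chunk_size=16 * 1024 ** 2):
    compressor = zlib.compressobj()
    # flat byte view of a C-contiguous buffer: no copy
    buffer = memoryview(np.ascontiguousarray(array)).cast('B')
    for start in range(0, len(buffer), chunk_size):
        yield compressor.compress(buffer[start:start + chunk_size])
    yield compressor.flush()

payload = b''.join(compress_chunks(np.zeros(10 ** 6)))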
68. 5 Benchmarking against np.save and pytables
[Figure: compression/persistence benchmarks on NeuroImaging data (MNI atlas);
y-axis scale: 1 is np.save]
69. 6 The bigger picture: building
an ecosystem
Helping your future self
70. 6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
~ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
71. 6 The economics of open source
Code maintenance is too expensive to shoulder alone
scikit-learn: ~ 300 emails/month    nipy: ~ 45 emails/month
joblib: ~ 45 emails/month    mayavi: ~ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
72. 6 The economics of open source
Code maintenance is too expensive to shoulder alone
scikit-learn: ~ 300 emails/month    nipy: ~ 45 emails/month
joblib: ~ 45 emails/month    mayavi: ~ 30 emails/month
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
73. 6 Many eyes make code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
74. 6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
75. 6 Core project contributors
[Figure: normalized number of commits since 2009-06, per individual committer.
Credit: Fernando Perez, Gist 5843625]
76. 6 The tragedy of the commons
“Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the whole
group’s long-term best interests by depleting some
common resource.”
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) are taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
77. @GaelVaroquaux
Solving problems that matter
The 80/20 rule
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ... I hope
79. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
80. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
With your users (documentation)
With your contributors
81. @GaelVaroquaux
Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
A perfect
design?