Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
Computers were only one
of the many moving parts
Matlab
Instrument control
Shaped my vision
of computing as a
means to an end
Growing up as a penniless academic
2011
Tenured researcher
in computer science
Today
Growing team with
data science
rock stars
1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
G Varoquaux 6
1 Functional MRI
Recordings of brain activity
1 Cognitive NeuroImaging
Learn a bilateral link between brain activity
and cognitive function
1 Encoding models of stimuli
Predicting neural response
⇒ a window into brain representations of stimuli
“feature engineering” a description of the world
1 Decoding brain activity
“brain reading”
1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“brain reading”
“if it’s not open and verifiable by others, it’s not
science, or engineering...” Stodden, 2010
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
Software development challenge
1 Data accumulation
When data processing is routine... “big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations
on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions
about the results of future observations.”
Stephen Hawking, A Brief History of Time.
1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
A lab is no different from a startup
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical debt
Our mission is to revolutionize brain data processing
on a tight budget
2 Patterns in data processing
2 The data processing workflow: agile
Interaction...
→ script...
→ module...
↻ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
2 From statistics to statistical learning
Paradigm shift as the
dimensionality of data
grows
# features,
not only # samples
From parameter
inference to prediction
Statistical learning is
spreading everywhere
3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”
Break down projects by expertise
Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover all the 80% use cases in one package
3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy perfectly
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler/memory-profiler
scipy-lectures.github.io
Cython, not C/C++
Hierarchical clustering PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need sparse growable structure for neighborhoods
skip-list in Cython!
O(log n) insert, remove, access
bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
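The heap-queue idea for constrained hierarchical clustering can be sketched in pure Python. This is only an illustration under simplifying assumptions, not the code of the scikit-learn pull request: samples are 1-D, only neighbours on the line may merge (the connectivity constraint), and the cluster distance is the gap between centroids. The function name `agglomerate` is hypothetical.

```python
import heapq


def agglomerate(values, n_clusters):
    """Merge neighbouring 1-D samples until n_clusters remain."""
    n = len(values)
    sums = {i: float(v) for i, v in enumerate(values)}
    counts = {i: 1 for i in range(n)}
    members = {i: [i] for i in range(n)}
    # Sparse neighbourhood structure: each cluster only knows its 2 neighbours
    left = {i: (i - 1 if i > 0 else None) for i in range(n)}
    right = {i: (i + 1 if i < n - 1 else None) for i in range(n)}

    def gap(a, b):
        return abs(sums[a] / counts[a] - sums[b] / counts[b])

    # Heap queue of candidate merges: the minimum is cheap to retrieve
    heap = [(gap(i, i + 1), i, i + 1) for i in range(n - 1)]
    heapq.heapify(heap)
    next_id = n
    while len(sums) > n_clusters and heap:
        _, a, b = heapq.heappop(heap)
        if a not in sums or b not in sums:
            continue  # stale entry: one of the clusters was already merged
        new = next_id
        next_id += 1
        sums[new] = sums.pop(a) + sums.pop(b)
        counts[new] = counts.pop(a) + counts.pop(b)
        members[new] = members.pop(a) + members.pop(b)
        outer_left, outer_right = left.pop(a), right.pop(b)
        left.pop(b)
        right.pop(a)
        left[new], right[new] = outer_left, outer_right
        # Step 3: only the distances to the two neighbours need updating
        if outer_left is not None:
            right[outer_left] = new
            heapq.heappush(heap, (gap(outer_left, new), outer_left, new))
        if outer_right is not None:
            left[outer_right] = new
            heapq.heappush(heap, (gap(new, outer_right), new, outer_right))
    return sorted(members.values())


clusters = agglomerate([0.0, 0.1, 0.2, 5.0, 5.1, 9.0], n_clusters=3)
```

Merged clusters get fresh identifiers, so outdated heap entries are simply skipped when popped rather than removed from the middle of the heap.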
3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
bokeh, chaco, hadoop, Mayavi, CPUs
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects pipeline, merging, etc...
configuration/run pattern: traits, pyre
curry in functional programming: functools.partial
Ideas from MVC pattern
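A toy sketch of this object API, in plain Python. The classes `Center`, `MidpointClassifier` and `Pipeline` are hypothetical stand-ins written for this illustration; they are not scikit-learn classes, only objects that follow the same convention: parameters at construction, data only in fit/transform/predict.

```python
from functools import partial


class Center:
    """Transformer: configured without data, learns the mean in fit."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X) if self.with_mean else 0.0
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class MidpointClassifier:
    """Predictor: thresholds at the midpoint between the two class means.

    Assumes labels 0 and positive_label, with larger values for the latter.
    """

    def __init__(self, positive_label=1):
        self.positive_label = positive_label

    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t == self.positive_label]
        neg = [x for x, t in zip(X, y) if t != self.positive_label]
        self.threshold_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [self.positive_label if x > self.threshold_ else 0 for x in X]

    def score(self, X, y):
        return sum(p == t for p, t in zip(self.predict(X), y)) / len(y)


class Pipeline:
    """Chains transformers and a final predictor behind the same API."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


# Configuration is separate from the run: partial fixes parameters, no data yet
configured = partial(MidpointClassifier, positive_label=1)
model = Pipeline([Center(), configured()])
model.fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
predictions = model.predict([0.0, 10.0])
```

Because every object exposes the same few verbs, composition (the pipeline) needs no knowledge of what each step computes.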
4 Big data on small hardware
Biggish
smallish
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
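A minimal sketch of the running mean, assuming the numbers arrive as any Python iterable (the function name `running_mean` is chosen for this example). Only the current mean and a counter are kept in memory, whatever the length of the stream.

```python
def running_mean(stream):
    """Mean of an iterable, seeing one sample at a time."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        # Incremental update: move the mean towards x by 1/n of the gap
        mean += (x - mean) / n
    return mean


# A generator is consumed lazily: the data is never held in memory at once
result = running_mean(float(i) for i in range(1, 101))
```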
Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)  # 11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)  # 0.62 s
4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
Dropping data
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel
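The three steps can be sketched with the standard library only. The helper name `subsample_estimate` and its parameters are assumptions made for this example. Unlike the bootstrap, it draws each fraction without replacement.

```python
import random
import statistics


def subsample_estimate(data, estimator, fraction=0.1, n_rounds=20, seed=0):
    """Run estimator on random fractions of data and average the results."""
    rng = random.Random(seed)
    size = max(1, int(fraction * len(data)))
    estimates = []
    for _ in range(n_rounds):
        subset = rng.sample(data, size)      # 1: random fraction of the data
        estimates.append(estimator(subset))  # 2: run the algorithm on it
    return statistics.mean(estimates)        # 3: aggregate across rounds


data = [float(i % 100) for i in range(10_000)]
approx_median = subsample_estimate(data, statistics.median)
```

Each round is independent of the others, which is what makes the loop easy to run in parallel.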
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration (FeatureAgglomeration in later scikit-learn releases)
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
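To see why hashing is stateless, here is a standard-library sketch of the hashing trick. It is not the HashingVectorizer implementation: the function name `hash_features`, the md5 hash and the vector size are choices made for this illustration.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-size count vector."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The column depends only on the token: no vocabulary to learn or share
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector


short = hash_features("the cat sat".split())
longer = hash_features("the cat sat on the mat with another cat".split())
```

Documents of any length end up with the same number of features, and two workers hashing different documents never need to exchange state. The price is that distinct tokens may collide in the same column.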
Example: randomized SVD (random projection)
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
4 Biggish iron
Our new box: 15 k€
48 cores
384 GB RAM
70 TB storage
(SSD cache on RAID controller)
Gets our work done faster than our 800 CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
5 Avoiding the framework
joblib
5 Parallel processing: big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage
Scale by the relevant cache pool
5 Parallel processing: joblib
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs – memory-friendly
Threads or processes backend
5 Parallel processing: queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ Back and forth communication
Door open to race conditions
5 Parallel processing: what happens where
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
5 Caching: the joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a) # computes b using f
c = g(a) # retrieves results from store
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
Lego bricks for out-of-core algorithms coming soon
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
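The memoize-to-disk pattern can be sketched with the standard library. The class `DiskMemory` below is a hypothetical toy written for this illustration, not joblib's implementation: it hashes the pickled arguments with md5 and stores each result as a pickle file, with none of joblib's handling of large arrays, concurrency or code changes.

```python
import hashlib
import os
import pickle
import tempfile


class DiskMemory:
    """Toy disk-backed memoizer: one pickle file per distinct call."""

    def __init__(self, cachedir):
        self.cachedir = cachedir
        os.makedirs(cachedir, exist_ok=True)

    def cache(self, func):
        def wrapper(*args, **kwargs):
            # Key on the function name and the pickled input arguments
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.md5(payload).hexdigest()
            path = os.path.join(self.cachedir, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as stored:
                    return pickle.load(stored)
            result = func(*args, **kwargs)
            with open(path, "wb") as stored:
                pickle.dump(result, stored)
            return result
        return wrapper


calls = []


def f(a):
    calls.append(a)  # records every real computation
    return a * a


with tempfile.TemporaryDirectory() as cachedir:
    g = DiskMemory(cachedir).cache(f)
    b = g(3)  # computes b using f
    c = g(3)  # read back from disk: f is not called again
```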
5 Efficient input argument hashing – joblib.hash
Compute md5* of input arguments
Trade-off between features and cost
Black boxy
Robust and completely generic
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass data pointer to md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
* md5 is in the Python standard library
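A standard-library sketch of these four steps, under stated assumptions: `array.array` stands in for NumPy ndarrays, and the pickler's `persistent_id` hook is used to intercept them. The names `Hasher`, `_Md5Writer` and `md5_hash` are made up for this example, and joblib's own code differs in its details.

```python
import array
import hashlib
import pickle


class _Md5Writer:
    """File-like object: everything the pickler writes goes into the md5."""

    def __init__(self):
        self.md5 = hashlib.md5()            # step 1: create an md5 hash object

    def write(self, data):
        self.md5.update(data)               # step 4: update md5 with the pickle


class Hasher(pickle.Pickler):               # step 2: subclass the pickler
    def __init__(self):
        self._sink = _Md5Writer()
        super().__init__(self._sink, protocol=4)

    def persistent_id(self, obj):           # step 3: called while walking the graph
        if isinstance(obj, array.array):
            # Large buffers: hash the raw memory directly, without a copy
            self._sink.md5.update(memoryview(obj).cast("B"))
            return ("array", obj.typecode, len(obj))
        return None                         # the rest: pickled as usual

    def hash(self, obj):
        self.dump(obj)
        return self._sink.md5.hexdigest()


def md5_hash(obj):
    return Hasher().hash(obj)


first = md5_hash({"a": array.array("d", [1.0, 2.0]), "n": 3})
same = md5_hash({"a": array.array("d", [1.0, 2.0]), "n": 3})
other = md5_hash({"a": array.array("d", [1.0, 3.0]), "n": 3})
```

The pickler does the generic traversal of arbitrary Python objects, so only the big-buffer case needs special code.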
5 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
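The "atomic operations + try/except" strategy can be illustrated at the level of a single file. This is a sketch with hypothetical helper names (`atomic_dump`, `load_or_none`), not joblib's store: it writes to a temporary file and renames it into place with `os.replace`, so a concurrent reader sees either nothing or a complete file.

```python
import os
import pickle
import tempfile


def atomic_dump(obj, path):
    """Write obj to path so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, hence on the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            pickle.dump(obj, tmp_file)
        os.replace(tmp_path, path)  # the rename is the atomic step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_or_none(path):
    """Try to read; a missing entry is a normal outcome, not an error."""
    try:
        with open(path, "rb") as stored:
            return pickle.load(stored)
    except FileNotFoundError:
        return None
```

Reading with try/except rather than checking for existence first avoids a race between the check and the read when several processes share the store.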
5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
Single file dump coming soon
File opening is slow on cluster
Challenge: streaming the above for memory usage
What matters on large systems
Number of bytes stored
brings network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file accesses
brings shared storage down
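A sketch of streaming compression with the standard-library zlib module, assuming the data is available as a sequence of buffers. The function name `compress_chunks`, the chunk size and the compression level are choices made for this example. Slices of a memoryview are passed to the compressor, so the raw buffer is not copied before compression.

```python
import zlib


def compress_chunks(chunks, level=3):
    """Compress an iterable of buffers, yielding compressed pieces as they come."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:
            yield piece
    yield compressor.flush()


raw = bytes(range(256)) * 1000
view = memoryview(raw)
# Slicing a memoryview gives windows on the same memory, not copies
windows = (view[start:start + 4096] for start in range(0, len(raw), 4096))
packed = b"".join(compress_chunks(windows))
restored = zlib.decompress(packed)
```

In a real store the compressed pieces would be written to the file as they are produced, which keeps memory usage bounded by the chunk size.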
5 Benchmarking against np.save and pytables
y-axis scale: 1 is np.save
NeuroImaging data (MNI atlas)
6 The bigger picture: building
an ecosystem
Helping your future self
6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
∼ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
6 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ∼ 300 emails/month
nipy ∼ 45 emails/month
joblib ∼ 45 emails/month
mayavi ∼ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
6 Many eyes make code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
6 Core project contributors
Normalized number of commits
since 2009-06
[Figure: number of commits per individual committer]
Credit: Fernando Perez, Gist 5843625
6 The tragedy of the commons
Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
@GaelVaroquaux
Solving problems that matter
The 80/20 rule
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ... I hope
Cutting-edge ... environment ... on a budget
1 Set the goals right
Don’t solve hard problems
What’s your original problem?
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
3 Don’t forget the human factors
With your users (documentation)
With your contributors
A perfect design?
More Related Content

What's hot

Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through ExamplesSri Ambati
Ā 
Squeezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile PhonesSqueezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile PhonesAnirudh Koul
Ā 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in Rmikaelhuss
Ā 
Deep machine learning by Mario Cho
Deep machine learning by Mario ChoDeep machine learning by Mario Cho
Deep machine learning by Mario ChoMario Cho
Ā 
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...Greg Makowski
Ā 
Building Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorchBuilding Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorchgeetachauhan
Ā 
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...Andrew Gardner
Ā 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIAI Frontiers
Ā 
Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø°
Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø° Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø°
Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø° Mario Cho
Ā 
Simple big data, in Python
Simple big data, in PythonSimple big data, in Python
Simple big data, in PythonGael Varoquaux
Ā 
MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...Sri Ambati
Ā 
Convolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision ApplicationsConvolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision ApplicationsAlex Conway
Ā 
Alex Tellez, Deep Learning Applications
Alex Tellez, Deep Learning ApplicationsAlex Tellez, Deep Learning Applications
Alex Tellez, Deep Learning ApplicationsSri Ambati
Ā 
The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...
The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...
The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...PĆ“le Systematic Paris-Region
Ā 
Deep Learning in the Wild with Arno Candel
Deep Learning in the Wild with Arno CandelDeep Learning in the Wild with Arno Candel
Deep Learning in the Wild with Arno CandelSri Ambati
Ā 
Open source ai_technical_trend
Open source ai_technical_trendOpen source ai_technical_trend
Open source ai_technical_trendMario Cho
Ā 
Deep Learning as a Cat/Dog Detector
Deep Learning as a Cat/Dog DetectorDeep Learning as a Cat/Dog Detector
Deep Learning as a Cat/Dog DetectorRoelof Pieters
Ā 
Deep Learning Jump Start
Deep Learning Jump StartDeep Learning Jump Start
Deep Learning Jump StartMichele Toni
Ā 
Koss 6 a17_deepmachinelearning_mariocho_r10
Koss 6 a17_deepmachinelearning_mariocho_r10Koss 6 a17_deepmachinelearning_mariocho_r10
Koss 6 a17_deepmachinelearning_mariocho_r10Mario Cho
Ā 

What's hot (20)

H20: A platform for big math
H20: A platform for big math H20: A platform for big math
H20: A platform for big math
Ā 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
Ā 
Squeezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile PhonesSqueezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile Phones
Ā 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
Ā 
Deep machine learning by Mario Cho
Deep machine learning by Mario ChoDeep machine learning by Mario Cho
Deep machine learning by Mario Cho
Ā 
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Ā 
Building Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorchBuilding Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorch
Ā 
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Ā 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
Ā 
Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø°
Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø° Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø°
Koss Lab ģ„øėÆøė‚˜ ģ˜¤ķ”ˆģ†ŒģŠ¤ ģøź³µģ§€ėŠ„(AI) ķ”„ė ˆģž„ģ›ķŒŒķ—¤ģ¹˜źø°
Ā 
Simple big data, in Python
Simple big data, in PythonSimple big data, in Python
Simple big data, in Python
Ā 
MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...
Ā 
Convolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision ApplicationsConvolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision Applications
Ā 
Alex Tellez, Deep Learning Applications
Alex Tellez, Deep Learning ApplicationsAlex Tellez, Deep Learning Applications
Alex Tellez, Deep Learning Applications
Ā 
The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...
The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...
The rod of Asclepios: Machine learning in Python for cardiac image analysis, ...
Ā 
Deep Learning in the Wild with Arno Candel
Deep Learning in the Wild with Arno CandelDeep Learning in the Wild with Arno Candel
Deep Learning in the Wild with Arno Candel
Ā 
Open source ai_technical_trend
Open source ai_technical_trendOpen source ai_technical_trend
Open source ai_technical_trend
Ā 
Deep Learning as a Cat/Dog Detector
Deep Learning as a Cat/Dog DetectorDeep Learning as a Cat/Dog Detector
Deep Learning as a Cat/Dog Detector
Ā 
Deep Learning Jump Start
Deep Learning Jump StartDeep Learning Jump Start
Deep Learning Jump Start
Ā 
Koss 6 a17_deepmachinelearning_mariocho_r10
Koss 6 a17_deepmachinelearning_mariocho_r10Koss 6 a17_deepmachinelearning_mariocho_r10
Koss 6 a17_deepmachinelearning_mariocho_r10
Ā 


Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux

  • 1. Building a cutting-edge data processing environment on a budget Gaël Varoquaux This talk is not about rocket science!
  • 2. Building a cutting-edge data processing environment on a budget Gaël Varoquaux Disclaimer: this talk is as much about people and projects as it is about code and algorithms.
  • 3. Growing up as a penniless academic I did a PhD in quantum physics
  • 4. Growing up as a penniless academic I did a PhD in quantum physics Vacuum (leaks) Electronics (shorts) Lasers (mis-alignment) Best training ever for agile project management
  • 5. Growing up as a penniless academic I did a PhD in quantum physics Vacuum (leaks) Electronics (shorts) Lasers (mis-alignment) Computers were only one of the many moving parts Matlab Instrument control
  • 6. Growing up as a penniless academic I did a PhD in quantum physics Vacuum (leaks) Electronics (shorts) Lasers (mis-alignment) Computers were only one of the many moving parts Matlab Instrument control Shaped my vision of computing as a means to an end
  • 7. Growing up as a penniless academic 2011 Tenured researcher in computer science
  • 8. Growing up as a penniless academic 2011 Tenured researcher in computer science Today Growing team with data science rock stars
  • 9. 1 Using machine learning to understand brain function Link neural activity to thoughts and cognition G Varoquaux 6
  • 10. 1 Functional MRI Recordings of brain activity
  • 11. 1 Cognitive NeuroImaging Learn a bilateral link between brain activity and cognitive function
  • 12. 1 Encoding models of stimuli Predicting neural response ⇒ a window into brain representations of stimuli “feature engineering” a description of the world
  • 13. 1 Decoding brain activity “brain reading”
  • 14. 1 Data processing feats Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)] “brain reading”
  • 15. 1 Data processing feats Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)] “if it’s not open and verifiable by others, it’s not science, or engineering...” Stodden, 2010
  • 16. 1 Data processing feats Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)] Make it work, make it right, make it boring
  • 17. 1 Data processing feats Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)] Make it work, make it right, make it boring http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html Code, data, ... just works™ http://nilearn.github.io
  • 18. 1 Data processing feats Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)] Make it work, make it right, make it boring http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html Code, data, ... just works™ http://nilearn.github.io Software development challenge
  • 19. 1 Data accumulation When data processing is routine... “big data” for rich models of brain function Accumulation of scientific knowledge and learning formal representations
  • 20. 1 Data accumulation When data processing is routine... “big data” for rich models of brain function Accumulation of scientific knowledge and learning formal representations “A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.” Stephen Hawking, A Brief History of Time.
  • 21. 1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I wrote a year ago
  • 22. 1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I wrote a year ago A lab is no different from a startup Difficulties: Recruitment, Limited resources (people & hardware) Risks: Bus factor, Technical debt
  • 23. 1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I wrote a year ago A lab is no different from a startup Difficulties: Recruitment, Limited resources (people & hardware) Risks: Bus factor, Technical debt Our mission is to revolutionize brain data processing on a tight budget
  • 24. 2 Patterns in data processing
  • 25. 2 The data processing workflow: agile Interaction... → script... → module... → interaction again... Consolidation, progressively Low tech and short turn-around times
  • 26. 2 From statistics to statistical learning Paradigm shift as the dimensionality of data grows: # features, not only # samples From parameter inference to prediction Statistical learning is spreading everywhere
  • 27. 3 Let’s just make software to solve all these problems. © Theodore W. Gray
  • 28. 3 Design philosophy 1. Don’t solve hard problems The original problem can be bent. 2. Easy setup, works out of the box Installing software sucks. Convention over configuration. 3. Fail gracefully Robust to errors. Easy to debug. 4. Quality, quality, quality What’s not excellent won’t be used.
  • 29. 3 Design philosophy 1. Don’t solve hard problems The original problem can be bent. 2. Easy setup, works out of the box Installing software sucks. Convention over configuration. 3. Fail gracefully Robust to errors. Easy to debug. 4. Quality, quality, quality What’s not excellent won’t be used. Not “one software to rule them all” Break down projects by expertise
  • 31. Vision Machine learning without learning the machinery Black box that can be opened Right trade-off between “just works” and versatility (think Apple vs Linux)
  • 32. Vision Machine learning without learning the machinery Black box that can be opened Right trade-off between “just works” and versatility (think Apple vs Linux) We’re not going to solve all the problems for you I don’t solve hard problems Feature-engineering, domain-specific cases... Python is a programming language. Use it. Cover all the 80% use cases in one package
  • 33. 3 Performance in high-level programming High-level programming is what keeps us alive and kicking
  • 34. 3 Performance in high-level programming The secret sauce Optimize algorithms, not for loops Know NumPy and SciPy perfectly - Significant data should be arrays/memoryviews - Avoid memory copies, rely on blas/lapack line-profiler/memory-profiler scipy-lectures.github.io Cython not C/C++
  • 35. 3 Performance in high-level programming The secret sauce Optimize algorithms, not for loops Know NumPy and SciPy perfectly - Significant data should be arrays/memoryviews - Avoid memory copies, rely on blas/lapack line-profiler/memory-profiler scipy-lectures.github.io Cython not C/C++ Hierarchical clustering PR #2199 1. Take the 2 closest clusters 2. Merge them 3. Update the distance matrix ... Faster with constraints: sparse distance matrix - Keep a heap queue of distances: cheap minimum - Need sparse growable structure for neighborhoods: skip-list in Cython! O(log n) insert, remove, access; bind C++ map[int, float] with Cython Fast traversal, possibly in Cython, for step 3.
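The three clustering steps above, with a heap queue for the cheap minimum and sparse neighborhoods, can be sketched in plain Python. This is only an illustration and not the code of the scikit-learn PR: the function name `constrained_agglomeration`, the scalar features, the centroid distance, and the use of Python sets in place of a Cython skip-list or C++ map are all assumptions.

```python
import heapq


def constrained_agglomeration(values, edges, n_clusters):
    """Merge scalar samples along a sparse graph until n_clusters remain."""
    n = len(values)
    sums = {i: float(v) for i, v in enumerate(values)}
    counts = {i: 1 for i in range(n)}
    members = {i: [i] for i in range(n)}
    # Sparse neighborhoods: only connected clusters may merge
    neighbors = {i: set() for i in range(n)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def dist(a, b):
        # Distance between cluster centroids (an illustrative choice)
        return abs(sums[a] / counts[a] - sums[b] / counts[b])

    # Heap queue of candidate merges: popping the minimum is cheap
    heap = [(dist(a, b), a, b) for a, b in edges]
    heapq.heapify(heap)
    next_id = n
    while len(members) > n_clusters and heap:
        # Step 1: take the 2 closest clusters
        _, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue  # stale entry: one side was already merged
        # Step 2: merge them under a fresh cluster id
        new, next_id = next_id, next_id + 1
        sums[new] = sums.pop(a) + sums.pop(b)
        counts[new] = counts.pop(a) + counts.pop(b)
        members[new] = members.pop(a) + members.pop(b)
        # Step 3: update distances, but only for the sparse neighborhood
        nbrs = (neighbors.pop(a) | neighbors.pop(b)) - {a, b}
        neighbors[new] = nbrs
        for other in nbrs:
            neighbors[other] -= {a, b}
            neighbors[other].add(new)
            heapq.heappush(heap, (dist(new, other), new, other))
    return sorted(sorted(m) for m in members.values())


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
chain = [(i, i + 1) for i in range(len(values) - 1)]
clusters = constrained_agglomeration(values, chain, n_clusters=2)
```

Stale heap entries are skipped lazily instead of being removed, which keeps every heap operation logarithmic.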
  • 37. 3 Architecture of a data-manipulation toolkit Separate data from operations, but keep an imperative-like language [Figure: arrays of numbers] bokeh, chaco, hadoop, Mayavi, CPUs
  • 38. 3 Architecture of a data-manipulation toolkit Separate data from operations, but keep an imperative-like language Object API exposes a data-processing language: fit, predict, transform, score, partial_fit Instantiated without data but with all the parameters Objects: pipeline, merging, etc...
  • 39. 3 Architecture of a data-manipulation toolkit Separate data from operations, but keep an imperative-like language Object API exposes a data-processing language: fit, predict, transform, score, partial_fit Instantiated without data but with all the parameters Objects: pipeline, merging, etc... configuration/run pattern: traits, pyre; curry in functional programming: functools.partial; Ideas from MVC pattern
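A toy illustration of this object API, using only the standard library. The classes `MeanCenterer` and `Pipeline` below are hypothetical stand-ins written for this sketch, not scikit-learn classes; they only show objects that are instantiated with parameters and no data, then driven through `fit` and `transform`.

```python
from functools import partial


class MeanCenterer:
    """Toy transformer: parameters at construction, data only in fit."""

    def __init__(self, scale=1.0):
        self.scale = scale  # configuration only, no data yet

    def fit(self, X):
        n = len(X)
        self.mean_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[self.scale * (x - m) for x, m in zip(row, self.mean_)]
                for row in X]


class Pipeline:
    """Chain transformers; the chain exposes the same fit/transform API."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        for step in self.steps:
            X = step.fit(X).transform(X)
        return self

    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)
        return X


# Configuration/run pattern: freeze the parameters now, build the object later
make_scaled = partial(MeanCenterer, scale=2.0)

X = [[1.0, 10.0], [3.0, 30.0]]
pipe = Pipeline([MeanCenterer(), make_scaled()]).fit(X)
centered = pipe.transform(X)
```

Because every object follows the same small vocabulary, composition objects such as the pipeline need no knowledge of what each step computes.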
  • 40. 4 Big data on small hardware
  • 41. 4 Big data on small hardware Biggish / smallish “Big data”: Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-shelf computers
  • 42. 4 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard?
  • 43. 4 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? No: just do a running mean
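A running mean needs only the current estimate and a counter, so the data never has to fit in memory. A minimal sketch (the function name `running_mean` is chosen here for illustration):

```python
def running_mean(stream):
    """Mean of an iterable, consuming one sample at a time."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        # Incremental update: move the estimate toward x by 1/n of the gap
        mean += (x - mean) / n
    return mean


result = running_mean(iter(range(1, 101)))
```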
  • 44. 4 On-line algorithms Converges to expectations Mini-batch = a bunch of observations, for vectorization Example: K-Means clustering X = np.random.normal(size=(10000, 200)) scipy.cluster.vq.kmeans(X, 10, iter=2): 11.33 s sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X): 0.62 s
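The mini-batch idea can be sketched with NumPy alone: each batch is assigned to centers in one vectorized step, and each center is updated as a running mean of the points it has received. This is a simplified illustration, not scikit-learn's `MiniBatchKMeans`; the function name, the defaults, and the random initialization are assumptions.

```python
import numpy as np


def minibatch_kmeans(X, n_clusters, batch_size=100, n_iter=50, seed=0):
    """Toy mini-batch k-means: centers are running means of assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centers on randomly chosen samples
    centers = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # Vectorized assignment of the whole batch to its nearest center
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in np.unique(labels):
            pts = batch[labels == k]
            counts[k] += len(pts)
            # Running-mean update: step size shrinks as the center sees data
            centers[k] += (len(pts) / counts[k]) * (pts.mean(axis=0) - centers[k])
    return centers


X = np.random.default_rng(1).normal(size=(1000, 5))
centers = minibatch_kmeans(X, n_clusters=3)
```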
  • 45. 4 On-the-fly data reduction Big data is often I/O bound Layer memory access: CPU caches, RAM, Local disks, Distant storage Less data also means less work
  • 46. 4 On-the-fly data reduction Dropping data 1. loop: take a random fraction of the data 2. run algorithm on that fraction 3. aggregate results across sub-samplings Looks like bagging: bootstrap aggregation Exploits redundancy across observations Run the loop in parallel
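The sub-sample, run, aggregate loop can be sketched as follows. The helper name `subsampled_estimate`, the default fraction, and the choice of the median as the algorithm are illustrative assumptions.

```python
import random
import statistics


def subsampled_estimate(data, estimator, fraction=0.1, n_rounds=20, seed=0):
    """Run `estimator` on random fractions of the data and average the results."""
    rng = random.Random(seed)
    size = max(1, int(fraction * len(data)))
    # Each round is independent of the others, so this loop could run in parallel
    estimates = [estimator(rng.sample(data, size)) for _ in range(n_rounds)]
    return statistics.mean(estimates)


data = list(range(10_000))
approx_median = subsampled_estimate(data, statistics.median)
```

Each round touches only a tenth of the data, and the averaging step recovers precision by exploiting redundancy across observations.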
  • 47. 4 On-the-fly data reduction Random projections (will average features): sklearn.random_projection, random linear combinations of the features Fast clustering of features: sklearn.cluster.WardAgglomeration, on images: super-pixel strategy Hashing when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer, stateless: can be used in parallel
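To see why hashing is stateless, here is a sketch of the hashing trick using only the standard library. It illustrates the idea behind a hashing vectorizer rather than scikit-learn's implementation, which differs in its details (including the hash function); `hash_features` and `n_features=16` are assumptions of this sketch.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a variable-length token list to a fixed-size count vector."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # The column depends only on the token itself: no vocabulary to fit
        idx = int.from_bytes(digest[:4], "little") % n_features
        vec[idx] += 1
    return vec


v1 = hash_features("the cat sat on the mat".split())
v2 = hash_features("a much longer sentence about the same cat".split())
```

Since no state is learned, separate workers can vectorize separate chunks of a corpus and still produce compatible columns.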
  • 48. 4 On-the-fly data reduction Example: randomized SVD Random projection sklearn.utils.extmath.randomized_svd X = np.random.normal(size=(50000, 200)) %timeit lapack = linalg.svd(X, full_matrices=False) 1 loops, best of 3: 6.09 s per loop %timeit arpack = splinalg.svds(X, 10) 1 loops, best of 3: 2.49 s per loop %timeit randomized = randomized_svd(X, 10) 1 loops, best of 3: 303 ms per loop linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000 0.0022360679774997738 linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000 0.0022121161221386925
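The principle behind a randomized SVD can be sketched in a few lines of NumPy: project the data onto a small random subspace, orthonormalize, and run an exact SVD on the reduced matrix. This is a bare-bones illustration without power iterations, not the scikit-learn routine; the function name and the oversampling default are assumptions.

```python
import numpy as np


def randomized_svd_sketch(X, n_components, n_oversamples=10, seed=0):
    """Approximate truncated SVD through a random projection of the columns."""
    rng = np.random.default_rng(seed)
    # Random linear combinations of the features capture the dominant range
    omega = rng.normal(size=(X.shape[1], n_components + n_oversamples))
    Q, _ = np.linalg.qr(X @ omega)
    # Exact SVD, but on a matrix with only a few rows
    B = Q.T @ X
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :n_components], s[:n_components], Vt[:n_components]


rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 60))  # exactly rank 8
U, s, Vt = randomized_svd_sketch(X, n_components=8)
s_exact = np.linalg.svd(X, compute_uv=False)[:8]
```

On data whose spectrum decays slowly, a few power iterations are usually added to sharpen the approximation.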
  • 49. 4 Biggish iron Our new box: 15 k€ 48 cores 384G RAM 70T storage (SSD cache on RAID controller) Gets our work done faster than our 800 CPU cluster It’s the access patterns! “Nobody ever got fired for using Hadoop on a cluster” A. Rowstron et al., HotCDP ’12
  • 50. 5 Avoiding the framework joblib
  • 51. 5 Parallel processing big picture Focus on embarrassingly parallel for loops Life is too short to worry about deadlocks Workers compete for data access Memory bus is a bottleneck The right grain of parallelism Too fine ⇒ overhead Too coarse ⇒ memory shortage Scale by the relevant cache pool
  • 52. 5 Parallel processing joblib Focus on embarrassingly parallel for loops Life is too short to worry about deadlocks >>> from math import sqrt >>> from joblib import Parallel, delayed >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
  • 53. 5 Parallel processing joblib IPython, multiprocessing, celery, MPI? joblib is higher-level No dependencies, works everywhere Better traceback reporting Memmapping arrays to share memory (O. Grisel) On-the-fly dispatch of jobs: memory-friendly Threads or processes backend
  • 55. 5 Parallel processing Queues Queues: high-performance, concurrent-friendly Difficulty: callback on result arrival ⇒ multiple threads in caller + risk of deadlocks Dispatch queue should fill up “slowly” ⇒ pre_dispatch in joblib ⇒ Back and forth communication Door open to race conditions
  • 56. 5 Parallel processing: what happens where joblib design: Caller, dispatch queue, and collect queue in same process Benefit: robustness Grand-central dispatch design: dispatch queue has a process of its own Benefit: resource management in nested for loops
  • 57. 5 Caching For reproducibility: avoid manually chained scripts (make-like usage) For performance: avoiding re-computing is the crux of optimization
  • 58. 5 Caching The joblib approach For reproducibility: avoid manually chained scripts (make-like usage) For performance: avoiding re-computing is the crux of optimization Memoize pattern mem = joblib.Memory(cachedir='.') g = mem.cache(f) b = g(a) # computes b using f c = g(a) # retrieves results from store
  • 59. 5 Caching The joblib approach For reproducibility: avoid manually chained scripts (make-like usage) For performance: avoiding re-computing is the crux of optimization Memoize pattern mem = joblib.Memory(cachedir='.') g = mem.cache(f) b = g(a) # computes b using f c = g(a) # retrieves results from store Challenges in the context of big data: a & b are big Design goals: a & b arbitrary Python objects, No dependencies, Drop-in, framework-less code
  • 60. 5 Caching The joblib approach For reproducibility: avoid manually chained scripts (make-like usage) For performance: avoiding re-computing is the crux of optimization Memoize pattern mem = joblib.Memory(cachedir='.') g = mem.cache(f) b = g(a) # computes b using f c = g(a) # retrieves results from store Lego bricks for out-of-core algorithms (coming soon) >>> result = g.call_and_shelve(a) >>> result MemorizedResult(cachedir="...", func="g...", argument_hash="...") >>> c = result.get()
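The memoize-to-disk pattern itself fits in a few lines of standard-library Python. The sketch below is not joblib's implementation: `disk_cache` is a hypothetical helper that hashes pickled arguments naively and ignores concurrency, large arrays, and changes to the function's code.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cache(func, cache_dir):
    """Wrap `func` so that results are stored on disk, keyed by the arguments."""
    def wrapper(*args):
        key = hashlib.md5(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):
            # Cache hit: load the stored result instead of recomputing
            with open(path, "rb") as fh:
                return pickle.load(fh)
        result = func(*args)
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


calls = []


def f(a):
    calls.append(a)  # records how many times the real computation runs
    return [x * x for x in a]


cache_dir = tempfile.mkdtemp()
g = disk_cache(f, cache_dir)
b = g([1, 2, 3])  # computes b using f
c = g([1, 2, 3])  # retrieves the result from the store
```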
  • 61. 5 Efficient input argument hashing – joblib.hash Compute md5* of input arguments Trade-off between features and cost Black boxy Robust and completely generic
  • 62. 5 Efficient input argument hashing – joblib.hash Compute md5* of input arguments Implementation 1. Create an md5 hash object 2. Subclass the standard-library pickler = state machine that walks the object graph 3. Walk the object graph: - ndarrays: pass data pointer to md5 algorithm (“update” method) - the rest: pickle 4. Update the md5 with the pickle * md5 is in the Python standard library
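The four implementation steps can be sketched with the standard pickler's `persistent_id` hook. This illustrates the approach and is not joblib's actual hasher: the class name `ArrayAwareHasher` is hypothetical, and the sketch assumes arrays with plain numeric dtypes.

```python
import hashlib
import io
import pickle

import numpy as np


class ArrayAwareHasher(pickle.Pickler):
    """Pickler subclass that feeds array buffers straight to md5."""

    def __init__(self):
        self._stream = io.BytesIO()
        super().__init__(self._stream, protocol=pickle.HIGHEST_PROTOCOL)
        self._md5 = hashlib.md5()  # step 1: create the hash object

    def persistent_id(self, obj):
        # Steps 2-3: called for every object met while walking the graph
        if isinstance(obj, np.ndarray):
            # Hash the raw buffer instead of pickling the array contents
            self._md5.update(np.ascontiguousarray(obj))
            return ("ndarray", str(obj.dtype), obj.shape)
        return None  # everything else is pickled as usual

    def hash(self, obj):
        self.dump(obj)
        # Step 4: update the md5 with the pickle of the rest
        self._md5.update(self._stream.getvalue())
        return self._md5.hexdigest()


def hash_object(obj):
    return ArrayAwareHasher().hash(obj)


a = {"x": np.arange(10.0), "name": "run"}
same = {"x": np.arange(10.0), "name": "run"}
changed = {"x": np.arange(10.0) + 1, "name": "run"}
```

The dtype and shape travel through the pickle stream, so two arrays with identical bytes but different shapes still hash differently.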
  • 63. 5 Fast, disk-based, concurrent, store – joblib.dump Persisting arbitrary objects Once again sub-class the pickler Use .npy for large numpy arrays (np.save), pickle for the rest ⇒ Multiple files Store concurrency issues Strategy: atomic operations + try/except Renaming a directory is atomic Directory layout consistent with remove operations Good performance, usable on shared disks (cluster)
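The "atomic operations + try/except" strategy can be sketched as: write everything into a private temporary directory, then publish it with a single rename. The helper `atomic_store` and the file name `output.bin` are assumptions of this sketch, not joblib's on-disk layout.

```python
import os
import shutil
import tempfile


def atomic_store(root, key, payload):
    """Publish `payload` under root/key; readers never see a partial entry."""
    final_dir = os.path.join(root, key)
    # Write into a private directory on the same filesystem
    tmp_dir = tempfile.mkdtemp(dir=root, prefix=".tmp-")
    with open(os.path.join(tmp_dir, "output.bin"), "wb") as fh:
        fh.write(payload)
    try:
        # Renaming a directory is atomic: the entry appears all at once
        os.rename(tmp_dir, final_dir)
    except OSError:
        # Another writer published first: keep its entry, drop ours
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return final_dir


root = tempfile.mkdtemp()
atomic_store(root, "abc123", b"first")
atomic_store(root, "abc123", b"second")  # loses the race, leaves no debris
with open(os.path.join(root, "abc123", "output.bin"), "rb") as fh:
    stored = fh.read()
```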
  • 64. 5 Making I/O fast Fast compression CPU may be faster than disk access, in particular in parallel Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory)
  • 65. 5 Making I/O fast Fast compression CPU may be faster than disk access, in particular in parallel Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory) Avoiding copies zlib.compress: C-contiguous buffers Copyless storage of raw buffer + meta-information (strides, class...)
  • 66. 5 Making I/O fast Fast compression CPU may be faster than disk access, in particular in parallel Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory) Avoiding copies zlib.compress: C-contiguous buffers Copyless storage of raw buffer + meta-information (strides, class...) Single file dump (coming soon) File opening is slow on cluster Challenge: streaming the above for memory usage
  • 67. 5 Making I/O fast Fast compression CPU may be faster than disk access, in particular in parallel Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory) Avoiding copies zlib.compress: C-contiguous buffers Copyless storage of raw buffer + meta-information (strides, class...) Single file dump (coming soon) File opening is slow on cluster Challenge: streaming the above for memory usage What matters on large systems Number of bytes stored brings network/SATA bus down Memory usage brings compute nodes down Number of atomic file accesses brings shared storage down
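Streaming compression over a buffer, without going through the gzip module, can be sketched with `zlib.compressobj`. The chunk size, compression level, and function names below are illustrative assumptions.

```python
import zlib


def compress_stream(buffer, chunk_size=1 << 16, level=3):
    """Yield compressed chunks of a contiguous buffer without copying it."""
    view = memoryview(buffer).cast("B")  # zero-copy byte view of the data
    comp = zlib.compressobj(level)
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview does not copy the underlying bytes
        yield comp.compress(view[start:start + chunk_size])
    yield comp.flush()


def decompress_stream(chunks):
    """Rebuild the original bytes from an iterable of compressed chunks."""
    decomp = zlib.decompressobj()
    out = bytearray()
    for chunk in chunks:
        out += decomp.decompress(chunk)
    out += decomp.flush()
    return bytes(out)


raw = bytes(range(256)) * 4096
compressed = list(compress_stream(raw))
restored = decompress_stream(compressed)
```

Working chunk by chunk bounds the memory used by compression, which matters when the buffer is already close to the size of RAM.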
  • 68. 5 Benchmarking against np.save and pytables [Figure: y-axis scale, 1 is np.save] NeuroImaging data (MNI atlas)
  • 69. 6 The bigger picture: building an ecosystem Helping your future self
  • 70. 6 Community-based development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors, ~12 core contributors 1 full-time INRIA programmer from the start Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn)
  • 71. 6 The economics of open source Code maintenance too expensive to be alone scikit-learn ~300 email/month nipy ~45 email/month joblib ~45 email/month mayavi ~30 email/month “Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.”
  • 72. 6 The economics of open source Code maintenance too expensive to be alone scikit-learn ~300 email/month nipy ~45 email/month joblib ~45 email/month mayavi ~30 email/month Your “benefits” come from a fraction of the code Data loading? Maybe? Standard algorithms? Nah Share the common code... ...to avoid dying under code Code becomes less precious with time And somebody might contribute features
  • 73. 6 Many eyes make code fast Bench WiseRF anybody? L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
  • 74. 6 6 steps to a community-driven project 1 Focus on quality 2 Build great docs and examples 3 Use github 4 Limit the technicality of your codebase 5 Releasing and packaging matter 6 Focus on your contributors, give them credit, decision power http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
  • 75. 6 Core project contributors Normalized number of commits since 2009-06 [Figure: number of commits per individual committer] Credit: Fernando Perez, Gist 5843625
  • 76. 6 The tragedy of the commons “Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.” Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted ⇒ Hard to fund, less excitement They need citation, in papers & on corporate web pages
  • 77. @GaelVaroquaux Solving problems that matter The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code scikit-learn, joblib, nilearn, ... I hope
  • 78. @GaelVaroquaux Cutting-edge ... environment ... on a budget 1 Set the goals right Don’t solve hard problems What’s your original problem?
  • 79. @GaelVaroquaux Cutting-edge ... environment ... on a budget 1 Set the goals right 2 Use the simplest technological solutions possible Be very technically sophisticated Don’t use that sophistication
  • 80. @GaelVaroquaux Cutting-edge ... environment ... on a budget 1 Set the goals right 2 Use the simplest technological solutions possible 3 Don’t forget the human factors With your users (documentation) With your contributors
  • 81. @GaelVaroquaux Cutting-edge ... environment ... on a budget 1 Set the goals right 2 Use the simplest technological solutions possible 3 Don’t forget the human factors A perfect design?