SlideShare a Scribd company logo
1 of 65
Download to read offline
Simple big data, in Python
Ga¨el Varoquaux
Simple big data, in Python
Ga¨el Varoquaux
This is a lie!
Please allow me to introduce myself
I’m a man of wealth and taste
I’ve been around for a long, long year
Physicist gone bad
Neuroscience, Machine learning
Worked in a software startup
Enthought: scientific computing
consulting in Python
Coder (done my share of mistake)
Mayavi, scikit-learn, joblib...
Scipy community
Chair of scipy and EuroScipy conferences
Researcher (PI) at INRIA
G Varoquaux 2
1 Machine learning in 2 words
2 Scikit-learn
3 Big data on a budget
4 The bigger picture: a community
G Varoquaux 3
1 Machine learning in 2 words
G Varoquaux 4
1 A historical perspective
Artificial Intelligence The 80s
Building decision rules
Eatable?
Mobile?
Tall?
G Varoquaux 5
1 A historical perspective
Artificial Intelligence The 80s
Building decision rules
Machine learning The 90s
Learn these from observations
G Varoquaux 5
1 A historical perspective
Artificial Intelligence The 80s
Building decision rules
Machine learning The 90s
Learn these from observations
Statistical learning 2000s
Model the noise in the observations
G Varoquaux 5
1 A historical perspective
Artificial Intelligence The 80s
Building decision rules
Machine learning The 90s
Learn these from observations
Statistical learning 2000s
Model the noise in the observations
Big data today
Many observations,
simple rules
G Varoquaux 5
1 A historical perspective
Artificial Intelligence The 80s
Building decision rules
Machine learning The 90s
Learn these from observations
Statistical learning 2000s
Model the noise in the observations
Big data today
Many observations,
simple rules
“Big data isn’t actually interesting without machine
learning”
Steve Jurvetson, VC, Silicon Valley
G Varoquaux 5
1 Machine learning
Example: face recognition
Andrew Bill Charles Dave
G Varoquaux 6
1 Machine learning
Example: face recognition
Andrew Bill Charles Dave
?G Varoquaux 6
1 A simple method
1 Store all the known (noisy) images and the names
that go with them.
2 From a new (noisy) images, find the image that is
most similar.
“Nearest neighbor” method
G Varoquaux 7
1 A simple method
1 Store all the known (noisy) images and the names
that go with them.
2 From a new (noisy) images, find the image that is
most similar.
“Nearest neighbor” method
How many errors on already-known images?
... 0: no erreurs
Test data = Train data
G Varoquaux 7
1 1st problem: noise
Signal unrelated to the prediction problem
0.0 0.5 1.0 1.5 2.0 2.5 3.0
Noise level
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Predictionaccuracy
G Varoquaux 8
1 2nd problem: number of variables
Finding a needle in a haystack
1 2 3 4 5 6 7 8 9 10
Useful fraction of the frame
0.65
0.70
0.75
0.80
0.85
0.90
0.95
Predictionaccuracy
G Varoquaux 9
1 Machine learning
Example: face recognition
Andrew Bill Charles Dave
?
Learning from numerical descriptors
Difficulties: i) noise,
ii) number of features
“supervised” task: known labels
“unsupervised” task: unknown labels
G Varoquaux 10
1 Machine learning: regression
A single descriptor:
one dimension
x
y
G Varoquaux 11
1 Machine learning: regression
A single descriptor:
one dimension
x
y
x
y
Which model to prefer?
G Varoquaux 11
1 Machine learning: regression
A single descriptor:
one dimension
x
y
x
y
Problem of “over-fitting”
Minimizing error is not always the best strategy
(learning noise)
Test data = train data
G Varoquaux 11
1 Machine learning: regression
A single descriptor:
one dimension
x
y
x
y
Prefer simple models
= concept of “regularization”
Balance the number of parameters to learn
with the amount of data
G Varoquaux 11
1 Machine learning: regression
A single descriptor:
one dimension
x
y
Two descriptors:
2 dimensions
X_1
X_2
y
More parameters
G Varoquaux 11
1 Machine learning: regression
A single descriptor:
one dimension
x
y
Two descriptors:
2 dimensions
X_1
X_2
y
More parameters
⇒ need more data
“curse of dimensionality”
G Varoquaux 11
1 Supervised learning: classification
Predicting categories, e.g. numbers
X2
X1
G Varoquaux 12
1 Unsupervised learning
Stock market structure
G Varoquaux 13
1 Unsupervised learning
Stock market structure
Unlabeled data
more common than labeled data
G Varoquaux 13
1 Recommender systems
G Varoquaux 14
1 Recommender systems
Andrew
Bill
Charles
Dave
Edie
Little overlap between users
G Varoquaux 14
1 Machine learning
Challenges
Statistics
Computational
G Varoquaux 15
1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
G Varoquaux 16
1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
An in-house data science squad
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical dept
G Varoquaux 16
1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago
An in-house data science squad
Difficulties
Recruitment
Limited resources
(people & hardware)
Risks
Bus factor
Technical dept
We need big data (and machine learning)
on a tight budget
G Varoquaux 16
2 Scikit-learn
Machine learning without learning the
machinery
c Theodore W. GrayG Varoquaux 17
2 My stack
Python, what else?
Interactive language
Easy to read / write
General purpose
G Varoquaux 18
2 My stack
Python, what else?
The scientific Python stack
numpy arrays
pandas
...
It’s about plugin things
together
G Varoquaux 18
2 scikit-learn vision
Machine learning for all
No specific application domain
No requirements in machine learning
High-quality software library
Interfaces designed for users
Community-driven development
BSD licensed, very diverse contributors
http://scikit-learn.org
G Varoquaux 19
2 A Python library
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
As easy as py
from s k l e a r n import svm
c l a s s i f i e r = svm.SVC()
c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n )
Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )
G Varoquaux 20
2 Very rich feature set
Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM
Unsupervised Learning
Clustering
Dictionary learning
Outlier detection
Model selection
Built in cross-validation
Parameter optimization
G Varoquaux 21
2 Computational performance
scikit-learn mlpy pybrain pymvpa mdp shogun
SVM 5.2 9.47 17.5 11.52 40.48 5.63
LARS 1.17 105.3 - 37.35 - -
Elastic Net 0.52 73.7 - 1.44 - -
kNN 0.57 1.41 - 0.56 0.58 1.36
PCA 0.18 - - 8.93 0.47 0.33
k-Means 1.34 0.79 ∞ - 35.75 0.68
Algorithmic optimizations
Minimizing data copies
G Varoquaux 22
2 Computational performance
scikit-learn mlpy pybrain pymvpa mdp shogun
SVM 5.2 9.47 17.5 11.52 40.48 5.63
LARS 1.17 105.3 - 37.35 - -
Elastic Net 0.52 73.7 - 1.44 - -
kNN 0.57 1.41 - 0.56 0.58 1.36
PCA 0.18 - - 8.93 0.47 0.33
k-Means 1.34 0.79 ∞ - 35.75 0.68
Algorithmic optimizations
Minimizing data copies
Random Forest fit time
0
2000
4000
6000
8000
10000
12000
14000Fittime(s)
203.01 211.53
4464.65
3342.83
1518.14
1711.94
1027.91
13427.06
10941.72
Scikit-Learn-RF
Scikit-Learn-ETs
OpenCV-RF
OpenCV-ETs
OK3-RF
OK3-ETs
Weka-RF
R-RF
Orange-RF
Scikit-Learn
Python, Cython
OpenCV
C++
OK3
C Weka
Java
randomForest
R, Fortran
Orange
Python
Figure: Gilles Louppe
G Varoquaux 22
3 Big data on a budget
G Varoquaux 23
3 Big data on a budgetBiggish
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-self computers
G Varoquaux 23
3 Big data on a budgetBiggish
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-self computers
Simple data processing patterns
G Varoquaux 23
“Big data”, but big how?
2 scenarios:
Many observations –samples
e.g. twitter
Many descriptors per observation –features
e.g. brain scans
G Varoquaux 24
3 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
G Varoquaux 25
3 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
G Varoquaux 25
3 On-line algorithms
Converges to expectations
Mini-batch = bunch observations for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10 000, 200))
scipy.cluster.vq.
kmeans(X, 10,
iter=2)
11.33 s
sklearn.cluster.
MiniBatchKMeans(n clusters=10,
n init=2).fit(X)
0.62 s
G Varoquaux 25
3 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
G Varoquaux 26
3 On-the-fly data reduction
Dropping data
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel
G Varoquaux 26
3 On-the-fly data reduction
Random projections (will average features)
sklearn.random projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature extraction.text.
HashingVectorizer
stateless: can be used in parallel
G Varoquaux 26
3 On-the-fly data reduction
Example: randomized SVD Random projection
sklearn.utils.extmath.randomized svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack=splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
G Varoquaux 26
3 Data-parallel computing
I tackle only embarrassingly parallel problem
Life is too short to worry about deadlocks
Stratification to follow
the statistical dependencies and
the data storage structure
Batch size scaled by the relevent
cache pool
- Too fine ⇒ overhead
- Too coarse ⇒ memory shortage
joblib.Parallel
G Varoquaux 27
3 Caching
Minimizing data-access latency
Never computing twice the same thing
joblib.cache
G Varoquaux 28
3 Fast access to data
Stored representation consistent with access patterns
Compression to limit bandwidth usage
- CPU are faster than data access
Data access speed is often more a
limitation than raw processing power
joblib.dump/joblib.load
G Varoquaux 29
3 Biggish iron
Our new box: 15 ke
48 cores
384G RAM
70T storage
(SSD cache on RAID controller)
Gets our work done faster than our 800 CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
G Varoquaux 30
4 The bigger picture: a
community
Helping your future self
G Varoquaux 31
4 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
∼ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $ 6 millions
COCOMO model,
http://www.ohloh.net/p/scikit-learn
G Varoquaux 32
4 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ∼ 300 email/month nipy ∼ 45 email/month
joblib ∼ 45 email/month mayavi ∼ 30 email/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
G Varoquaux 33
4 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ∼ 300 email/month nipy ∼ 45 email/month
joblib ∼ 45 email/month mayavi ∼ 30 email/month
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
G Varoquaux 33
4 Many eyes makes code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
G Varoquaux 34
4 Communities increase the knowledge pool
Even if you don’t do software,
you should worry about communities
More experts in the package
⇒ Easier recruitement
The future is open
⇒ Enhancements are possible
Meetups, conferences,
are where new ideas are born
G Varoquaux 35
4 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
G Varoquaux 36
4 Core project contributors
Credit: Fernando Perez, Gist 5843625
G Varoquaux 37
4 The tragedy of the commons
Individuals, acting independently and rationally accord-
ing to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
G Varoquaux 38
Simple big data, in Python
Beyond the lie
Machine learning gives value to (big) data
Python + scikit-learn =
- from interactive data processing (IPython notebook)
- to crazy big problems (Python + spark)
Big data will require you to understand the
data-flow patterns (access, parallelism, statistics)
Big data community addresses the human factor
@GaelVaroquaux

More Related Content

What's hot

Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data scienceGael Varoquaux
 
Processing biggish data on commodity hardware: simple Python patterns
Processing biggish data on commodity hardware: simple Python patternsProcessing biggish data on commodity hardware: simple Python patterns
Processing biggish data on commodity hardware: simple Python patternsGael Varoquaux
 
Data-driven Hypothesis Management
Data-driven Hypothesis ManagementData-driven Hypothesis Management
Data-driven Hypothesis Managementbgoncalves2
 
Connectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsConnectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsGael Varoquaux
 
Open Source Scientific Software
Open Source Scientific SoftwareOpen Source Scientific Software
Open Source Scientific SoftwareGael Varoquaux
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesAnne-Marie Tousch
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Convolutional neural network in practice
Convolutional neural network in practiceConvolutional neural network in practice
Convolutional neural network in practice남주 김
 
Towards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaTowards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaMaurizio Tomasi
 
Building a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetBuilding a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetGael Varoquaux
 
Deep Learning and Design Thinking
Deep Learning and Design ThinkingDeep Learning and Design Thinking
Deep Learning and Design ThinkingYen-lung Tsai
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsAlbert Bifet
 

What's hot (20)

Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data science
 
Processing biggish data on commodity hardware: simple Python patterns
Processing biggish data on commodity hardware: simple Python patternsProcessing biggish data on commodity hardware: simple Python patterns
Processing biggish data on commodity hardware: simple Python patterns
 
Data-driven Hypothesis Management
Data-driven Hypothesis ManagementData-driven Hypothesis Management
Data-driven Hypothesis Management
 
Connectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsConnectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis Methods
 
Open Source Scientific Software
Open Source Scientific SoftwareOpen Source Scientific Software
Open Source Scientific Software
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the Trenches
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Convolutional neural network in practice
Convolutional neural network in practiceConvolutional neural network in practice
Convolutional neural network in practice
 
Final Report-1-(1)
Final Report-1-(1)Final Report-1-(1)
Final Report-1-(1)
 
Towards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaTowards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of Julia
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Building a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetBuilding a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budget
 
DTLC-GAN
DTLC-GANDTLC-GAN
DTLC-GAN
 
Deep Learning and Design Thinking
Deep Learning and Design ThinkingDeep Learning and Design Thinking
Deep Learning and Design Thinking
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data Streams
 

Viewers also liked

Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataGael Varoquaux
 
Succeeding in academia despite doing good_software
Succeeding in academia despite doing good_softwareSucceeding in academia despite doing good_software
Succeeding in academia despite doing good_softwareGael Varoquaux
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientFabian Pedregosa
 
Strata San Jose 2016 - Reduce False Positives in Security
Strata San Jose 2016 - Reduce False Positives in Security Strata San Jose 2016 - Reduce False Positives in Security
Strata San Jose 2016 - Reduce False Positives in Security Ram Shankar Siva Kumar
 
Aiguille dans botte de foin: scikit-learn et joblib
Aiguille dans botte de foin: scikit-learn et joblibAiguille dans botte de foin: scikit-learn et joblib
Aiguille dans botte de foin: scikit-learn et joblibGael Varoquaux
 
Sondage aléatoire simple ou a probabilité égal
Sondage aléatoire simple ou a probabilité égal Sondage aléatoire simple ou a probabilité égal
Sondage aléatoire simple ou a probabilité égal hammamiahlem1
 
Je configure mes serveurs avec fabric et fabtools
Je configure mes serveurs avec fabric et fabtoolsJe configure mes serveurs avec fabric et fabtools
Je configure mes serveurs avec fabric et fabtoolsRonan Amicel
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonGael Varoquaux
 
SeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQLSeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQLParis, France
 
Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기Yong Joon Moon
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnSarah Guido
 
Presentation r markdown
Presentation r markdown Presentation r markdown
Presentation r markdown Cdiscount
 
Algorithmique_et_programmation_part2
Algorithmique_et_programmation_part2Algorithmique_et_programmation_part2
Algorithmique_et_programmation_part2Emeric Tapachès
 
Python et les bases de données non sql
Python et les bases de données non sqlPython et les bases de données non sql
Python et les bases de données non sqlbchesneau
 

Viewers also liked (19)

Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Succeeding in academia despite doing good_software
Succeeding in academia despite doing good_softwareSucceeding in academia despite doing good_software
Succeeding in academia despite doing good_software
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Empresa
EmpresaEmpresa
Empresa
 
Strata San Jose 2016 - Reduce False Positives in Security
Strata San Jose 2016 - Reduce False Positives in Security Strata San Jose 2016 - Reduce False Positives in Security
Strata San Jose 2016 - Reduce False Positives in Security
 
Aiguille dans botte de foin: scikit-learn et joblib
Aiguille dans botte de foin: scikit-learn et joblibAiguille dans botte de foin: scikit-learn et joblib
Aiguille dans botte de foin: scikit-learn et joblib
 
Sondage aléatoire simple ou a probabilité égal
Sondage aléatoire simple ou a probabilité égal Sondage aléatoire simple ou a probabilité égal
Sondage aléatoire simple ou a probabilité égal
 
Je configure mes serveurs avec fabric et fabtools
Je configure mes serveurs avec fabric et fabtoolsJe configure mes serveurs avec fabric et fabtools
Je configure mes serveurs avec fabric et fabtools
 
Python et NoSQL
Python et NoSQLPython et NoSQL
Python et NoSQL
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en Python
 
Python packaging
Python packagingPython packaging
Python packaging
 
SeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQLSeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQL
 
Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-Learn
 
Presentation r markdown
Presentation r markdown Presentation r markdown
Presentation r markdown
 
Algorithmique_et_programmation_part2
Algorithmique_et_programmation_part2Algorithmique_et_programmation_part2
Algorithmique_et_programmation_part2
 
Python et les bases de données non sql
Python et les bases de données non sqlPython et les bases de données non sql
Python et les bases de données non sql
 
R versur Python
R versur PythonR versur Python
R versur Python
 

Similar to Simple big data, in Python

Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxPyData
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Gael Varoquaux
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?Raffael Marty
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Ha Phuong
 
Tech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflowTech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflowRamdhan Rizki
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleYvonne K. Matos
 
Anomaly detection, part 1
Anomaly detection, part 1Anomaly detection, part 1
Anomaly detection, part 1David Khosid
 
Automated Identification of On-hold Self-admitted Technical Debt
Automated Identification of On-hold Self-admitted Technical DebtAutomated Identification of On-hold Self-admitted Technical Debt
Automated Identification of On-hold Self-admitted Technical DebtRungrojMaipradit1
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
Democratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDemocratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDatabricks
 
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...Victor Asanza
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
 
Simple APIs and innovative documentation
Simple APIs and innovative documentationSimple APIs and innovative documentation
Simple APIs and innovative documentationPyDataParis
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPôle Systematic Paris-Region
 
Computational decision making
Computational decision makingComputational decision making
Computational decision makingBoris Adryan
 
Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...
Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...
Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...Ricardo Guerrero Gómez-Olmedo
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...Balázs Kégl
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIAI Frontiers
 
SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...
SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...
SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...SERENEWorkshop
 

Similar to Simple big data, in Python (20)

Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)
 
Tech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflowTech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflow
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and Kaggle
 
Anomaly detection, part 1
Anomaly detection, part 1Anomaly detection, part 1
Anomaly detection, part 1
 
Automated Identification of On-hold Self-admitted Technical Debt
Automated Identification of On-hold Self-admitted Technical DebtAutomated Identification of On-hold Self-admitted Technical Debt
Automated Identification of On-hold Self-admitted Technical Debt
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Democratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDemocratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn Creator
 
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
Simple APIs and innovative documentation
Simple APIs and innovative documentationSimple APIs and innovative documentation
Simple APIs and innovative documentation
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Computational decision making
Computational decision makingComputational decision making
Computational decision making
 
Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...
Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...
Meetup Python Madrid 2018: ¿Segmentación semántica? ¿Pero de qué me estás hab...
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
 
SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...
SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...
SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber...
 

More from Gael Varoquaux

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueGael Varoquaux
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingGael Varoquaux
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing valuesGael Varoquaux
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settingsGael Varoquaux
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Gael Varoquaux
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingGael Varoquaux
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Gael Varoquaux
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible scienceGael Varoquaux
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovationGael Varoquaux
 
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...Gael Varoquaux
 

More from Gael Varoquaux (11)

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic value
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imaging
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing values
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settings
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mapping
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
 

Recently uploaded

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Simple big data, in Python

  • 1. Simple big data, in Python Ga¨el Varoquaux
  • 2. Simple big data, in Python Ga¨el Varoquaux This is a lie!
  • 3. Please allow me to introduce myself I’m a man of wealth and taste I’ve been around for a long, long year Physicist gone bad Neuroscience, Machine learning Worked in a software startup Enthought: scientific computing consulting in Python Coder (done my share of mistake) Mayavi, scikit-learn, joblib... Scipy community Chair of scipy and EuroScipy conferences Researcher (PI) at INRIA G Varoquaux 2
  • 4. 1 Machine learning in 2 words 2 Scikit-learn 3 Big data on a budget 4 The bigger picture: a community G Varoquaux 3
  • 5. 1 Machine learning in 2 words G Varoquaux 4
  • 6. 1 A historical perspective Artificial Intelligence The 80s Building decision rules Eatable? Mobile? Tall? G Varoquaux 5
  • 7. 1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations G Varoquaux 5
  • 8. 1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations G Varoquaux 5
  • 9. 1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations Big data today Many observations, simple rules G Varoquaux 5
  • 10. 1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations Big data today Many observations, simple rules “Big data isn’t actually interesting without machine learning” Steve Jurvetson, VC, Silicon Valley G Varoquaux 5
  • 11. 1 Machine learning Example: face recognition Andrew Bill Charles Dave G Varoquaux 6
  • 12. 1 Machine learning Example: face recognition Andrew Bill Charles Dave ?G Varoquaux 6
  • 13. 1 A simple method 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) images, find the image that is most similar. “Nearest neighbor” method G Varoquaux 7
  • 14. 1 A simple method 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) images, find the image that is most similar. “Nearest neighbor” method How many errors on already-known images? ... 0: no erreurs Test data = Train data G Varoquaux 7
  • 15. 1 1st problem: noise Signal unrelated to the prediction problem 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Noise level 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Predictionaccuracy G Varoquaux 8
  • 16. 1 2nd problem: number of variables Finding a needle in a haystack 1 2 3 4 5 6 7 8 9 10 Useful fraction of the frame 0.65 0.70 0.75 0.80 0.85 0.90 0.95 Predictionaccuracy G Varoquaux 9
  • 17. 1 Machine learning Example: face recognition Andrew Bill Charles Dave ? Learning from numerical descriptors Difficulties: i) noise, ii) number of features “supervised” task: known labels “unsupervised” task: unknown labels G Varoquaux 10
  • 18. 1 Machine learning: regression A single descriptor: one dimension x y G Varoquaux 11
  • 19. 1 Machine learning: regression A single descriptor: one dimension x y x y Which model to prefer? G Varoquaux 11
  • 20. 1 Machine learning: regression A single descriptor: one dimension x y x y Problem of “over-fitting” Minimizing error is not always the best strategy (learning noise) Test data = train data G Varoquaux 11
  • 21. 1 Machine learning: regression A single descriptor: one dimension x y x y Prefer simple models = concept of “regularization” Balance the number of parameters to learn with the amount of data G Varoquaux 11
  • 22. 1 Machine learning: regression A single descriptor: one dimension x y Two descriptors: 2 dimensions X_1 X_2 y More parameters G Varoquaux 11
  • 23. 1 Machine learning: regression A single descriptor: one dimension x y Two descriptors: 2 dimensions X_1 X_2 y More parameters ⇒ need more data “curse of dimensionality” G Varoquaux 11
  • 24. 1 Supervised learning: classification Predicting categories, e.g. numbers X2 X1 G Varoquaux 12
  • 25. 1 Unsupervised learning Stock market structure G Varoquaux 13
  • 26. 1 Unsupervised learning Stock market structure Unlabeled data more common than labeled data G Varoquaux 13
  • 27. 1 Recommender systems G Varoquaux 14
  • 28. 1 Recommender systems Andrew Bill Charles Dave Edie Little overlap between users G Varoquaux 14
  • 30. 1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I have written a year ago G Varoquaux 16
  • 31. 1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I have written a year ago An in-house data science squad Difficulties Recruitment Limited resources (people & hardware) Risks Bus factor Technical dept G Varoquaux 16
  • 32. 1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I have written a year ago An in-house data science squad Difficulties Recruitment Limited resources (people & hardware) Risks Bus factor Technical dept We need big data (and machine learning) on a tight budget G Varoquaux 16
  • 33. 2 Scikit-learn Machine learning without learning the machinery c Theodore W. GrayG Varoquaux 17
  • 34. 2 My stack Python, what else? Interactive language Easy to read / write General purpose G Varoquaux 18
  • 35. 2 My stack Python, what else? The scientific Python stack numpy arrays pandas ... It’s about plugin things together G Varoquaux 18
  • 36. 2 scikit-learn vision Machine learning for all No specific application domain No requirements in machine learning High-quality software library Interfaces designed for users Community-driven development BSD licensed, very diverse contributors http://scikit-learn.org G Varoquaux 19
  • 37. 2 A Python library A library, not a program More expressive and flexible Easy to include in an ecosystem As easy as py from s k l e a r n import svm c l a s s i f i e r = svm.SVC() c l a s s i f i e r . f i t ( X t r a i n , Y t r a i n ) Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t ) G Varoquaux 20
  • 38. 2 Very rich feature set Supervised learning Decision trees (Random-Forest, Boosted Tree) Linear models SVM Unsupervised Learning Clustering Dictionary learning Outlier detection Model selection Built in cross-validation Parameter optimization G Varoquaux 21
  • 39. 2 Computational performance scikit-learn mlpy pybrain pymvpa mdp shogun SVM 5.2 9.47 17.5 11.52 40.48 5.63 LARS 1.17 105.3 - 37.35 - - Elastic Net 0.52 73.7 - 1.44 - - kNN 0.57 1.41 - 0.56 0.58 1.36 PCA 0.18 - - 8.93 0.47 0.33 k-Means 1.34 0.79 ∞ - 35.75 0.68 Algorithmic optimizations Minimizing data copies G Varoquaux 22
  • 40. 2 Computational performance scikit-learn mlpy pybrain pymvpa mdp shogun SVM 5.2 9.47 17.5 11.52 40.48 5.63 LARS 1.17 105.3 - 37.35 - - Elastic Net 0.52 73.7 - 1.44 - - kNN 0.57 1.41 - 0.56 0.58 1.36 PCA 0.18 - - 8.93 0.47 0.33 k-Means 1.34 0.79 ∞ - 35.75 0.68 Algorithmic optimizations Minimizing data copies Random Forest fit time 0 2000 4000 6000 8000 10000 12000 14000Fittime(s) 203.01 211.53 4464.65 3342.83 1518.14 1711.94 1027.91 13427.06 10941.72 Scikit-Learn-RF Scikit-Learn-ETs OpenCV-RF OpenCV-ETs OK3-RF OK3-ETs Weka-RF R-RF Orange-RF Scikit-Learn Python, Cython OpenCV C++ OK3 C Weka Java randomForest R, Fortran Orange Python Figure: Gilles Louppe G Varoquaux 22
  • 41. 3 Big data on a budget G Varoquaux 23
  • 42. 3 Big data on a budgetBiggish “Big data”: Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-self computers G Varoquaux 23
  • 43. 3 Big data on a budgetBiggish “Big data”: Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-self computers Simple data processing patterns G Varoquaux 23
  • 44. “Big data”, but big how? 2 scenarios: Many observations –samples e.g. twitter Many descriptors per observation –features e.g. brain scans G Varoquaux 24
  • 45. 3 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? G Varoquaux 25
  • 46. 3 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? No: just do a running mean G Varoquaux 25
  • 47. 3 On-line algorithms Converges to expectations Mini-batch = bunch observations for vectorization Example: K-Means clustering X = np.random.normal(size=(10 000, 200)) scipy.cluster.vq. kmeans(X, 10, iter=2) 11.33 s sklearn.cluster. MiniBatchKMeans(n clusters=10, n init=2).fit(X) 0.62 s G Varoquaux 25
  • 48. 3 On-the-fly data reduction Big data is often I/O bound Layer memory access CPU caches RAM Local disks Distant storage Less data also means less work G Varoquaux 26
  • 49. 3 On-the-fly data reduction Dropping data 1 loop: take a random fraction of the data 2 run algorithm on that fraction 3 aggregate results across sub-samplings Looks like bagging: bootstrap aggregation Exploits redundancy across observations Run the loop in parallel G Varoquaux 26
  • 50. 3 On-the-fly data reduction Random projections (will average features) sklearn.random projection random linear combinations of the features Fast clustering of features sklearn.cluster.WardAgglomeration on images: super-pixel strategy Hashing when observations have varying size (e.g. words) sklearn.feature extraction.text. HashingVectorizer stateless: can be used in parallel G Varoquaux 26
  • 51. 3 On-the-fly data reduction Example: randomized SVD Random projection sklearn.utils.extmath.randomized svd X = np.random.normal(size=(50000, 200)) %timeit lapack = linalg.svd(X, full matrices=False) 1 loops, best of 3: 6.09 s per loop %timeit arpack=splinalg.svds(X, 10) 1 loops, best of 3: 2.49 s per loop %timeit randomized = randomized svd(X, 10) 1 loops, best of 3: 303 ms per loop linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000 0.0022360679774997738 linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000 0.0022121161221386925 G Varoquaux 26
  • 52. 3 Data-parallel computing I tackle only embarrassingly parallel problem Life is too short to worry about deadlocks Stratification to follow the statistical dependencies and the data storage structure Batch size scaled by the relevent cache pool - Too fine ⇒ overhead - Too coarse ⇒ memory shortage joblib.Parallel G Varoquaux 27
  • 53. 3 Caching Minimizing data-access latency Never computing twice the same thing joblib.cache G Varoquaux 28
  • 54. 3 Fast access to data Stored representation consistent with access patterns Compression to limit bandwidth usage - CPU are faster than data access Data access speed is often more a limitation than raw processing power joblib.dump/joblib.load G Varoquaux 29
  • 55. 3 Biggish iron Our new box: 15 ke 48 cores 384G RAM 70T storage (SSD cache on RAID controller) Gets our work done faster than our 800 CPU cluster It’s the access patterns! “Nobody ever got fired for using Hadoop on a cluster” A. Rowstron et al., HotCDP ’12 G Varoquaux 30
  • 56. 4 The bigger picture: a community Helping your future self G Varoquaux 31
  • 57. 4 Community-based development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors ∼ 12 core contributors 1 full-time INRIA programmer from the start Estimated cost of development: $ 6 millions COCOMO model, http://www.ohloh.net/p/scikit-learn G Varoquaux 32
  • 58. 4 The economics of open source Code maintenance too expensive to be alone scikit-learn ∼ 300 email/month nipy ∼ 45 email/month joblib ∼ 45 email/month mayavi ∼ 30 email/month “Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.” G Varoquaux 33
  • 59. 4 The economics of open source Code maintenance too expensive to be alone scikit-learn ∼ 300 email/month nipy ∼ 45 email/month joblib ∼ 45 email/month mayavi ∼ 30 email/month Your “benefits” come from a fraction of the code Data loading? Maybe? Standard algorithms? Nah Share the common code... ...to avoid dying under code Code becomes less precious with time And somebody might contribute features G Varoquaux 33
  • 60. 4 Many eyes makes code fast L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer G Varoquaux 34
  • 61. 4 Communities increase the knowledge pool Even if you don’t do software, you should worry about communities More experts in the package ⇒ Easier recruitement The future is open ⇒ Enhancements are possible Meetups, conferences, are where new ideas are born G Varoquaux 35
  • 62. 4 6 steps to a community-driven project 1 Focus on quality 2 Build great docs and examples 3 Use github 4 Limit the technicality of your codebase 5 Releasing and packaging matter 6 Focus on your contributors, give them credit, decision power http://www.slideshare.net/GaelVaroquaux/ scikit-learn-dveloppement-communautaire G Varoquaux 36
  • 63. 4 Core project contributors Credit: Fernando Perez, Gist 5843625 G Varoquaux 37
  • 64. 4 The tragedy of the commons Individuals, acting independently and rationally accord- ing to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource. Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted ⇒ Hard to fund, less excitement They need citation, in papers & on corporate web pages G Varoquaux 38
  • 65. Simple big data, in Python Beyond the lie Machine learning gives value to (big) data Python + scikit-learn = - from interactive data processing (IPython notebook) - to crazy big problems (Python + spark) Big data will require you to understand the data-flow patterns (access, parallelism, statistics) Big data community addresses the human factor @GaelVaroquaux