SlideShare a Scribd company logo
1 of 23
Download to read offline
Domino Data Lab November 10, 2015
Faster data science — without a cluster
Parallel programming in Python
Manojit Nandi
mnandi92@gmail.com
@mnandi92
Who am I?
Domino Data Lab November 10, 2015
• Data Scientist at STEALTHBits Technologies





• Data Science Evangelist at Domino Data Lab





• BS in Decision Science
Domino Data Lab November 10, 2015
Agenda and Goals
• Motivation
• Conceptual intro to parallelism, general principles and pitfalls
• Machine learning applications
• Demos
Goal: Leave you with principles, and practical concrete tools, that will
help you run your code much faster
Motivation
Domino Data Lab November 10, 2015
• Lots of “medium data” problems
• Can fit in memory on one machine
• Lots of naturally parallel problems

• Easy to access large machines
• Clusters are hard
• Not everything fits map-reduce
CPUs with multiple cores have become the standard in the recent
development of modern computer architectures and we can not only find
them in supercomputer facilities but also in our desktop machines at
home, and our laptops; even Apple's iPhone 5S got a 1.3 GHz Dual-core
processor in 2013.
- Sebastian Rascka
Parallel programing 101
Domino Data Lab November 10, 2015
• Think about independent tasks (hint: “for” loops are a good place to start!)
• Should be CPU-bound tasks
• Warning and pitfalls
• Not a substitute for good code
• Overhead
• Shared resource contention
• Thrashing
Source: Blaise Barney, Lawrence Livermore National Laboratory
Can parallelize at different “levels”
Domino Data Lab November 10, 2015
Will focus on algorithms, with some brief comments on
Experiments
Run against underlying libraries that parallelize
low-level operations, e.g., openBLAS, ATLAS
Write your code (or use a package) to
parallelize functions or steps within your
analysis
Run different analyses at once
Math ops
Algorithms
Experiments
Parallelize tasks to match your resources
Domino Data Lab November 10, 2015
Computing something (CPU)

Reading from disk/database

Writing to disk/database

Network IO (e.g., web scraping)
Saturating a resource will create a bottleneck
Don't oversaturate your resources
Domino Data Lab November 10, 2015
itemIDs = [1, 2, … , n]
parallel-for-each(i = itemIDs){
item = fetchData(i)
result = computeSomething(item)
saveResult(result)
}
Parallelize tasks to match your resources
Domino Data Lab November 10, 2015
items = fetchData([1, 2, … , n])
results = parallel-for-each(i = items){
computeSomething(item)
}
saveResult(results)
Avoid modifying global state
Domino Data Lab November 10, 2015
itemIDs = [0, 0, 0, 0]
parallel-for-each(i = 1:4) {
itemIDs[i] = i
}
A = [0,0,0,0]Array initialized in process 1
[0,0,0,0] [0,0,0,0][0,0,0,0][0,0,0,0]Array copied to each sub-process
[0,0,0,0] [0,0,0,3][0,0,2,0][0,1,0,0]The copy is modified
[0,0,0,0]
When all parallel tasks finish, array in original process remained unchanged
Demo
Domino Data Lab November 10, 2015
Many ML tasks are parallelized
Domino Data Lab November 10, 2015
• Cross-Validation
• Grid Search Selection
• Random Forest
• Kernel Density Estimation
• K-Means Clustering
• Probabilistic Graphical Models
• Online Learning
• Neural Networks (Backpropagation)
Harder to parallelize
Intuitive to parallelize
Cross validation
Domino Data Lab November 10, 2015
Grid search
Domino Data Lab November 10, 2015
1 10 100 1000
Linear
RBF
C
Kernel
Random forest
Domino Data Lab November 10, 2015
Parallel programing in Python
Domino Data Lab November 10, 2015
• Joblib

pythonhosted.org/joblib/parallel.html
• scikit-learn (n_jobs) scikit-learn.org
• GridSearchCV
• RandomForest
• KMeans
• cross_val_score
• IPython Notebook clusters

www.astro.washington.edu/users/vanderplas/Astr599/notebooks/
21_IPythonParallel
Demo
Domino Data Lab November 10, 2015
Parallel Programming using the GPU
Domino Data Lab November 10, 2015
• GPUs are essential to deep learning
because they can yield 10x speed-up
when training the neural networks.
• Use PyCUDA library to write Python
code that executes using the GPU.
Demo
Domino Data Lab November 10, 2015
Can compose layers of parallelism
Domino Data Lab November 10, 2015
c1 c2 cn… c1 c2 cn…c1 c2 cn…
Machines

(experiments)
Cores
RF NN GridSearched 

SVC
Demo
Domino Data Lab November 10, 2015
FYI: Parallel programing in R
Domino Data Lab November 10, 2015
• General purpose
• parallel
• foreach

cran.r-project.org/web/packages/foreach
• More specialized
• randomForest

cran.r-project.org/web/packages/randomForest
• caret

topepo.github.io/caret
• plyr

cran.r-project.org/web/packages/plyr
Domino Data Lab November 10, 2015
dominodatalab.com
blog.dominodatalab.com
@dominodatalab
Check us out!

More Related Content

What's hot

FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow softwareAnubhav Jain
 
D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015Brian Coffey
 
More Data Science with Less Engineering: Machine Learning Infrastructure at N...
More Data Science with Less Engineering: Machine Learning Infrastructure at N...More Data Science with Less Engineering: Machine Learning Infrastructure at N...
More Data Science with Less Engineering: Machine Learning Infrastructure at N...Ville Tuulos
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlowNdjido Ardo BAR
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Tensorflow presentation
Tensorflow presentationTensorflow presentation
Tensorflow presentationAhmed rebai
 
TensorFlow in Context
TensorFlow in ContextTensorFlow in Context
TensorFlow in ContextAltoros
 
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017Ashish Bansal
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataTravis Oliphant
 
Jonathan Coveney: Why Pig?
Jonathan Coveney: Why Pig?Jonathan Coveney: Why Pig?
Jonathan Coveney: Why Pig?mortardata
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Introduction To TensorFlow
Introduction To TensorFlowIntroduction To TensorFlow
Introduction To TensorFlowSpotle.ai
 
SciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with NumbaSciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with Numbastan_seibert
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016MLconf
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
 
Tensorflow internal
Tensorflow internalTensorflow internal
Tensorflow internalHyunghun Cho
 

What's hot (20)

FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow software
 
MAVRL Workshop 2014 - pymatgen-db & custodian
MAVRL Workshop 2014 - pymatgen-db & custodianMAVRL Workshop 2014 - pymatgen-db & custodian
MAVRL Workshop 2014 - pymatgen-db & custodian
 
D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015
 
More Data Science with Less Engineering: Machine Learning Infrastructure at N...
More Data Science with Less Engineering: Machine Learning Infrastructure at N...More Data Science with Less Engineering: Machine Learning Infrastructure at N...
More Data Science with Less Engineering: Machine Learning Infrastructure at N...
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Tensorflow presentation
Tensorflow presentationTensorflow presentation
Tensorflow presentation
 
TensorFlow in Context
TensorFlow in ContextTensorFlow in Context
TensorFlow in Context
 
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
TensorFlow 101
TensorFlow 101TensorFlow 101
TensorFlow 101
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
 
Jonathan Coveney: Why Pig?
Jonathan Coveney: Why Pig?Jonathan Coveney: Why Pig?
Jonathan Coveney: Why Pig?
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Introduction To TensorFlow
Introduction To TensorFlowIntroduction To TensorFlow
Introduction To TensorFlow
 
SciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with NumbaSciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with Numba
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
 
Tensorflow internal
Tensorflow internalTensorflow internal
Tensorflow internal
 

Viewers also liked

Class & Object - Intro
Class & Object - IntroClass & Object - Intro
Class & Object - IntroPRN USM
 
Object and class relationships
Object and class relationshipsObject and class relationships
Object and class relationshipsPooja mittal
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for SecurityCody Rioux
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attackQrator Labs
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alertsAlois Reitbauer
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production AlertingAlois Reitbauer
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing testAlois Reitbauer
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environmentsAlois Reitbauer
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. Alois Reitbauer
 
SSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's InceptionSSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's Inceptionroyrapoport
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metricsroyrapoport
 
Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014royrapoport
 
Monitoring Docker Application in Production
Monitoring Docker Application in ProductionMonitoring Docker Application in Production
Monitoring Docker Application in ProductionAlois Reitbauer
 
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Alois Reitbauer
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 

Viewers also liked (20)

Class & Object - Intro
Class & Object - IntroClass & Object - Intro
Class & Object - Intro
 
Object and class relationships
Object and class relationshipsObject and class relationships
Object and class relationships
 
No-Bullshit Data Science
No-Bullshit Data ScienceNo-Bullshit Data Science
No-Bullshit Data Science
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for Security
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attack
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alerts
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production Alerting
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing test
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environments
 
PyGotham 2016
PyGotham 2016PyGotham 2016
PyGotham 2016
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection.
 
SSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's InceptionSSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's Inception
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metrics
 
Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014
 
Monitoring Docker Application in Production
Monitoring Docker Application in ProductionMonitoring Docker Application in Production
Monitoring Docker Application in Production
 
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 

Similar to Parallel Programming in Python: Speeding up your analysis

Blastn plus jupyter on Docker
Blastn plus jupyter on DockerBlastn plus jupyter on Docker
Blastn plus jupyter on DockerLynn Langit
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Data processing with celery and rabbit mq
Data processing with celery and rabbit mqData processing with celery and rabbit mq
Data processing with celery and rabbit mqJeff Peck
 
NLP based Data Engineering and ETL Tool - Ask On Data.pdf
NLP based Data Engineering and ETL Tool - Ask On Data.pdfNLP based Data Engineering and ETL Tool - Ask On Data.pdf
NLP based Data Engineering and ETL Tool - Ask On Data.pdfHelicalInsight1
 
Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012TEQneers GmbH & Co. KG
 
Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Rodrigo Urubatan
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!NLJUG
 
Dev ops lessons learned - Michael Collins
Dev ops lessons learned  - Michael CollinsDev ops lessons learned  - Michael Collins
Dev ops lessons learned - Michael CollinsDevopsdays
 
Deep learning in production with the best
Deep learning in production   with the bestDeep learning in production   with the best
Deep learning in production with the bestAdam Gibson
 
Open Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache SparkOpen Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 

Similar to Parallel Programming in Python: Speeding up your analysis (20)

Blastn plus jupyter on Docker
Blastn plus jupyter on DockerBlastn plus jupyter on Docker
Blastn plus jupyter on Docker
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Stackato
StackatoStackato
Stackato
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Stackato v5
Stackato v5Stackato v5
Stackato v5
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Data processing with celery and rabbit mq
Data processing with celery and rabbit mqData processing with celery and rabbit mq
Data processing with celery and rabbit mq
 
Python ml
Python mlPython ml
Python ml
 
NLP based Data Engineering and ETL Tool - Ask On Data.pdf
NLP based Data Engineering and ETL Tool - Ask On Data.pdfNLP based Data Engineering and ETL Tool - Ask On Data.pdf
NLP based Data Engineering and ETL Tool - Ask On Data.pdf
 
Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012
 
Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
 
Dev ops lessons learned - Michael Collins
Dev ops lessons learned  - Michael CollinsDev ops lessons learned  - Michael Collins
Dev ops lessons learned - Michael Collins
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Deep learning in production with the best
Deep learning in production   with the bestDeep learning in production   with the best
Deep learning in production with the best
 
Open Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache SparkOpen Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache Spark
 
Stackato v6
Stackato v6Stackato v6
Stackato v6
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Background processing with hangfire
Background processing with hangfireBackground processing with hangfire
Background processing with hangfire
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Parallel Programming in Python: Speeding up your analysis

  • 1. Domino Data Lab November 10, 2015 Faster data science — without a cluster Parallel programming in Python Manojit Nandi mnandi92@gmail.com @mnandi92
  • 2. Who am I? Domino Data Lab November 10, 2015 • Data Scientist at STEALTHBits Technologies
 
 
 • Data Science Evangelist at Domino Data Lab
 
 
 • BS in Decision Science
  • 3. Domino Data Lab November 10, 2015 Agenda and Goals • Motivation • Conceptual intro to parallelism, general principles and pitfalls • Machine learning applications • Demos Goal: Leave you with principles, and practical concrete tools, that will help you run your code much faster
  • 4. Motivation Domino Data Lab November 10, 2015 • Lots of “medium data” problems • Can fit in memory on one machine • Lots of naturally parallel problems
 • Easy to access large machines • Clusters are hard • Not everything fits map-reduce CPUs with multiple cores have become the standard in the recent development of modern computer architectures and we can not only find them in supercomputer facilities but also in our desktop machines at home, and our laptops; even Apple's iPhone 5S got a 1.3 GHz Dual-core processor in 2013. - Sebastian Rascka
  • 5. Parallel programing 101 Domino Data Lab November 10, 2015 • Think about independent tasks (hint: “for” loops are a good place to start!) • Should be CPU-bound tasks • Warning and pitfalls • Not a substitute for good code • Overhead • Shared resource contention • Thrashing Source: Blaise Barney, Lawrence Livermore National Laboratory
  • 6. Can parallelize at different “levels” Domino Data Lab November 10, 2015 Will focus on algorithms, with some brief comments on Experiments Run against underlying libraries that parallelize low-level operations, e.g., openBLAS, ATLAS Write your code (or use a package) to parallelize functions or steps within your analysis Run different analyses at once Math ops Algorithms Experiments
  • 7. Parallelize tasks to match your resources Domino Data Lab November 10, 2015 Computing something (CPU)
 Reading from disk/database
 Writing to disk/database
 Network IO (e.g., web scraping) Saturating a resource will create a bottleneck
  • 8. Don't oversaturate your resources Domino Data Lab November 10, 2015 itemIDs = [1, 2, … , n] parallel-for-each(i = itemIDs){ item = fetchData(i) result = computeSomething(item) saveResult(result) }
  • 9. Parallelize tasks to match your resources Domino Data Lab November 10, 2015 items = fetchData([1, 2, … , n]) results = parallel-for-each(i = items){ computeSomething(item) } saveResult(results)
  • 10. Avoid modifying global state Domino Data Lab November 10, 2015 itemIDs = [0, 0, 0, 0] parallel-for-each(i = 1:4) { itemIDs[i] = i } A = [0,0,0,0]Array initialized in process 1 [0,0,0,0] [0,0,0,0][0,0,0,0][0,0,0,0]Array copied to each sub-process [0,0,0,0] [0,0,0,3][0,0,2,0][0,1,0,0]The copy is modified [0,0,0,0] When all parallel tasks finish, array in original process remained unchanged
  • 11. Demo Domino Data Lab November 10, 2015
  • 12. Many ML tasks are parallelized Domino Data Lab November 10, 2015 • Cross-Validation • Grid Search Selection • Random Forest • Kernel Density Estimation • K-Means Clustering • Probabilistic Graphical Models • Online Learning • Neural Networks (Backpropagation) Harder to parallelize Intuitive to parallelize
  • 13. Cross validation Domino Data Lab November 10, 2015
  • 14. Grid search Domino Data Lab November 10, 2015 1 10 100 1000 Linear RBF C Kernel
  • 15. Random forest Domino Data Lab November 10, 2015
  • 16. Parallel programing in Python Domino Data Lab November 10, 2015 • Joblib
 pythonhosted.org/joblib/parallel.html • scikit-learn (n_jobs) scikit-learn.org • GridSearchCV • RandomForest • KMeans • cross_val_score • IPython Notebook clusters
 www.astro.washington.edu/users/vanderplas/Astr599/notebooks/ 21_IPythonParallel
  • 17. Demo Domino Data Lab November 10, 2015
  • 18. Parallel Programming using the GPU Domino Data Lab November 10, 2015 • GPUs are essential to deep learning because they can yield 10x speed-up when training the neural networks. • Use PyCUDA library to write Python code that executes using the GPU.
  • 19. Demo Domino Data Lab November 10, 2015
  • 20. Can compose layers of parallelism Domino Data Lab November 10, 2015 c1 c2 cn… c1 c2 cn…c1 c2 cn… Machines
 (experiments) Cores RF NN GridSearched 
 SVC
  • 21. Demo Domino Data Lab November 10, 2015
  • 22. FYI: Parallel programing in R Domino Data Lab November 10, 2015 • General purpose • parallel • foreach
 cran.r-project.org/web/packages/foreach • More specialized • randomForest
 cran.r-project.org/web/packages/randomForest • caret
 topepo.github.io/caret • plyr
 cran.r-project.org/web/packages/plyr
  • 23. Domino Data Lab November 10, 2015 dominodatalab.com blog.dominodatalab.com @dominodatalab Check us out!