SlideShare a Scribd company logo
Joblib
Toward efficient computing
From laptop to cloud
Alexandre Abadie
    
Overview of Joblib
Recent major improvements
What's next
Overview of Joblib
Embarrassingly Parallel computing helper
Efficient disk caching to avoid recomputation
Fast I/O persistence
No dependencies, optimized for numpy arrays
Joblib is the parallel backend used by Scikit-Learn
https://pythonhosted.org/joblib/
Parallel helper
Available backends: threading and multiprocessing (default)
>>>fromjoblibimportParallel,delayed
>>>frommathimportsqrt
>>>Parallel(n_jobs=3,verbose=50)(delayed(sqrt)(i**2)foriinrange(6))
[Parallel(n_jobs=3)]:Done 1tasks |elapsed: 0.0s
[...]
[Parallel(n_jobs=3)]:Done 6outof 6|elapsed: 0.0sfinished
[0.0,1.0,2.0,3.0,4.0,5.0]
Caching on disk
Use a memoize pattern with the Memory object
>>>fromjoblibimportMemory
>>>mem=Memory(cachedir='/tmp/joblib')
>>>importnumpyasnp
>>>a=np.vander(np.arange(3)).astype(np.float)
>>>square=mem.cache(np.square)
>>>b=square(a)
________________________________________________________________________________
[Memory]Callingsquare...
square(array([[0., 0., 1.],
[1., 1., 1.],
[4., 2., 1.]]))
___________________________________________________________square-0...s,0.0min
>>>c=square(a)#norecomputation
Use md5 hash of input parameters
Results are persisted on disk
Persistence
Convert/create an arbitrary object into/from a string of bytes
Persistence in Joblib is based on pickle and Pickler/Unpickler subclasses
>>>importnumpyasnp
>>>importjoblib
>>>obj=[('a',[1,2,3]),('b',np.arange(10))]
>>>joblib.dump(obj,'/tmp/test.pkl')
['/tmp/test.pkl','/tmp/test.pkl_01.npy']
>>>joblib.load('/tmp/test.pkl')
[('a',[1,2,3]),('b',array([0,1,2,3,4,5,6,7,8,9]))]
Use compression for fast I/O
>>>joblib.dump(obj,'/tmp/test.pkl',compress=True,cache_size=0)
['/tmp/test.pkl','/tmp/test.pkl_01.npy.z']
>>>joblib.load('/tmp/test.pkl')
Access numpy arrays with np.memmapfor out-of-core computing or for
sharing between multiple workers
Recent major improvements
Persistence
Custom parallel backends
arriving in version 0.10.0
Persistence refactoring
Until 0.9.4:
An object with multiple arrays is persisted in multiple files
Only zlib compression available
Memory copies with compression
Persistence refactoring
Until 0.9.4:
An object with multiple arrays is persisted in multiple files
Only zlib compression available
Memory copies with compression
In 0.10.0 (not released yet):
An object with multiple arrays goes in a single file
Support of all compression methods provided by the python
standard library
No memory copies with compression
https://github.com/joblib/joblib/pull/260
Persistence refactoring strategy
Write numpy array buffer interleaved in the pickle stream
⇒ all arrays in a single file
Caveat: Not compatible with pickle format
Dump/reconstruct the array by chunks of bytes using numpy functions
⇒ avoid memory copies
Before: memory copies
Now: no memory copies
Credits: Memory profiler
Persistence in a single file
>>>importnumpyasnp
>>>importjoblib
>>>obj=[np.ones((5000,5000)),np.random.random((5000,5000))]
#only1fileisgenerated:
>>>joblib.dump(obj,'/tmp/test.pkl',compress=True)
['/tmp/test.pkl']
>>>joblib.load('/tmp/test.pkl')
[array([[1., 1.,..., 1., 1.]],
array([[0.47006195, 0.5436392,..., 0.1218267, 0.48592789]])]
useful with scikit-learn estimators
simpler management of backup files
robust when using memory map on distributed file systems
Compression formats
New supported compression formats
⇒ gzip, bz2, lzma and xz
Automatic compression based on file extension
Automatic detection of compression format when loading
Valid compression file formats
Slower than zlib
Compression formats
New supported compression formats
⇒ gzip, bz2, lzma and xz
Automatic compression based on file extension
Automatic detection of compression format when loading
Valid compression file formats
Slower than zlib
Example with gzipcompression:
>>>joblib.dump(obj,'/tmp/test.pkl.gz',compress=('gzip',3))
>>>joblib.load('/tmp/test.pkl.gz')
[array([[1., 1.,..., 1., 1.]],
array([[0.47006195, 0.5436392,..., 0.1218267, 0.48592789]])]
>>>#orwithfileextensiondetection
>>>joblib.dump(obj,'/tmp/test.pkl.gz')
Performance comparison: Memory footprint and Speed
Joblib persists faster and with low extra memory consumption
Performance is data dependent
http://gael-varoquaux.info/programming/new_low-
overhead_persistence_in_joblib_for_big_data.html
Parallel backends
Custom parallel backends
Until 0.9.4:
Only threading and multiprocessing backends
Not extensible
Active backend cannot be changed easily at a high level
Custom parallel backends
Until 0.9.4:
Only threading and multiprocessing backends
Not extensible
Active backend cannot be changed easily at a high level
In 0.10.0:
Common API using ParallelBackendBaseinterface
Use withto set the active backend in a context manager
New backends for:
distributed implemented by Matthieu Rocklin
ipyparallel implemented by Min RK
YARN implemented by Niels Zielemaker
https://github.com/joblib/joblib/pull/306 contributed by Niels Zielemaker
Principle
1. Subclass ParallelBackendBase:
classExampleParallelBackend(ParallelBackendBase):
"""Exampleofminimumparallelbackend."""
defconfigure(self,n_jobs=1,parallel=None,**backend_args):
self.n_jobs=self.effective_n_jobs(n_jobs)
self.parallel=parallel
returnn_jobs
defapply_async(self,func,callback=None):
"""Scheduleafunctoberun"""
result=func()#dependsonthebackend
ifcallback:
callback(result)
returnresult
2. Register your backend:
>>>register_parallel_backend("example_backend",ExampleParallelBackend)
IPython parallel backend
Integration for Joblib available in version 5.1
1. Launch a 5 engines cluster:
$ipcontroller&
$ipclusterengines-n5
2. Run the following script:
importtime
importipyparallelasipp
fromipyparallel.joblibimportregisterasregister_joblib
fromjoblibimportparallel_backend,Parallel,delayed
#Registeripyparallelbackend
register_joblib()
#Startthejob
withparallel_backend("ipyparallel"):
Parallel(n_jobs=20,verbose=50)(delayed(time.sleep)(1)foriinrange(10))
Demo
https://github.com/aabadie/ipyparallel-
cloud/blob/master/examples/sklearn_parameter_search_local_ipyparallel.ipynb
What's next
Persistence in file objects
1. With regular file object:
>>>withopen('/tmp/test.pkl','wb')asfo:
... joblib.dump(obj,fo)
>>>withopen('/tmp/test.pkl','rb')asfo:
... joblib.load(fo)
Also works with gzip.GzipFile, bz2.BZ2File and lzma.LZMAFile
https://github.com/joblib/joblib/pull/351
Persistence in file objects
1. With regular file object:
>>>withopen('/tmp/test.pkl','wb')asfo:
... joblib.dump(obj,fo)
>>>withopen('/tmp/test.pkl','rb')asfo:
... joblib.load(fo)
Also works with gzip.GzipFile, bz2.BZ2File and lzma.LZMAFile
https://github.com/joblib/joblib/pull/351
2. Or with any file-like object, e.g exposing read/write functions:
>>>withRemoteStoreObject(hostname,port,obj_id)asrso:
... joblib.dump(obj,rso)
>>>withRemoteStoreObject(hostname,port,obj_id)asrso:
... joblib.load(rso)
⇒Example: blob databases, remote storage
Joblib in the cloud
Share persistence files
    
Use computing ressources
    
    
Conclusion
Persistence of numpy arrays in a single file...
... and soon in file objects
New compression formats: gzip, bz2, xz, lzma
... extend to faster implementations (blosc) ?
Persists without memory copies
... manipulate bigger data
New parallel backends: distributed, ipyparallel, YARN...
... new ones to come ?
      Thanks !
               

More Related Content

What's hot

Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Strategies for Backing Up MongoDB
Strategies for Backing Up MongoDBStrategies for Backing Up MongoDB
Strategies for Backing Up MongoDB
MongoDB
 
Developing High Performance Application with Aerospike & Go
Developing High Performance Application with Aerospike & GoDeveloping High Performance Application with Aerospike & Go
Developing High Performance Application with Aerospike & Go
Chris Stivers
 
Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Kaiyao Huang
 
Guava Overview Part 2 Bucharest JUG #2
Guava Overview Part 2 Bucharest JUG #2 Guava Overview Part 2 Bucharest JUG #2
Guava Overview Part 2 Bucharest JUG #2
Andrei Savu
 
Seastar @ SF/BA C++UG
Seastar @ SF/BA C++UGSeastar @ SF/BA C++UG
Seastar @ SF/BA C++UG
Avi Kivity
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
Ontico
 
AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019
AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019
AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019
Sangwook Kim
 
Galaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWSGalaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWS
Enis Afgan
 
System ctl file that is created at time of installation
System ctl file that is created at time of installationSystem ctl file that is created at time of installation
System ctl file that is created at time of installation
AshishRanjan975320
 
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Ontico
 
Resource Management with Systemd and cgroups
Resource Management with Systemd and cgroupsResource Management with Systemd and cgroups
Resource Management with Systemd and cgroups
Tsung-en Hsiao
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Tzach Livyatan
 
Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018
Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018
Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018
DevOpsDays Tel Aviv
 
Automatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPyAutomatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPy
Preferred Networks
 
SiteGround Tech TeamBuilding
SiteGround Tech TeamBuildingSiteGround Tech TeamBuilding
SiteGround Tech TeamBuilding
Marian Marinov
 
MongoDB as Message Queue
MongoDB as Message QueueMongoDB as Message Queue
MongoDB as Message Queue
MongoDB
 
Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...
Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...
Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...
mCloud
 

What's hot (19)

Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
 
Strategies for Backing Up MongoDB
Strategies for Backing Up MongoDBStrategies for Backing Up MongoDB
Strategies for Backing Up MongoDB
 
Developing High Performance Application with Aerospike & Go
Developing High Performance Application with Aerospike & GoDeveloping High Performance Application with Aerospike & Go
Developing High Performance Application with Aerospike & Go
 
Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控
 
Guava Overview Part 2 Bucharest JUG #2
Guava Overview Part 2 Bucharest JUG #2 Guava Overview Part 2 Bucharest JUG #2
Guava Overview Part 2 Bucharest JUG #2
 
Seastar @ SF/BA C++UG
Seastar @ SF/BA C++UGSeastar @ SF/BA C++UG
Seastar @ SF/BA C++UG
 
Guava
GuavaGuava
Guava
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
 
AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019
AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019
AppOS: PostgreSQL Extension for Scalable File I/O @ PGConf.Asia 2019
 
Galaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWSGalaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWS
 
System ctl file that is created at time of installation
System ctl file that is created at time of installationSystem ctl file that is created at time of installation
System ctl file that is created at time of installation
 
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
 
Resource Management with Systemd and cgroups
Resource Management with Systemd and cgroupsResource Management with Systemd and cgroups
Resource Management with Systemd and cgroups
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 
Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018
Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018
Data Science in DevOps/SysOps - Boaz Shuster - DevOpsDays Tel Aviv 2018
 
Automatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPyAutomatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPy
 
SiteGround Tech TeamBuilding
SiteGround Tech TeamBuildingSiteGround Tech TeamBuilding
SiteGround Tech TeamBuilding
 
MongoDB as Message Queue
MongoDB as Message QueueMongoDB as Message Queue
MongoDB as Message Queue
 
Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...
Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...
Developers’ mDay 2019. - Rastko Vasiljević, SuperAdmins – Infrastructure as c...
 

Similar to Joblib Toward efficient computing : from laptop to cloud

Joblib for cloud computing
Joblib for cloud computingJoblib for cloud computing
Joblib for cloud computing
Alexandre Abadie
 
PyParis2017 / Cloud computing made easy in Joblib, by Alexandre Abadie
PyParis2017 / Cloud computing made easy in Joblib, by Alexandre AbadiePyParis2017 / Cloud computing made easy in Joblib, by Alexandre Abadie
PyParis2017 / Cloud computing made easy in Joblib, by Alexandre Abadie
Pôle Systematic Paris-Region
 
Why learn Internals?
Why learn Internals?Why learn Internals?
Why learn Internals?
Shaul Rosenzwieg
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time OptimizationKan-Ru Chen
 
Adobe AEM Maintenance - Customer Care Office Hours
Adobe AEM Maintenance - Customer Care Office HoursAdobe AEM Maintenance - Customer Care Office Hours
Adobe AEM Maintenance - Customer Care Office Hours
Andrew Khoury
 
OSMC 2012 | Neues in Nagios 4.0 by Andreas Ericsson
OSMC 2012 | Neues in Nagios 4.0 by Andreas EricssonOSMC 2012 | Neues in Nagios 4.0 by Andreas Ericsson
OSMC 2012 | Neues in Nagios 4.0 by Andreas Ericsson
NETWAYS
 
Gnocchi v3 brownbag
Gnocchi v3 brownbagGnocchi v3 brownbag
Gnocchi v3 brownbag
Gordon Chung
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
Linaro
 
MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...
MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...
MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...
MongoDB
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
Vadim Semenov
 
ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718
Jinrong Ye
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
Equnix Business Solutions
 
Introduction to Ansible - (dev ops for people who hate devops)
Introduction to Ansible - (dev ops for people who hate devops)Introduction to Ansible - (dev ops for people who hate devops)
Introduction to Ansible - (dev ops for people who hate devops)
Jude A. Goonawardena
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward
 
Percona XtraBackup - New Features and Improvements
Percona XtraBackup - New Features and ImprovementsPercona XtraBackup - New Features and Improvements
Percona XtraBackup - New Features and Improvements
Marcelo Altmann
 
Become a GC Hero
Become a GC HeroBecome a GC Hero
Become a GC Hero
Tier1app
 
Ansible is the simplest way to automate. MoldCamp, 2015
Ansible is the simplest way to automate. MoldCamp, 2015Ansible is the simplest way to automate. MoldCamp, 2015
Ansible is the simplest way to automate. MoldCamp, 2015
Alex S
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Mydbops
 
OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...
OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...
OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...
OpenNebula Project
 
Redis深入浅出
Redis深入浅出Redis深入浅出
Redis深入浅出
iammutex
 

Similar to Joblib Toward efficient computing : from laptop to cloud (20)

Joblib for cloud computing
Joblib for cloud computingJoblib for cloud computing
Joblib for cloud computing
 
PyParis2017 / Cloud computing made easy in Joblib, by Alexandre Abadie
PyParis2017 / Cloud computing made easy in Joblib, by Alexandre AbadiePyParis2017 / Cloud computing made easy in Joblib, by Alexandre Abadie
PyParis2017 / Cloud computing made easy in Joblib, by Alexandre Abadie
 
Why learn Internals?
Why learn Internals?Why learn Internals?
Why learn Internals?
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time Optimization
 
Adobe AEM Maintenance - Customer Care Office Hours
Adobe AEM Maintenance - Customer Care Office HoursAdobe AEM Maintenance - Customer Care Office Hours
Adobe AEM Maintenance - Customer Care Office Hours
 
OSMC 2012 | Neues in Nagios 4.0 by Andreas Ericsson
OSMC 2012 | Neues in Nagios 4.0 by Andreas EricssonOSMC 2012 | Neues in Nagios 4.0 by Andreas Ericsson
OSMC 2012 | Neues in Nagios 4.0 by Andreas Ericsson
 
Gnocchi v3 brownbag
Gnocchi v3 brownbagGnocchi v3 brownbag
Gnocchi v3 brownbag
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
 
MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...
MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...
MongoDB 101 & Beyond: Get Started in MongoDB 3.0, Preview 3.2 & Demo of Ops M...
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
 
Introduction to Ansible - (dev ops for people who hate devops)
Introduction to Ansible - (dev ops for people who hate devops)Introduction to Ansible - (dev ops for people who hate devops)
Introduction to Ansible - (dev ops for people who hate devops)
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Percona XtraBackup - New Features and Improvements
Percona XtraBackup - New Features and ImprovementsPercona XtraBackup - New Features and Improvements
Percona XtraBackup - New Features and Improvements
 
Become a GC Hero
Become a GC HeroBecome a GC Hero
Become a GC Hero
 
Ansible is the simplest way to automate. MoldCamp, 2015
Ansible is the simplest way to automate. MoldCamp, 2015Ansible is the simplest way to automate. MoldCamp, 2015
Ansible is the simplest way to automate. MoldCamp, 2015
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
 
OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...
OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...
OpenNebulaConf 2016 - Measuring and tuning VM performance by Boyan Krosnov, S...
 
Redis深入浅出
Redis深入浅出Redis深入浅出
Redis深入浅出
 

Recently uploaded

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 

Joblib Toward efficient computing : from laptop to cloud

  • 1. Joblib Toward efficient computing From laptop to cloud Alexandre Abadie     
  • 2. Overview of Joblib Recent major improvements What's next
  • 3. Overview of Joblib Embarrassingly Parallel computing helper Efficient disk caching to avoid recomputation Fast I/O persistence No dependencies, optimized for numpy arrays Joblib is the parallel backend used by Scikit-Learn https://pythonhosted.org/joblib/
  • 4. Parallel helper Available backends: threading and multiprocessing (default) >>>fromjoblibimportParallel,delayed >>>frommathimportsqrt >>>Parallel(n_jobs=3,verbose=50)(delayed(sqrt)(i**2)foriinrange(6)) [Parallel(n_jobs=3)]:Done 1tasks |elapsed: 0.0s [...] [Parallel(n_jobs=3)]:Done 6outof 6|elapsed: 0.0sfinished [0.0,1.0,2.0,3.0,4.0,5.0]
  • 5. Caching on disk Use a memoize pattern with the Memory object >>>fromjoblibimportMemory >>>mem=Memory(cachedir='/tmp/joblib') >>>importnumpyasnp >>>a=np.vander(np.arange(3)).astype(np.float) >>>square=mem.cache(np.square) >>>b=square(a) ________________________________________________________________________________ [Memory]Callingsquare... square(array([[0., 0., 1.], [1., 1., 1.], [4., 2., 1.]])) ___________________________________________________________square-0...s,0.0min >>>c=square(a)#norecomputation Use md5 hash of input parameters Results are persisted on disk
  • 6. Persistence Convert/create an arbitrary object into/from a string of bytes Persistence in Joblib is based on pickle and Pickler/Unpickler subclasses >>>importnumpyasnp >>>importjoblib >>>obj=[('a',[1,2,3]),('b',np.arange(10))] >>>joblib.dump(obj,'/tmp/test.pkl') ['/tmp/test.pkl','/tmp/test.pkl_01.npy'] >>>joblib.load('/tmp/test.pkl') [('a',[1,2,3]),('b',array([0,1,2,3,4,5,6,7,8,9]))] Use compression for fast I/O >>>joblib.dump(obj,'/tmp/test.pkl',compress=True,cache_size=0) ['/tmp/test.pkl','/tmp/test.pkl_01.npy.z'] >>>joblib.load('/tmp/test.pkl') Access numpy arrays with np.memmapfor out-of-core computing or for sharing between multiple workers
  • 7. Recent major improvements Persistence Custom parallel backends arriving in version 0.10.0
  • 8. Persistence refactoring Until 0.9.4: An object with multiple arrays is persisted in multiple files Only zlib compression available Memory copies with compression
  • 9. Persistence refactoring Until 0.9.4: An object with multiple arrays is persisted in multiple files Only zlib compression available Memory copies with compression In 0.10.0 (not released yet): An object with multiple arrays goes in a single file Support of all compression methods provided by the python standard library No memory copies with compression https://github.com/joblib/joblib/pull/260
  • 10. Persistence refactoring strategy Write numpy array buffer interleaved in the pickle stream ⇒ all arrays in a single file Caveat: Not compatible with pickle format Dump/reconstruct the array by chunks of bytes using numpy functions ⇒ avoid memory copies
  • 12. Now: no memory copies Credits: Memory profiler
  • 13. Persistence in a single file >>>importnumpyasnp >>>importjoblib >>>obj=[np.ones((5000,5000)),np.random.random((5000,5000))] #only1fileisgenerated: >>>joblib.dump(obj,'/tmp/test.pkl',compress=True) ['/tmp/test.pkl'] >>>joblib.load('/tmp/test.pkl') [array([[1., 1.,..., 1., 1.]], array([[0.47006195, 0.5436392,..., 0.1218267, 0.48592789]])] useful with scikit-learn estimators simpler management of backup files robust when using memory map on distributed file systems
  • 14. Compression formats New supported compression formats ⇒ gzip, bz2, lzma and xz Automatic compression based on file extension Automatic detection of compression format when loading Valid compression file formats Slower than zlib
  • 15. Compression formats New supported compression formats ⇒ gzip, bz2, lzma and xz Automatic compression based on file extension Automatic detection of compression format when loading Valid compression file formats Slower than zlib Example with gzipcompression: >>>joblib.dump(obj,'/tmp/test.pkl.gz',compress=('gzip',3)) >>>joblib.load('/tmp/test.pkl.gz') [array([[1., 1.,..., 1., 1.]], array([[0.47006195, 0.5436392,..., 0.1218267, 0.48592789]])] >>>#orwithfileextensiondetection >>>joblib.dump(obj,'/tmp/test.pkl.gz')
  • 16. Performance comparison: Memory footprint and Speed Joblib persists faster and with low extra memory consumption Performance is data dependent http://gael-varoquaux.info/programming/new_low- overhead_persistence_in_joblib_for_big_data.html
  • 18. Custom parallel backends Until 0.9.4: Only threading and multiprocessing backends Not extensible Active backend cannot be changed easily at a high level
  • 19. Custom parallel backends Until 0.9.4: Only threading and multiprocessing backends Not extensible Active backend cannot be changed easily at a high level In 0.10.0: Common API using ParallelBackendBaseinterface Use withto set the active backend in a context manager New backends for: distributed implemented by Matthieu Rocklin ipyparallel implemented by Min RK YARN implemented by Niels Zielemaker https://github.com/joblib/joblib/pull/306 contributed by Niels Zielemaker
  • 21. IPython parallel backend Integration for Joblib available in version 5.1 1. Launch a 5 engines cluster: $ipcontroller& $ipclusterengines-n5 2. Run the following script: importtime importipyparallelasipp fromipyparallel.joblibimportregisterasregister_joblib fromjoblibimportparallel_backend,Parallel,delayed #Registeripyparallelbackend register_joblib() #Startthejob withparallel_backend("ipyparallel"): Parallel(n_jobs=20,verbose=50)(delayed(time.sleep)(1)foriinrange(10))
  • 24. Persistence in file objects 1. With regular file object: >>>withopen('/tmp/test.pkl','wb')asfo: ... joblib.dump(obj,fo) >>>withopen('/tmp/test.pkl','rb')asfo: ... joblib.load(fo) Also works with gzip.GzipFile, bz2.BZ2File and lzma.LZMAFile https://github.com/joblib/joblib/pull/351
  • 25. Persistence in file objects 1. With regular file object: >>>withopen('/tmp/test.pkl','wb')asfo: ... joblib.dump(obj,fo) >>>withopen('/tmp/test.pkl','rb')asfo: ... joblib.load(fo) Also works with gzip.GzipFile, bz2.BZ2File and lzma.LZMAFile https://github.com/joblib/joblib/pull/351 2. Or with any file-like object, e.g exposing read/write functions: >>>withRemoteStoreObject(hostname,port,obj_id)asrso: ... joblib.dump(obj,rso) >>>withRemoteStoreObject(hostname,port,obj_id)asrso: ... joblib.load(rso) ⇒Example: blob databases, remote storage
  • 26. Joblib in the cloud Share persistence files      Use computing ressources          
  • 27. Conclusion Persistence of numpy arrays in a single file... ... and soon in file objects New compression formats: gzip, bz2, xz, lzma ... extend to faster implementations (blosc) ? Persists without memory copies ... manipulate bigger data New parallel backends: distributed, ipyparallel, YARN... ... new ones to come ?       Thanks !