Directly computing against public and
research cloud object stores
OCEANS AND ATMOSPHERE
Paul Branson | UWA/CSIRO Joint Post-doc
8 May 2019
Pangeo on HPC
• About me
• Disclaimer/Acknowledgements
• Examples of:
• Quick setup of Pangeo for HPC (dask-jobqueue)
• Intake-thredds to AODN THREDDS server, s3fs to AODN AWS S3 bucket
• Xarray with Geoviews+Holoviews for visualisation
• Converting netCDF to Zarr
• Benchmarks of various ways to compute against AODN data directly
• Doing some science (sort of!)
2 | Directly computing against public and research cloud object stores | Paul Branson
• Coastal Physical Oceanographer
• Lots of numerical modelling – results are typically dense nD-arrays in netCDF
format
• PhD at UWA studying shallow island wakes at laboratory scale
• Developer of 3DPIV code for scalable analysis on Pawsey
About me
• From 9 cameras and 140TB of images
• Each instantaneous velocity field (of up to 98,000 vectors) requires
approximately 4.9 million 3D FFTs (of ~150,000 points)
• 50 experiments of ~10,000 frames reduced to a 32 kB figure
Some serious dimensionality reduction
Branson PM, Ghisalberti M, Ivey GN (2018) Three-dimensionality of shallow island wakes, Journal of Environmental Fluid Mechanics
Branson PM, Ghisalberti M, Ivey GN, Hopfinger EJ (2019, accepted) Cylinder wakes in shallow oscillatory flow: the coastal island wake problem, Journal of Fluid Mechanics
• XArray – Labelled nDimensional arrays
• Dask – scaling out analysis of netCDF datafiles using
dask-jobqueue on Pawsey HPC
Useful for problems like tidal phase-aligning the results of experiments, subsetting and aggregating, and calculating differential quantities of vector fields,
and of course the rest of the Python data stack (NumPy, SciPy, Matplotlib)
And the final data analysis stage was made considerably easier
with Pangeo
Note: None of this is my work!
All self taught from the openness of the Pangeo community:
• Ryan Abernathey @rabernat
• Matthew Rocklin @mrocklin
• Joe Hamman @jhamman
• Stephan Hoyer @shoyer
• Martin Durant @martindurant
• Anderson Banihirwe @andersy005
• Scott Henderson @scottyhq
Acknowledgements
Pangeo Community pangeo.io github.com/pangeo-data
A more detailed guide is available here:
http://pangeo.io/setup_guides/hpc.html
TL;DR version:
git clone https://github.com/pbranson/c3dis-2019-pangeo.git
cd c3dis-2019-pangeo
conda env create -f environment.yaml
conda activate pyAODN
cp jobqueue.yaml ~/.config/dask/
sbatch start_pangeo.sh
ssh -N -l pbranson -L 8888:z043:8888 zeus.pawsey.org.au
Setup of dask-jobqueue on HPC
distributed:
  scheduler:
    work-stealing: True
    allowed-failures: 5
  worker:
    memory:
      target: 0.6      # target fraction of memory to stay below
      spill: 0.7       # fraction at which we spill to disk
      pause: 0.80      # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
jobqueue:
  slurm:
    cores: 10
    memory: 120GB
    processes: 5
    queue: workq
    project: pawsey0106
    walltime: 0-2:00:00
OR
jobqueue:
  slurm:
    cores: 4
    memory: 12GB
    processes: 2
    queue: workq
    project: pawsey0106
    walltime: 0-2:00:00
~/.config/dask/jobqueue.yaml
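With a jobqueue.yaml like the first one above (cores: 10, memory: 120GB, processes: 5), a cluster can be requested from a notebook. The following is a minimal sketch rather than the code used in the talk. It assumes recent dask-jobqueue and distributed packages, and the helper names jobs_for_workers, memory_per_worker_gb and start_cluster are hypothetical. With 5 worker processes per job, 4 SLURM jobs give 20 workers, which would be consistent with the "20 workers across 4 nodes" used for the benchmarks.

```python
import math


def jobs_for_workers(n_workers: int, processes_per_job: int) -> int:
    """Number of SLURM jobs needed to get at least n_workers Dask workers."""
    return math.ceil(n_workers / processes_per_job)


def memory_per_worker_gb(job_memory_gb: float, processes_per_job: int) -> float:
    """Each worker process receives an equal share of the job's memory."""
    return job_memory_gb / processes_per_job


def start_cluster(n_workers: int = 20, processes_per_job: int = 5):
    """Start a SLURM-backed Dask cluster using the defaults in jobqueue.yaml.

    Imports are deferred so the sizing helpers above work without Dask installed.
    """
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # cores, memory, processes, queue, project and walltime come from jobqueue.yaml
    cluster = SLURMCluster()
    # Submit enough SLURM jobs to reach the requested number of workers
    cluster.scale(jobs=jobs_for_workers(n_workers, processes_per_job))
    return Client(cluster)
```

Calling start_cluster() on a login or compute node would submit the jobs; workers join the scheduler as SLURM starts them.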
• Since a few weeks ago we can (thanks to @andersy005, @martindurant)
Intake-thredds
The to_dask() call is a bit of a hangover that needs refactoring
Intake-thredds – to Xarray
Access the underlying AODN AWS S3 bucket
Only working since a few weeks ago also… h5netcdf==0.7.1, s3fs==0.2.1
And open a netCDF file directly from S3 (also works for google)
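The code on this slide was not captured in the transcript, so the following is only a sketch of the general pattern, assuming s3fs, h5netcdf and xarray are installed. The function name open_netcdf_from_s3 is hypothetical. For Google Cloud Storage the analogous filesystem class is gcsfs.GCSFileSystem.

```python
def open_netcdf_from_s3(path: str, anon: bool = True):
    """Open one netCDF4/HDF5 object on S3 lazily as an xarray Dataset.

    path is "bucket/key"; anon=True means a public bucket, no credentials.
    """
    import s3fs
    import xarray as xr

    fs = s3fs.S3FileSystem(anon=anon)
    # File-like object that fetches byte ranges from S3 on demand
    fileobj = fs.open(path, mode="rb")
    # h5netcdf can read from a file-like object; chunks={} gives dask-backed arrays
    return xr.open_dataset(fileobj, engine="h5netcdf", chunks={})
```

Usage would look like ds = open_netcdf_from_s3("imos-data/<path to a file>.nc"), where the bucket name imos-data is an assumption about the AODN bucket and the object path is left as a placeholder.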
Viewing remote sensing data in the browser with
Holoviews+Geoviews
https://speakerdeck.com/rabernat/cloud-native-climate-data-with-zarr-and-xarray
Convert a collection of netCDFs into Zarr – What? Why?
Only working since a few weeks ago also… h5netcdf==0.7.1, s3fs==0.2.1
Example converting to Zarr
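The conversion code on this slide was not captured in the transcript. As a hedged sketch of one common approach with xarray (not necessarily the one shown), a collection of netCDF files can be opened as a single dataset, rechunked and written to a Zarr store. The function names and the chunk shape in the usage note are hypothetical, and float32 data is an assumption.

```python
from math import prod


def chunk_megabytes(chunk_shape, itemsize: int = 4) -> float:
    """Uncompressed size of one chunk in MB (itemsize=4 assumes float32)."""
    return prod(chunk_shape) * itemsize / 1e6


def netcdfs_to_zarr(paths, store: str, chunks: dict) -> str:
    """Combine netCDF files along their coordinates and write one Zarr store."""
    import xarray as xr

    ds = xr.open_mfdataset(paths, combine="by_coords")
    ds = ds.chunk(chunks)
    # Drop any Zarr chunk encoding carried over from the source (no-op if absent)
    for var in ds.variables.values():
        var.encoding.pop("chunks", None)
    ds.to_zarr(store, mode="w")
    return store
```

For example, netcdfs_to_zarr(paths, "out.zarr", {"time": 31, "lat": 500, "lon": 500}) would produce chunks of chunk_megabytes((31, 500, 500)) = 31.0 MB each, with the dimension names being placeholders for whatever the files use.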
• Benchmark by calculating the monthly mean of 12 months of daily Australia-wide
files (a reduction over a 365 x 10001 x 7001 element, ~100 GB array)
monthly_mean = ds_thredds.groupby('time.month').mean('time').compute()
1. netCDFs via THREDDS
2. netCDFs via S3 directly
3. netCDFs via Lustre filesystem
4. Zarr via Lustre
5. Zarr via S3
• Benchmarks conducted from Pawsey with 20 workers across 4 nodes
So let's test some of these things out!
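The array size quoted above can be checked with a few lines of arithmetic, and each access path can be timed with a small wrapper. This is a sketch: the 4-byte item size (float32) is an assumption, and timed is a hypothetical helper, not the harness used for the results.

```python
import time

SHAPE = (365, 10001, 7001)  # days x rows x columns, as quoted on the slide
ITEMSIZE = 4  # bytes per value; float32 is an assumption

n_elements = SHAPE[0] * SHAPE[1] * SHAPE[2]
n_gigabytes = n_elements * ITEMSIZE / 1e9


def timed(func, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


print(f"{n_elements:,} elements, about {n_gigabytes:.0f} GB uncompressed")
# prints: 25,556,205,365 elements, about 102 GB uncompressed
```

Each of the five cases could then be measured as monthly_mean, seconds = timed(lambda: ds.groupby('time.month').mean('time').compute()), with ds opened from the corresponding source.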
Results
• Stay tuned!
• On AARNet – presentation yesterday by Gavin Kennedy
• CloudStor Service – Re-engineering CloudStor for Infinite Scalability
• S3 service using minio server
• Scalable service available June 2019
• On Pawsey (from a little birdie)
• S3 compliant object store procurement commencing second half 2019
• But Pawsey/CSIRO have some phat pipes so you can work directly
on AWS S3….
Research cloud object stores
• Data volumes are going up exponentially
• Eventually there will be insufficient storage to keep mirrors for personal use
(I know of some datasets replicated 5 or 6 times on Pawsey)
• But bandwidth also keeps going up
• So computing against object stores from HPC seems viable… ONLY if
your datasets are in cloud-optimised format
• So in the context of FAIR data, it seems that netCDF (HDF), whilst the de facto
standard data format, fails the Interoperability criterion in practice
• https://github.com/pbranson/c3dis-2019-pangeo.git
Conclusions
Oceans and Atmosphere
Paul Branson
UWA/CSIRO Post-Doc
e paul.branson@csiro.au
Thank you
OCEANS AND ATMOSPHERE

Python analytics from HPC to public object stores

  • 1. Directly computing against public and research cloud object stores
      OCEANS AND ATMOSPHERE
      Paul Branson | UWA/CSIRO Joint Post-doc
      8 May 2019
  • 2. Pangeo on HPC
      • About me
      • Disclaimer/Acknowledgements
      • Examples of:
      • Quick setup of Pangeo for HPC (dask-jobqueue)
      • Intake-thredds to AODN THREDDS server, s3fs to AODN AWS S3 bucket
      • Xarray with Geoviews+Holoviews for visualisation
      • Converting netCDF to Zarr
      • Benchmarks of various ways to compute against AODN data directly
      • Doing some science (sort of!)
      2 | Directly computing against public and research cloud object stores | Paul Branson
  • 3. About me
      • Coastal Physical Oceanographer
      • Lots of numerical modelling – results are typically dense nD-arrays in netCDF format
      • PhD at UWA studying shallow island wakes at laboratory scale
      • Developer of 3DPIV code for scalable analysis on Pawsey
  • 4. Some serious dimensionality reduction
      • From 9 cameras and 140TB of images
      • Each instantaneous velocity field (of up to 98,000 vectors) requires approximately 4.9 million 3D FFTs (of ~150,000 points)
      • 50 experiments of ~10,000 frames to a 32kb figure
      Branson PM, Ghisalberti M, Ivey GN (2018) Three-dimensionality of shallow island wakes, Journal of Environmental Fluid Mechanics
      Branson PM, Ghisalberti M, Ivey GN, Hopfinger EJ (2019, accepted) Cylinder wakes in shallow oscillatory flow: the coastal island wake problem, Journal of Fluid Mechanics
  • 5. And the final data analysis stage was made considerably easier with Pangeo
      • Xarray – labelled n-dimensional arrays
      • Dask – scaling out analysis of netCDF data files using dask-jobqueue on Pawsey HPC
      Problems like tidal phase-aligning results of experiments, subsetting and aggregating, calculating differential quantities of vector fields, etc.
      And of course the rest of the Python data stack (NumPy, SciPy, Matplotlib)
  • 6. Acknowledgements
      Note: None of this is my work! All self-taught from the openness of the Pangeo community:
      • Ryan Abernathey @rabernat
      • Matthew Rocklin @mrocklin
      • Joe Hamman @jhamman
      • Stephan Hoyer @shoyer
      • Martin Durant @martindurant
      • Anderson Banihirwe @andersy005
      • Scott Henderson @scottyhq
      Pangeo Community: pangeo.io, github.com/pangeo-data
  • 7. Setup of dask-jobqueue on HPC
      A more detailed guide is available here: http://pangeo.io/setup_guides/hpc.html
      TL;DR version:
          git clone https://github.com/pbranson/c3dis-2019-pangeo.git
          cd c3dis-2019-pangeo
          conda env create -f environment.yaml
          conda activate pyAODN
          cp jobqueue.yaml ~/.config/dask/
          sbatch start_pangeo.sh
          ssh -N -l pbranson -L 8888:z043:8888 zeus.pawsey.org.au
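The last command on slide 7 opens an SSH port forward from the local machine, through the login node, to a compute node. As a rough sketch of how that command is put together, the hypothetical helper `tunnel_command` below (not part of the talk's repository) assembles the same argument list. The user, node and host values are the ones on the slide; that port 8888 serves a notebook started by `start_pangeo.sh` is an assumption.

```python
import shlex


def tunnel_command(user, compute_node, login_host, port=8888):
    """Build the ssh argument list for a local port forward (hypothetical helper)."""
    return [
        "ssh",
        "-N",                 # do not run a remote command, only forward ports
        "-l", user,           # login name on the remote host
        "-L", f"{port}:{compute_node}:{port}",  # local port -> compute node port
        login_host,           # login node that relays the connection
    ]


# Values taken from the slide; the compute node name will differ from job to job.
cmd = tunnel_command("pbranson", "z043", "zeus.pawsey.org.au")
print(shlex.join(cmd))
```

With the tunnel open, whatever listens on port 8888 of the compute node is reachable at `localhost:8888` on the local machine.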
  • 8. ~/.config/dask/jobqueue.yaml
      scheduler:
        work-stealing: True
        allowed-failures: 5
      distributed:
        worker:
          memory:
            target: 0.6 # Avoid spilling to disk
            spill: 0.7 # Avoid spilling to disk
            pause: 0.80 # fraction at which we pause worker threads
            terminate: 0.95 # fraction at which we terminate the worker
      jobqueue:
        slurm:
          cores: 10
          memory: 120GB
          processes: 5
          queue: workq
          project: pawsey0106
          walltime: 0-2:00:00
      OR
      jobqueue:
        slurm:
          cores: 4
          memory: 12GB
          processes: 2
          queue: workq
          project: pawsey0106
          walltime: 0-2:00:00
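In dask-jobqueue, `cores` and `memory` describe one whole batch job, and `processes` splits that job into worker processes. The sketch below works through that arithmetic in plain Python for the two `jobqueue.slurm` settings on slide 8, and converts the memory fractions into gigabytes per worker. The helper `per_worker` is hypothetical and only illustrates the arithmetic; it is not a dask-jobqueue function.

```python
def per_worker(cores, memory_gb, processes):
    """Split one SLURM job's resources across its Dask worker processes."""
    return {
        "threads": cores // processes,
        "memory_gb": memory_gb / processes,
    }


# Memory thresholds from the slide, as fractions of each worker's memory limit.
thresholds = {"target": 0.6, "spill": 0.7, "pause": 0.80, "terminate": 0.95}

# The two alternative jobqueue.slurm settings shown on the slide.
for cores, memory_gb, processes in [(10, 120, 5), (4, 12, 2)]:
    worker = per_worker(cores, memory_gb, processes)
    limits = {k: round(v * worker["memory_gb"], 2) for k, v in thresholds.items()}
    print(worker, limits)
```

Under the first setting each worker gets 2 threads and 24 GB; under the second, 2 threads and 6 GB.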
  • 9. Intake-thredds
      • Since a few weeks ago we can (thanks to @andersy005, @martindurant)
  • 10. Intake-thredds – to Xarray
      The to_dask() call is a bit of a hangover that needs refactoring
  • 11. Access the underlying AODN AWS S3 bucket
  • 12. And open a netCDF file directly from S3 (also works for Google)
      Only working since a few weeks ago also… h5netcdf==0.7.1, s3fs==0.2.1
  • 13. Viewing remote sensing data in the browser with Holoviews+Geoviews
  • 14. Convert a collection of netCDFs into Zarr – What? Why?
      https://speakerdeck.com/rabernat/cloud-native-climate-data-with-zarr-and-xarray
  • 15. Convert a collection of netCDFs into Zarr – What? Why?
      https://speakerdeck.com/rabernat/cloud-native-climate-data-with-zarr-and-xarray
  • 16. Example converting to Zarr
      Only working since a few weeks ago also… h5netcdf==0.7.1, s3fs==0.2.1
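One reason a chunked store suits object storage is that each chunk is stored as its own object, so a reader fetches only the chunks a selection touches. The sketch below is plain Python, not the Zarr library: the hypothetical helper `chunk_keys` lists the chunks overlapping a selection, written in the dotted `i.j.k` form that Zarr version 2 uses by default for chunk keys. The array shape is the 365 x 10001 x 7001 benchmark array from the talk; the chunk shape is an assumption chosen for illustration, not the chunking used in the talk.

```python
from itertools import product


def chunk_keys(selection, chunks):
    """List chunk keys overlapping a selection given as (start, stop) per axis."""
    ranges = []
    for (start, stop), size in zip(selection, chunks):
        first = start // size          # index of the chunk holding `start`
        last = (stop - 1) // size      # index of the chunk holding the last element
        ranges.append(range(first, last + 1))
    return [".".join(map(str, idx)) for idx in product(*ranges)]


shape = (365, 10001, 7001)   # time first, then two spatial axes (benchmark array)
chunks = (1, 2000, 2000)     # assumed chunk shape, for illustration only

# One day, and a 1000 x 1000 spatial window starting at row 1500, column 0.
keys = chunk_keys([(10, 11), (1500, 2500), (0, 1000)], chunks)
print(keys)

# Total number of chunk objects in the whole array (ceiling division per axis).
total = 1
for n, c in zip(shape, chunks):
    total *= -(-n // c)
print(total)
```

With this assumed chunking the window touches two chunk objects out of several thousand, which is the kind of partial read that makes computing against an object store practical.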
  • 17. So let's test some of these things out!
      • Benchmark by calculating the monthly mean of 12 months of daily Australia-wide files (reduction of a 365 x 10001 x 7001 element, 100GB array)
          monthly_mean = ds_thredds.groupby('time.month').mean('time').compute()
          1. netCDFs via THREDDS
          2. netCDFs via S3 directly
          3. netCDFs via Lustre filesystem
          4. Zarr via Lustre
          5. Zarr via S3
      • Benchmarks conducted from Pawsey with 20 workers across 4 nodes
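To get a feel for what the benchmark reduces, the sketch below uses only the Python standard library rather than Xarray or Dask. It first checks the quoted array size, assuming 4-byte (float32) values, which the slide does not state. It then mimics what `groupby('time.month').mean('time')` computes, using a tiny synthetic daily series for a single grid cell. The year 2018 and the synthetic values are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Size of the input array quoted on the slide.
n_elements = 365 * 10001 * 7001
size_gb = n_elements * 4 / 1e9        # assumes float32 (4 bytes per value)
print(f"{n_elements} elements, about {size_gb:.1f} GB")

# Synthetic daily series for one grid cell: the value is the day of the month.
start = date(2018, 1, 1)              # assumed year (365 days, not a leap year)
days = [start + timedelta(days=i) for i in range(365)]
values = [float(d.day) for d in days]

# Group by calendar month, then average within each group.
groups = defaultdict(list)
for d, v in zip(days, values):
    groups[d.month].append(v)
monthly_mean = {month: sum(vs) / len(vs) for month, vs in sorted(groups.items())}

print(len(monthly_mean))              # one value per calendar month
print(monthly_mean[2])                # February: mean of the days 1..28
```

The real benchmark applies the same grouping to every grid cell at once, so the result keeps the two spatial axes and has 12 entries along time instead of 365.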
  • 18. Results
  • 19. Research cloud object stores
      • Stay tuned!
      • On AARNET – presentation yesterday by Gavin Kennedy
          • CloudStor Service – Re-engineering CloudStor for Infinite Scalability
          • S3 service using minio server
          • Scalable service available June 2019
      • On Pawsey (from a little birdie)
          • S3 compliant object store procurement commencing second half 2019
      • But Pawsey/CSIRO have some phat pipes so you can work directly on AWS S3….
  • 20. Conclusions
      • Data volumes are going up exponentially
      • Eventually there will be insufficient storage to make a mirror for personal use (I know of some datasets replicated 5 or 6 times on Pawsey)
      • But bandwidth also keeps going up
      • So computing against object stores from HPC seems viable… ONLY if your datasets are in a cloud-optimised format
      • So in the context of FAIR data, it seems that netCDF (HDF), whilst the de facto standard data format, fails the Interoperability criterion in practice
      • https://github.com/pbranson/c3dis-2019-pangeo.git
  • 21. Thank you
      Paul Branson, UWA/CSIRO Post-Doc, Oceans and Atmosphere
      e: paul.branson@csiro.au