Deep learning algorithms have benefited greatly from the recent performance gains of GPUs. However, it has been unclear whether GPUs can speed up machine learning algorithms such as generalized linear modeling, random forests, gradient boosting machines, and clustering. H2O.ai, the leading open source AI company, is bringing the best-of-breed data science and machine learning algorithms to GPUs.
We introduce H2O4GPU, a fully featured machine learning library optimized for GPUs, with a robust Python API that is a drop-in replacement for scikit-learn. We'll demonstrate benchmarks for the most common algorithms relevant to enterprise AI and showcase the performance gains compared to running on CPUs.
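As a rough illustration of that drop-in claim, here is a minimal sketch of the scikit-learn-style usage the library advertises; the toy data and hyperparameters are ours, not from the talk:

import h2o4gpu
import numpy as np

# Tiny toy dataset; in practice you would pass a large matrix to see GPU gains.
X = np.array([[1.0, 1.0], [1.0, 4.0], [1.0, 0.0], [4.0, 4.0]])

# Same estimator interface as sklearn.cluster.KMeans, but GPU-backed.
model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)
print(model.cluster_centers_)
print(model.predict(X))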
Jon’s Bio:
https://umdphysics.umd.edu/people/faculty/current/item/337-jcm.html
Please view the video here:
Hadoop 2.0, and in particular YARN, has opened up many potential applications beyond MapReduce. This presentation explains some of the ways this happened, and what you can now do that you couldn't before. It also introduces some newer tools (Spark) and infrastructure pieces (Mesos) that enable even more efficient cluster use.
Aleksei Udatšnõi – Crunching thousands of events per second in nearly real time (NoSQLmatters)
Imagine you have a product which generates up to 10 thousand events per second, or around 1 billion events per day. This live stream of data needs to be tracked, processed, and presented to end users in a visually appealing way, and the solution needs to be integrated into a traditional web application. That is a real use case at Softonic. In this talk we will show how it was solved at Softonic, using the stack of technologies around Big Data to process and store the live stream of data and present the results to users in near real time. This real-life solution is built around the Hadoop ecosystem and includes Flume, Hive, Oozie, and Impala. We will show how to store and query such volumes of data using a NoSQL database and how to build a scalable end-user web application on a near-real-time data feed.
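For flavour, here is a minimal sketch of the kind of near-real-time query an application tier might run against Impala from Python using impyla; the host, table, and columns are hypothetical:

from impala.dbapi import connect

# Connect to an Impala daemon (hypothetical host; 21050 is Impala's default HiveServer2 port).
conn = connect(host='impala-host.example.com', port=21050)
cursor = conn.cursor()

# Aggregate the last hour of events for a dashboard (hypothetical schema with epoch timestamps).
cursor.execute("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    WHERE event_time > unix_timestamp(now()) - 3600
    GROUP BY event_type
""")
for event_type, n in cursor.fetchall():
    print(event_type, n)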
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose (Allen Day, PhD)
Architecting R into the Storm Application Development Process
~~~~~
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business-critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
In this presentation, Allen will build a bridge from basic real-time business goals to the technical design of solutions. We will take an example of a real-world use case, compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution.
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FOSS4G (Rob Emanuele)
Slides from the 2017 FOSS4G Workshop "Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS"
See the repository at https://github.com/lossyrob/foss4g-2017-geopyspark-workshop
Mining and Managing Large-scale Linked Open Data (MOVING Project)
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not impose a fixed, pre-defined schema; rather, it allows the data schema to be modeled flexibly by attaching RDF types and properties to entities. Our schema-level index, SchemEX, allows searching in large-scale RDF graph data. The index can be computed efficiently, with reasonable accuracy, over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. Such an index is badly needed as the size of the LOD cloud grows rapidly. As the LOD cloud evolves, one observes frequent changes in the data, and we show that the data schema also changes in terms of the combinations of RDF types and properties. Since individual snapshots cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud, with about 100 million triples per week for more than three years.
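To make the idea of a schema-level index concrete, here is a minimal rdflib sketch that groups entities by the combination of their RDF types and properties, which is the kind of "type cluster" key SchemEX builds on; the input file is a placeholder, and this is our simplification, not the SchemEX implementation:

from collections import defaultdict
from rdflib import Graph, RDF

g = Graph()
g.parse("data.nt", format="nt")  # placeholder N-Triples file

schema_index = defaultdict(set)
for subject in set(g.subjects()):
    types = frozenset(g.objects(subject, RDF.type))
    properties = frozenset(p for p in g.predicates(subject, None) if p != RDF.type)
    # Entities sharing the same (types, properties) combination land in one bucket.
    schema_index[(types, properties)].add(subject)

print(len(schema_index), "distinct schema combinations")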
CERN, the vast underground laboratory under the mountains bordering France and Switzerland, has taken on some huge challenges – but few could be larger than storing all the data it has picked up.
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko... (Spark Summit)
The talk will present an MPI-based extension of the Spark platform developed in the context of light source facilities. The background and rationale of this extension are described in the attached paper “Bringing the HPC reconstruction algorithms to Big Data platforms” [1], which was presented at the New York Scientific Data Summit (NYSDS), August 14-17, 2016 (talk: https://www.bnl.gov/nysds16/files/pdf/talks/NYSDS16%20Malitsky.pdf). Specifically, the paper highlighted a gap between two modern driving forces of the scientific discovery process: HPC and Big Data technologies. As a result, it proposed to extend the Spark platform with inter-worker communication to support scientific parallel applications. The approach was illustrated in the context of the Spark-based deployment of the SHARP MPI/GPU ptychographic solver. Aside from its practical value, this application represents a reference use case that captures the major technical aspects of other reconstruction tasks. In the NYSDS'16 paper, the implemented approach followed the CaffeOnSpark RDMA peer-to-peer model and augmented it with an RDMA address exchange server. By the Spark Summit, we plan to advance this direction further with a generic Spark-MPI solution based on the Hydra process management framework, supporting the two major MPI implementations, MPICH and MVAPICH.
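The inter-worker communication Spark lacks is exactly what MPI collectives provide. As a point of reference, here is a minimal mpi4py sketch of an allreduce, the pattern Spark-MPI aims to make available between workers; this is plain MPI for illustration, not the Spark-MPI API:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(4, rank, dtype="float64")  # each worker's partial result
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)   # combine across workers without a driver round-trip
print(rank, total)

Run with, e.g., mpiexec -n 4 python allreduce_demo.py.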
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics (Data Con LA)
This talk provides an overview of the open source Storm system for processing Big Data in real time. The talk starts with an overview of the technology, including key components: Nimbus, Zookeeper, Topology, Tuple, and Trident. It looks at integration with Hadoop through YARN and recent improvements. The presentation then dives into the complex Big Data architectures in which Storm can be integrated. The result is a compelling stack of technologies including integrated Hadoop clusters, MPP, and NoSQL databases.
After this, we look at example use cases for Storm: real-time advertising statistics, updating a machine-learned model for content popularity predictions, and financial compliance monitoring.
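To show the shape of such a topology component, here is a minimal Python sketch of a Storm bolt using the streamparse library, maintaining running ad-impression counts in the spirit of the advertising-statistics use case; the stream fields and topology wiring are our assumptions, not taken from the talk:

from collections import Counter
from streamparse import Bolt

class AdStatsBolt(Bolt):
    def initialize(self, storm_conf, context):
        self.counts = Counter()

    def process(self, tup):
        campaign_id = tup.values[0]  # assumes the upstream spout emits (campaign_id,)
        self.counts[campaign_id] += 1
        # Emit the updated running count downstream.
        self.emit([campaign_id, self.counts[campaign_id]])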
Dask Tutorial at PyConDE / PyData Karlsruhe 2018. These were the introductory slides, which mainly contain the link to Matthew Rocklin's Dask workshop at PyData NYC 2018, on which this workshop was based.
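For readers new to Dask, a small self-contained example in the spirit of the tutorial: build a lazy task graph over a chunked array, then compute it in parallel (the array and operation are illustrative):

import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))  # 100 chunks
result = (x + x.T).mean(axis=0)  # lazy; nothing is computed yet
print(result.compute())          # executes the task graph across threads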
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low-latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl's extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
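To give a feel for what a PageRank step looks like as a MapReduce job, here is a schematic mrjob sketch of a single iteration; the input format, damping factor, and the omission of adjacency re-emission are our simplifications, not CommonCrawl's actual implementation:

from mrjob.job import MRJob

DAMPING = 0.85

class MRPageRankStep(MRJob):
    # Expected input line: "url rank outlink1,outlink2,..."
    def mapper(self, _, line):
        url, rank, outlinks = line.split()
        outlinks = outlinks.split(",")
        for link in outlinks:
            yield link, float(rank) / len(outlinks)  # share rank across outlinks
        yield url, 0.0  # keep the node alive even if nothing links to it

    def reducer(self, url, contributions):
        yield url, (1 - DAMPING) + DAMPING * sum(contributions)

if __name__ == "__main__":
    MRPageRankStep.run()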
The next generation of the Montage image mosaic engine (G. Bruce Berriman)
Presentation given by Bruce Berriman at the Astronomical Data Analysis Software & Systems XXV (ADASS XXV) Conference, Sydney, Australia, October 29, 2015.
Authors: G. B. Berriman, J.C. Good, B. Rusholme, T. Robitaille.
Real-Time Big Data at In-Memory Speed, Using Storm (Nati Shalom)
Storm, a popular framework from Twitter, is used for real-time event processing. The challenge presented is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner.
This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for integrating the data grid with Hadoop and Cassandra seamlessly. By achieving smooth integration with consistent management, you will be able to easily manage all the tiers of your Big Data stack in a consistent and effective way.
See more at: http://nosql2013.dataversity.net/sessionPop.cfm?confid=74&proposalid=5526
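As one concrete piece of such a stack, here is a minimal sketch of persisting stream state to Cassandra from a processing tier using the Python cassandra-driver; the keyspace, table, and schema are hypothetical:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("streaming")  # assumes a keyspace named "streaming" exists

# Prepared statement for repeated, low-latency writes from the stream.
insert = session.prepare(
    "INSERT INTO event_counts (event_type, window_start, count) VALUES (?, ?, ?)"
)

def flush_window(event_type, window_start, count):
    session.execute(insert, (event_type, window_start, count))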
Using BigBench to compare Hive and Spark (short version) (Nicolas Poggi)
BigBench is the brand new standard for benchmarking and testing Big Data systems. This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in their respective versions 1 and 2, under different configurations. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.
Space debris consists of defunct objects in space, including old space vehicles and fragments from collisions. Space debris can cause great damage to functioning spacecraft and satellites, so detecting debris and predicting its orbital path are essential. The talk shows a Python-based infrastructure for storing space debris data from sensors and for high-throughput processing of that data.
PyData Seattle (July 26, 2015)
http://seattle.pydata.org/schedule/presentation/35/
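As a small taste of the orbit-prediction side, here is a minimal sketch with the sgp4 package: propagate one object from a two-line element set to a position and velocity at a given time. The TLE below (the ISS) is only a stand-in; a debris-tracking pipeline would do this at scale for sensor-derived element sets:

from sgp4.api import Satrec, jday

line1 = "1 25544U 98067A   19343.69339541  .00001764  00000-0  40967-4 0  9997"
line2 = "2 25544  51.6440 211.2001 0007417  17.6667  85.6398 15.50103472202482"

sat = Satrec.twoline2rv(line1, line2)
jd, fr = jday(2019, 12, 9, 12, 0, 0)      # Julian date for 2019-12-09 12:00 UTC
error, position_km, velocity_kms = sat.sgp4(jd, fr)
print(error, position_km, velocity_kms)   # error == 0 means success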
This is a slide deck that I have been using to present GeoTrellis at various meetings and workshops. The information speaks to GeoTrellis pre-1.0, as of Q4 2016.
Geospatial Sensor Networks and Partitioning Data (AlexMiowski)
We use resources like weather reports and air quality measurements to navigate the world. These resources become especially important when we face extreme events like the current wildfires in the western USA. The data for the reports, predictions, and maps all start as real-time sensor networks.
In this presentation, I look at some of my research into scientific data representation on the Web and how the key mechanism is the partitioning, annotation, and naming of data representations. We’ll take a look at a few examples, including some recent work on air quality data relating to the current wildfires in the western USA. We’ll explore the central question of how geospatial sensor network data can be collected and consumed within K8s deployments.
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St... (t_ivanov)
Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to a 60.2% improvement with ORC, while Parquet achieves up to a 7% improvement with Snappy. The exceptions are queries involving text processing, which do not benefit from compression.
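To make the format/compression pairings concrete, here is a small pyarrow sketch that writes the same table as Parquet with Snappy and as ORC with ZLIB (the study measures engine-side query performance on such files; this only shows how the files themselves are produced, and assumes a recent pyarrow with ORC write support):

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

table = pa.table({"id": list(range(1000)), "payload": ["x" * 100] * 1000})

pq.write_table(table, "data.snappy.parquet", compression="snappy")
orc.write_table(table, "data.zlib.orc", compression="zlib")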
An efficient data mining solution by integrating Spark and Cassandra (Stratio)
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than using Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. The goal with Cassandra is to have a system that mines all the information stored in C* much more efficiently than if the information were stored in HDFS. Cassandra's data storage and Spark's data mining power: an unrivalled mix.
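A minimal PySpark sketch of reading a Cassandra table through the spark-cassandra-connector (the connector package must be on the Spark classpath); the keyspace and table names are hypothetical:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cassandra-mining")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")
    .load()
)
df.groupBy("event_type").count().show()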
Managing data analytics in a hybrid cloud (Karan Singh)
We'll talk about the changes in the industry that customers are faced with and how Red Hat Hyperconverged Infrastructure can address those challenges. Our customers are struggling not only to manage the growth of big data (structured and unstructured), but also to reap timely business insights from it using their existing data infrastructure, such as monolithic Hadoop clusters. This often pushes them toward alternative approaches that lead to disappointing results.
Hadoop Administrator online training course by Knowledgebee Trainings, covering mastery of the Hadoop cluster: planning and deployment, monitoring, performance tuning, security using Kerberos, HDFS High Availability using Quorum Journal Manager (QJM), and Oozie and HCatalog/Hive administration.
Contact : knowledgebee@beenovo.com
YARN webinar series: Using Scalding to write applications for Hadoop and YARN (Hortonworks)
This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... (MongoDB)
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
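A minimal pymongo sketch of that secondary-index pattern: store a small lookup document per record that points back to its location in HDFS, and index the lookup key. Field names and paths here are illustrative, not from the talk:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
idx = client.analytics.visitor_index

idx.create_index([("visitor_id", ASCENDING)])  # enables low-latency point lookups

idx.insert_one({
    "visitor_id": "v-12345",
    "location": {"city": "Portland", "country": "US"},
    "hdfs_path": "/data/omniture/2013/10/part-00042",  # where the full record lives
    "offset": 1048576,
})

print(idx.find_one({"visitor_id": "v-12345"})["hdfs_path"])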
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less... (Amazon Web Services)
The PanCancer Analysis of Whole Genomes (PCAWG) project is a large-scale, highly distributed research collaboration designed to identify common patterns of mutations across 2,800 cancer genomes. Public and private clouds were instrumental in analyzing this dataset using current best-practice containerized pipelines. This session describes the technical infrastructure built for the project, how we leveraged cloud environments to perform the “core” analysis, and the lessons learned along the way.
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME (Safe Software)
Following the popularity of “Cloud Revolution: Exploring the New Wave of Serverless Spatial Data,” we’re thrilled to announce this much-anticipated encore webinar.
In this sequel, we’ll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR.
Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios.
Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects.
Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you’re building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.
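For orientation, here is a small Python sketch of reading two of the cloud-native formats mentioned, straight from object storage: GeoParquet with geopandas and Zarr with the zarr library (both assume s3fs is installed; bucket names, keys, and variable names are placeholders):

import geopandas as gpd
import zarr

# GeoParquet: columnar vector data, readable without downloading whole files.
buildings = gpd.read_parquet("s3://example-bucket/buildings.parquet")
print(buildings.crs, len(buildings))

# Zarr: chunked n-dimensional arrays addressed directly in the object store.
store = zarr.open("s3://example-bucket/temperature.zarr", mode="r")
print(store["t2m"][0, :10, :10])  # reads only the chunks covering this slice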
Paris Spark Meetup (Feb 2015) ccarbone: SPARK Streaming vs Storm / MLLib / Ne... (Cedric CARBONE)
Presentation of the Spark technology and examples of new business use cases that real-time Big Data can address, by Cédric Carbone:
-Spark vs Hadoop MapReduce (& Hadoop v2 vs Hadoop v1)
-Spark Streaming vs Storm (see the sketch after this list)
-Machine Learning with Spark
-Business use case: NextProductToBuy
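For the "Spark Streaming vs Storm" item above, here is a minimal PySpark DStream sketch of the Spark Streaming side of the comparison: micro-batched counts over a socket source (host and port are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()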
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there's a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown by tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data (see the worked example after this list).
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
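As a self-contained taste of the basic-to-advanced progression described above, here is a runnable example using Python's built-in sqlite3: a simple aggregate, then a window function (sample data is illustrative; window functions require SQLite 3.25+):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', '2024-01', 100), ('north', '2024-02', 150),
        ('south', '2024-01', 80),  ('south', '2024-02', 120);
""")

# Basic: filtering and aggregation.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)

# Advanced: running total per region with a window function.
query = """
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
"""
for row in conn.execute(query):
    print(row)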
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives (sumAt, multiply) in sequential mode.
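For reference, here is a sequential Python sketch of the computation those primitives implement (rank-contribution accumulation and the damping update); the graph, damping factor, and tolerance are illustrative, and dead-end handling is omitted:

def pagerank(out_links, damping=0.85, tol=1e-10):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    while True:
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in out_links.items():
            share = damping * rank[v] / len(outs) if outs else 0.0
            for u in outs:
                new_rank[u] += share  # the "sumAt"-style accumulation
        if sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol:
            return new_rank
        rank = new_rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))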
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method, where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by submitting a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
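A minimal networkx sketch of the decomposition Levelwise PageRank relies on: condense the graph into its strongly connected components, then walk the component DAG level by level in topological order (the graph is illustrative; rank computation per level is omitted):

import networkx as nx

g = nx.DiGraph([(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (4, 5)])

cond = nx.condensation(g)  # DAG of strongly connected components
levels = list(nx.topological_generations(cond))

for level, comps in enumerate(levels):
    for c in comps:
        members = cond.nodes[c]["members"]  # original vertices in this SCC
        print(f"level {level}: component {sorted(members)}")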
1. Directly computing against public and research cloud object stores
OCEANS AND ATMOSPHERE
Paul Branson | UWA/CSIRO Joint Post-doc
8 May 2019
2. Pangeo on HPC
• About me
• Disclaimer/Acknowledgements
• Examples of:
• Quick setup of Pangeo for HPC (dask-jobqueue)
• Intake-thredds to AODN THREDDS server, s3fs to AODN AWS S3 bucket
• Xarray with Geoviews+Holoviews for visualisation
• Converting netCDF to Zarr
• Benchmarks of various ways to compute against AODN data directly
• Doing some science (sort of!)
3. About me
• Coastal Physical Oceanographer
• Lots of numerical modelling – results are typically dense nD-arrays in netCDF format
• PhD at UWA studying shallow island wakes at laboratory scale
• Developer of 3D PIV code for scalable analysis on Pawsey
4. Some serious dimensionality reduction
• From 9 cameras and 140TB of images
• Each instantaneous velocity field (of up to 98,000 vectors) requires approximately 4.9 million 3D FFTs (of ~150,000 points)
• 50 experiments of ~10,000 frames to a 32kb figure
Branson PM, Ghisalberti M, Ivey GN (2018) Three-dimensionality of shallow island wakes, Journal of Environmental Fluid Mechanics
Branson PM, Ghisalberti M, Ivey GN, Hopfinger EJ (2019, accepted) Cylinder wakes in shallow oscillatory flow: the coastal island wake problem, Journal of Fluid Mechanics
5. • XArray – labelled nDimensional arrays
• Dask – scaling out analysis of netCDF datafiles using dask-jobqueue on Pawsey HPC
Problems like tidal phase-aligning results of experiments, subsetting and aggregating, calculating differential quantities of vector fields, etc.
...and of course the rest of the Python data stack (Numpy, Scipy, Matplotlib).
And the final data analysis stage was made considerably easier with Pangeo.
6. Acknowledgements
Note: None of this is my work! All self taught from the openness of the Pangeo community:
• Ryan Abernathey @rabernat
• Matthew Rocklin @mrocklin
• Joe Hamman @jhamman
• Stephan Hoyer @shoyer
• Martin Durant @martindurant
• Anderson Banihirwe @andersy005
• Scott Henderson @scottyhq
Pangeo Community: pangeo.io | github.com/pangeo-data
7. Setup of dask-jobqueue on HPC
A more detailed guide is available here: http://pangeo.io/setup_guides/hpc.html
TL;DR version:
git clone https://github.com/pbranson/c3dis-2019-pangeo.git
cd c3dis-2019-pangeo
conda env create -f environment.yaml
conda activate pyAODN
cp jobqueue.yaml ~/.config/dask/
sbatch start_pangeo.sh
ssh -N -l pbranson -L 8888:z043:8888 zeus.pawsey.org.au
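For comparison, a minimal sketch of doing the same from Python with dask-jobqueue's SLURMCluster; the values mirror the jobqueue.yaml on the next slide, and the `project` keyword matches the 2019-era dask-jobqueue API (newer releases call it `account`):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=10, processes=5, memory="120GB",
    queue="workq", project="pawsey0106", walltime="0-2:00:00",
)
cluster.scale(jobs=4)      # submit 4 SLURM jobs: 20 workers in total
client = Client(cluster)   # dask computations now run on the allocation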
8. ~/.config/dask/jobqueue.yaml
distributed:
  scheduler:
    work-stealing: True
    allowed-failures: 5
  worker:
    memory:
      target: 0.6      # Avoid spilling to disk
      spill: 0.7       # Avoid spilling to disk
      pause: 0.80      # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
jobqueue:
  slurm:
    cores: 10
    memory: 120GB
    processes: 5
    queue: workq
    project: pawsey0106
    walltime: 0-2:00:00
OR
jobqueue:
  slurm:
    cores: 4
    memory: 12GB
    processes: 2
    queue: workq
    project: pawsey0106
    walltime: 0-2:00:00
9. Intake-thredds
• Since a few weeks ago we can (thanks to @andersy005, @martindurant)
10. Intake-thredds – to Xarray
(the to_dask() is a bit of a hangover that needs refactoring)
11. Access the underlying AODN AWS S3 bucket
12. And open a netCDF file directly from S3 (also works for Google)
Only working since a few weeks ago: h5netcdf==0.7.1, s3fs==0.2.1
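A minimal sketch of what slide 12 shows: open a netCDF file straight from S3 with s3fs plus xarray's h5netcdf engine. The object key is a placeholder for an AODN path:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)  # AODN data is publicly readable
with fs.open("imos-data/some/path/file.nc") as f:  # placeholder key
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)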
13. Viewing remote sensing data in the browser with Holoviews+Geoviews
16. Example converting to Zarr
Only working since a few weeks ago: h5netcdf==0.7.1, s3fs==0.2.1
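A minimal sketch of the netCDF-to-Zarr conversion shown on slide 16: write an xarray dataset to a Zarr store on S3 via an s3fs mapping. Paths, bucket, and chunking are placeholders:

import s3fs
import xarray as xr

ds = xr.open_dataset("local_copy.nc")                # placeholder source file
fs = s3fs.S3FileSystem()                             # credentials from the environment
store = s3fs.S3Map("my-bucket/dataset.zarr", s3=fs)  # placeholder bucket
ds.chunk({"time": 50}).to_zarr(store, mode="w")      # chunked, cloud-optimised layout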
17. So lets test some of these things out!
• Benchmark by calculating the monthly mean of 12 months of daily Australia-wide files (a reduction of a 365 x 10001 x 7001 element, 100GB array):
monthly_mean = ds_thredds.groupby('time.month').mean('time').compute()
1. netCDFs via THREDDS
2. netCDFs via S3 directly
3. netCDFs via Lustre filesystem
4. Zarr via Lustre
5. Zarr via S3
• Benchmarks conducted from Pawsey with 20 workers across 4 nodes
18. Results
19. Research cloud object stores
• Stay tuned!
• On AARNET – presentation yesterday by Gavin Kennedy
• CloudStor Service – Re-engineering CloudStor for Infinite Scalability
• S3 service using minio server
• Scalable service available June 2019
• On Pawsey (from a little birdie)
• S3-compliant object store procurement commencing second half of 2019
• But Pawsey/CSIRO have some phat pipes so you can work directly on AWS S3…
20. Conclusions
• Data volumes are going up exponentially
• Eventually there is insufficient storage to mirror datasets for personal use (I know of some datasets replicated 5/6 times on Pawsey)
• But bandwidth also keeps going up
• So computing against object stores from HPC seems viable… ONLY if your datasets are in a cloud-optimised format
• So in the context of FAIR data, it seems that the de facto standard of netCDF (HDF) as a data format fails the Interoperability criterion in practice
• https://github.com/pbranson/c3dis-2019-pangeo.git
21. Thank you
OCEANS AND ATMOSPHERE
Paul Branson | UWA/CSIRO Post-Doc
e: paul.branson@csiro.au