Python for High Throughput Science
Mark Basham
Scientific Software Group
Diamond Light Source Ltd, UK.
Overview
• What is Diamond Light Source
• Big Data?
• Python for scientists
• Python for developers
Diamond Light Source
What do I do?
• Provide data analysis for use during and after beamtime for users
– Users may or may not have any prior experience.
– ~30 beamlines with over 100 techniques used.
• With 12 other full-time developers
Where it all started
Acquisition (GDA, www.opengda.org):
• Client-server technology
• Communication with EPICS and hardware
• Scan mechanism
• Jython and Python
• Visualisation
• Communication with external analysis
• Analysis tools
• All core technologies open source
• 1.0 release 2002
• 3.0 release 2004 – Jython introduced as scripting language
Beamline setup and data collection speed increased.
Universal Data Problem
Detector History at DLS
• Early 2007:
– Diamond first user.
– No detector faster than ~10 MB/s.
• Early 2009:
– First Lustre system (DDN S2A9900).
– First Pilatus 6M system @ 60 MB/s.
• Early 2011:
– Second Lustre system (DDN SFA10K).
– First 25 Hz Pilatus 6M system @ 150 MB/s.
• Early 2013:
– First GPFS system (DDN SFA12K).
– First 100 Hz Pilatus 6M system @ 600 MB/s.
– ~10 beamlines with 10 GbE detectors (mainly Pilatus and PCO Edge).
• Early 2015:
– Delivery of the Percival detector (6000 MB/s).
Per Beamline Data Rates
(Figure: beamlines grouped by data rate – < 100 GB/day, < 1 TB/day, > 1 TB/day.)
Data Storage
● ~1PB of Lustre
● ~1PB of GPFS
● ~0.5PB of on-line archive
● ~1PB near-line archive
– >200M files
High performance parallel file systems HATE lots of small files.
Small Data Rate Beamlines (Variety)
“I have all the data I have ever
collected on a floppy disk and
process it by hand…”
Principal beamline scientist, when asked about data volumes in 2005
“I have all the data I have ever
collected on a floppy disk and
process it by hand…”
~1 TB so far this year
Processing Data (Variety)
• Experimental work requires exploring
– Matlab
– IDL
– IgorPro
– Excel
– Origin
– Mathematica
Processing Playing with Data (Variety)
• Experimental work requires exploring
– Matlab
– IDL
– IgorPro
– Excel
– Origin
– Mathematica
• The issue is scalability – at all, and at a reasonable price
Clusters (Velocity)
● 132 Intel-based nodes, 1280 Intel cores in service.
● 80 NVIDIA GPGPUs, 23,328 GPU cores in service.
● Split across 6 clusters, with a range of capabilities.
● Mostly used by the MX and tomography beamlines.
● All accessed via a Sun Grid Engine interface.
Python is the Obvious answer
• Users have used it during their beam times.
• Free and easily distributable.
• ...
• BUT – how to give it to them in a way they understand.
Extending the Acquisition tools
Acquisition (GDA, www.opengda.org):
• Client-server technology
• Communication with EPICS and hardware
• Scan mechanism
• Jython and Python
• Visualisation
• Communication with external analysis
• Analysis tools
Analysis (DAWN, www.dawnsci.org):
• Data read, write, convert
• Metadata structure
• Workflows
All core technologies open source.
DAWN is a collection of generic and bespoke 'views' collated into 'perspectives'. The perspectives and views can be used in part or whole in either the GDA or DAWN.
Main Dawn Elements for Python
• Python/Jython – PyDev scripting, IPython console
• Data exploring – scisoftpy module, HDF5 visualisation
• Workflow – Python actor
www.dawnsci.org
Scisoftpy plotting
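As a flavour of what working in the scisoftpy console looks like, here is a minimal sketch using the NumPy-like dnp namespace and its plot sub-module; the exact call signatures are an assumption and the data is generated on the spot purely for illustration.

    # Minimal sketch of console plotting with scisoftpy; signatures assumed from
    # the numpy-like dnp API, data invented for illustration.
    import scisoftpy as dnp

    x = dnp.linspace(-3, 3, 256)
    dnp.plot.line(x, dnp.exp(-x**2))       # 1D line plot in the default DAWN plot view

    image = dnp.zeros((256, 256))
    image[100:156, 100:156] = 1.0          # a bright square, just something to look at
    dnp.plot.image(image)                  # 2D image plot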
Interactive console
• Run from the command line
• Real-time variable view
• IPython interface
• Integrated debugging
Scripting tools
• Breakpoints and step-by-step debugging
• Interact with the interpreter while paused
Python @ Diamond
• Anaconda
– NumPy
– SciPy
– h5py
– mpi4py
– web services
• Astra (tomography)
• FFTW (ptychography)
• CCTBX (crystallography)
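Most of that stack appears together in even simple beamline scripts. The fragment below is a hypothetical example of reading one detector frame from an HDF5/NeXus scan file with h5py and summing it with NumPy; the file name and dataset path are placeholders, not a real Diamond layout.

    # Minimal sketch: h5py + numpy on a (hypothetical) NeXus/HDF5 scan file.
    import h5py
    import numpy as np

    with h5py.File('scan_0001.nxs', 'r') as f:      # placeholder file name
        frames = f['/entry1/detector/data']          # placeholder dataset path
        first = frames[0, :, :]                      # read only the first frame
        print('frame shape:', first.shape,
              'total counts:', np.sum(first, dtype=np.int64))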
Processing Playing with Data (Variety)
• Experimental work requires exploring
– Python
• Scientific Software team
– Modules for easy access and common tasks
– Repositories and training
Aside – Python for Optimization
• We produce a very fast beam of electrons (99.999999% of the speed of light)
• We oscillate this beam between magnet arrays called Insertion Devices (IDs) to make lots of light
Insertion Devices (IDs, ~600 magnets)
Individual Magnet (~800)
(Diagram: a unique magnet in its magnet holder, with x, y, z axes.)
          X      Y      Z
Perfect   1.0    0.0    0.0
Real      1.08   0.02  -0.04
Simple Optimisation Problem
• From 800 magnets, pick 600 of them in the right order so that they appear to be a perfect array.
• But we already have code in Fortran
– Bit hard to use
– Not that extensible to new systems
Objective Functions
• Slower in Python than Fortran
– Original code ~1,000 times slower
– NumPy array optimised ~10 times slower
• Python improvements
– Caching ~ matched the speed
– Clever updating ~100 times faster
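To illustrate why the NumPy rewrite alone buys roughly an order of magnitude, here is a toy objective in the same spirit: a candidate ordering of magnets is scored by how far the running field error drifts from zero. The field model and all names are invented for illustration; this is not the OptID code.

    # Toy objective: loop version vs vectorised NumPy version (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    errors = rng.normal(0.0, 0.05, size=(800, 3))   # per-magnet (x, y, z) field errors

    def objective_loop(order):
        # naive Python loop, roughly the style of a line-by-line port
        total, worst = [0.0, 0.0, 0.0], 0.0
        for idx in order:
            for k in range(3):
                total[k] += errors[idx, k]
            worst = max(worst, abs(total[0]) + abs(total[1]) + abs(total[2]))
        return worst

    def objective_numpy(order):
        # vectorised: cumulative field error along the array in one pass
        running = np.cumsum(errors[order], axis=0)
        return np.abs(running).sum(axis=1).max()

    order = rng.permutation(800)[:600]              # pick 600 of the 800 magnets
    assert np.isclose(objective_loop(order), objective_numpy(order))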
OptID
• Artificial Immune Systems
– Global optimiser
– Needs more evaluations
• Parallelisation
– Threading with NumPy to use all the processors
– mpi4py for data transfer and making use of the cluster
• Running on 25 machines, 200 CPUs
• First sort with the new code has been built
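A hedged sketch of the mpi4py pattern described above: rank 0 scatters batches of candidate orderings, every rank scores its own batch, and the scores are gathered back. The batch layout and the dummy objective are placeholders, not the actual OptID implementation.

    # Sketch of the mpi4py work-sharing pattern (e.g. mpirun -n 4 python optid_sketch.py).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        rng = np.random.default_rng(0)
        # one batch of candidate magnet orderings per rank (placeholder data)
        batches = [[rng.permutation(800)[:600] for _ in range(10)] for _ in range(size)]
    else:
        batches = None

    my_batch = comm.scatter(batches, root=0)         # each rank receives its own batch
    my_scores = [float(np.abs(np.cumsum(order)).max()) for order in my_batch]  # dummy objective
    all_scores = comm.gather(my_scores, root=0)      # collect results on rank 0

    if rank == 0:
        best = min(score for batch in all_scores for score in batch)
        print('best objective so far:', best)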
High Data Rate Beamlines
Archiving (Veracity)
• Simple task of registering files and metadata with a remote service.
– XML parsing
– Contacting web services
– File system interaction
• Nearly 1 PB of data and 200 million files archived through this system.
• Extended onto the cluster to deal with the additional load.
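The archiving job is mostly glue code: walk the filesystem, build a small XML description, and post it to the archive's web service. The sketch below uses only the standard library; the endpoint URL and the XML layout are invented placeholders, not the real Diamond archiving service.

    # Sketch of the register-files-with-a-remote-service pattern (standard library only).
    import os
    import urllib.request
    import xml.etree.ElementTree as ET

    def build_manifest(root_dir):
        # describe every file under root_dir in a small (hypothetical) XML manifest
        manifest = ET.Element('dataset', path=root_dir)
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                ET.SubElement(manifest, 'file',
                              name=os.path.relpath(full, root_dir),
                              size=str(os.path.getsize(full)))
        return ET.tostring(manifest)

    def register(root_dir, url='http://archive.example/register'):  # placeholder endpoint
        req = urllib.request.Request(url, data=build_manifest(root_dir),
                                     headers={'Content-Type': 'application/xml'})
        with urllib.request.urlopen(req) as resp:
            return resp.status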
MX Data Processing
(Volume and Velocity)
MX Data Reduction (Volume)
Fast DP – fast:
• Index
• Integrate (multiple integrations run in parallel)
• Pointless
• Scale, refine in P1
• Scale, postrefine, merge in point group
• Choose best point group
• Output MTZ file
xia2 – thorough downstream processing...
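Pipelines like Fast DP are mostly orchestration: run one external crystallography program after another, fan the integration step out in parallel, and collect the result. Below is a generic, hypothetical sketch of that pattern; the command names are placeholders, not the actual fast_dp source.

    # Generic pipeline-orchestration sketch (hypothetical commands, not fast_dp itself).
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def run(cmd, logfile):
        # run one external program, capturing its output for provenance
        with open(logfile, 'w') as log:
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)

    def integrate_block(block_id):
        run(['integrate_program', f'--block={block_id}'], f'integrate_{block_id}.log')
        return block_id

    def fast_pipeline(n_blocks=4):
        run(['index_program'], 'index.log')                 # index once
        with ProcessPoolExecutor() as pool:                 # integrate blocks in parallel
            list(pool.map(integrate_block, range(n_blocks)))
        run(['pointless_program'], 'pointless.log')         # choose point group
        run(['scale_program'], 'scale.log')                 # scale and merge
        return 'output.mtz'                                 # placeholder result file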
Experimental Phasing (Velocity)
Fast EP (starting from the Fast DP MTZ file):
• Prepare for Shelx – ShelxC
• Find substructure – ShelxD (over candidate spacegroups and numbers of sites)
• Phase – ShelxE (original and inverted hands; solvent fractions 0.25–0.75)
• Result: experimentally phased map
Results location: (visitpath)/processed/(folder)/(prefix)
DIALS
• Full application being built in Python
– 4 full time developers
• CCTBX
– Extending and working with this open source project
• Boost
– Optimisation when required using Boost
Tomography Data Reconstruction
(Volume and Velocity)
Tomography Current Implementation
• Existing codes for reconstruction in C with CUDA
– Only runs on TIFFs
– Minimal data correction for experimental artefacts
– Only uses 1 GPU
• Python
– Splits data and manages cluster usage (2 GPUs per node)
– Extracts corrected data from HDF
– Builds input files from metadata
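The "extracts corrected data from HDF" step is, in essence, a dark/flat-field correction before the projections go to the C/CUDA reconstruction code. A minimal sketch with invented dataset paths is shown below; it is not the production tomography script.

    # Minimal dark/flat-field correction sketch (hypothetical NeXus dataset paths).
    import h5py
    import numpy as np

    with h5py.File('tomo_scan.nxs', 'r') as f:             # placeholder file
        projections = f['/entry/data/projections'][...]     # (n_angles, y, x), placeholder path
        darks = f['/entry/data/darks'][...]
        flats = f['/entry/data/flats'][...]

    dark = darks.mean(axis=0)
    flat = flats.mean(axis=0)
    # avoid division by zero where the flat field equals the dark field
    denom = np.where(flat - dark == 0, 1.0, flat - dark)
    corrected = (projections - dark) / denom

    np.save('corrected.npy', corrected.astype(np.float32))  # hand off to the reconstruction code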
Tomography Next Gen
• mpi4py
– Cluster organisation
– Parallelism
– Queues using send buffers
• Transfer of data using ZeroMQ
– Using blosc for compression
• Processing in Python where possible
– But calls to external code will be used initially
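A hedged sketch of the ZeroMQ-plus-blosc transport mentioned above: one process compresses a NumPy block with blosc and pushes it over a PUSH/PULL socket, the receiver decompresses it back into an array. The port, shape and dtype here are placeholders, not the next-generation tomography code itself.

    # Sketch of moving numpy blocks with pyzmq + blosc (placeholder port/shape/dtype).
    import blosc
    import numpy as np
    import zmq

    def sender(data, endpoint='tcp://127.0.0.1:5557'):
        sock = zmq.Context.instance().socket(zmq.PUSH)
        sock.bind(endpoint)
        payload = blosc.compress(np.ascontiguousarray(data).tobytes(),
                                 typesize=data.dtype.itemsize)
        sock.send(payload)                                  # one compressed block

    def receiver(shape, dtype, endpoint='tcp://127.0.0.1:5557'):
        sock = zmq.Context.instance().socket(zmq.PULL)
        sock.connect(endpoint)
        payload = sock.recv()
        return np.frombuffer(blosc.decompress(payload), dtype=dtype).reshape(shape)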
Multiprocessor + MPI “profiling”
MPI “profiling”
Multiprocessor/MPI “profiling”
• JavaScript
    var dataTable = new google.visualization.DataTable()
• Python
    import logging
    logging.basicConfig(
        level=0,
        format='L %(asctime)s.%(msecs)03d M' + machine_number_string + ' '
               + rank_names[machine_rank] + ' %(levelname)-6s %(message)s',
        datefmt='%H:%M:%S')
• Jinja2 templating to tie the two together
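The Jinja2 glue can be as small as the sketch below: parse the timestamped log lines into records, then render them into the rows of a google.visualization.DataTable inside an HTML template. The record fields and the template are simplified placeholders, not the actual profiling page.

    # Sketch: turn parsed log records into a Google Charts page with Jinja2 (simplified).
    from jinja2 import Template

    TEMPLATE = Template("""
    <script>
      var dataTable = new google.visualization.DataTable();
      dataTable.addColumn('string', 'worker');
      dataTable.addColumn('number', 'seconds');
      dataTable.addRows([
      {% for rec in records %} ['{{ rec.worker }}', {{ rec.seconds }}], {% endfor %}
      ]);
    </script>
    """)

    records = [                                  # would come from parsing the log file
        {'worker': 'M0 rank0', 'seconds': 1.23},
        {'worker': 'M0 rank1', 'seconds': 1.31},
    ]

    with open('profile.html', 'w') as out:
        out.write(TEMPLATE.render(records=records))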
Where are we going?
• Scientists are having to become developers
– We try to steer them in the right direction
– Python is a very good, if not the best, tool to do this
• Developers are having to work faster and be more reactive to new detectors, clusters, software, methods, ...
– Python allows this, and is being adopted almost as standard by new computational projects at Diamond
Acknowledgements
– Alun Ashton
– Graeme Winter
– Greg Matthews
– Tina Friedrich
– Frederik Ferner
– Jonah Graham (Kichwa)
– Matthew Gerring
– Peter Chang
– Baha El Kassaby
– Jacob Filik
– Karl Levik
– Irakli Sikharulidze
– Olof Svensson
– Andy Gotz
– Gábor Náray
– Ed Rial
– Robert Oates
Thanks for Listening...
@basham_mark
www.dawnsci.org
www.diamond.ac.uk