Python for High Throughput Science
Mark Basham
Scientific Software Group
Diamond Light Source Ltd UK.
Overview
• What is Diamond Light Source
• Big Data?
• Python for scientists
• Python for developers
Diamond Light Source
What do I do?
• Provide data analysis for use during and after beamtime for users
– Users may or may not have any prior experience.
– ~30 beamlines with over 100 techniques used.
• With 12 other full-time developers
Where it all started
Client server technology
Communication with EPICS and hardware
Scan mechanism
www.opengda.org
Jython and Python
Visualisation
Communication with external analysis
Analysis tools
All core technologies open source
Acquisition
• 1.0 release 2002
• 3.0 release 2004
– Jython introduced as scripting language
Beamline setup and data collection speed increased.
Universal Data Problem
Detector History at DLS
• Early 2007:
– Diamond's first user.
– No detector faster than ~10 MB/s.
• Early 2009:
– First Lustre system (DDN S2A9900)
– First Pilatus 6M system @ 60 MB/s.
• Early 2011:
– Second Lustre system (DDN SFA10K)
– First 25 Hz Pilatus 6M system @ 150 MB/s.
• Early 2013:
– First GPFS system (DDN SFA12K)
– First 100 Hz Pilatus 6M system @ 600 MB/s
– ~10 beamlines with 10 GbE detectors (mainly Pilatus and PCO Edge).
• Early 2015:
– Delivery of Percival detector (6000 MB/s).
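To relate the peak detector rates in this history to daily data volumes, a quick conversion helps. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a detector held at its peak rate for a full 24 hours, which a real beamline would not sustain, so the figures are upper bounds. The function name is illustrative and not from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def peak_tb_per_day(rate_mb_per_s):
    """Upper bound on daily volume (TB) for a detector at constant peak rate."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000  # decimal TB


# Peak rates quoted in the detector history (MB/s).
rates = [
    ("2007, fastest detector", 10),
    ("2009, Pilatus 6M", 60),
    ("2011, 25 Hz Pilatus 6M", 150),
    ("2013, 100 Hz Pilatus 6M", 600),
    ("2015, Percival", 6000),
]

for label, rate in rates:
    print(f"{label}: {rate} MB/s -> up to {peak_tb_per_day(rate):.2f} TB/day")
```

Even the 2009 rate can exceed 1 TB/day if sustained, which is why the per-beamline rates fall into such different bands.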
[Figure: Peak Detector Performance (MB/s), 2007 to 2012, logarithmic axis from 1 to 10000]
Per Beamline Data Rates
[Figure legend: < 100 GB/day, < 1 TB/day, > 1 TB/day]
Data Storage
● ~1 PB of Lustre
● ~1 PB of GPFS
● ~0.5 PB of on-line archive
● ~1 PB near-line archive
– >200M files
High performance parallel file systems HATE lots of small files.
Small Data Rate Beamlines (Variety)
“I have all the data I have ever collected on a floppy disk and process it by hand…”
Principal beamline scientist when asked about data volumes in 2005
~1 TB so far this year
Processing / Playing with Data (Variety)
• Experimental work requires exploring
– Matlab
– IDL
– IgorPro
– Excel
– Origin
– Mathematica
• The issue is scalability: whether these tools scale at all, and at a reasonable price
Clusters (Velocity)
● 132 Intel-based nodes, 1280 Intel cores in service.
● 80 NVIDIA GPGPUs, 23328 GPU cores in service.
● Split across 6 clusters, with a range of capabilities.
● Mostly used by MX and tomography beamlines.
● All accessed via Sun Grid Engine interface.
Python is the Obvious answer
• Users have used it during their beam times.
• Free and easily distributable.
• ...
• BUT – how to give it to them in a way they
understand.
Extending the Acquisition tools
Client server technology
Communication with EPICS and hardware
Scan mechanism
www.opengda.org
Jython and Python
Visualisation
Communication with external analysis
Analysis tools
Data read, write, convert
Metadata structure
Workflows
All core technologies open source
www.dawnsci.org
DAWN is a collection of generic and bespoke ‘views’ collated into ‘perspectives’.
The perspectives and views can be used in part or whole in either the GDA or DAWN.
Acquisition Analysis
Main DAWN Elements for Python
Python/Jython
Data
Exploring
Workflow
PyDev Scripting
IPython Console
Python Actor scisoftpy module
HDF5
Visualisation
www.dawnsci.org
Scisoftpy plotting
Interactive console
Run on CMD
Real-time variable view
IPython interface
Integrated Debugging
Scripting tools
Breakpoints and step-by-step debugging
Interact with the interpreter while paused
Python @ Diamond
• Anaconda
– NumPy
– SciPy
– h5py
– mpi4py
– Web services
• ASTRA (Tomography)
• FFTW (Ptychography)
• CCTBX (Crystallography)
Processing / Playing with Data (Variety)
• Experimental work requires exploring
– Python
• Scientific Software team
– Modules for easy access and common tasks
– Repositories and Training
Aside – Python for Optimization
• We produce a very fast beam of electrons (99.999999% of the speed of light)
• We oscillate this beam between magnet arrays called Insertion Devices (IDs) to make lots of light
Insertion Devices (IDs, ~600 magnets)
Individual Magnet (~800)
Unique Magnet
Magnet Holder
[Diagram axes: x, y, z]
          X     Y     Z
Perfect   1.0   0.0   0.0
Real      1.08  0.02  -0.04
Simple Optimisation Problem
• From 800 magnets, pick 600 of them in the right order so that they appear to be a perfect array.
• But we already have code in Fortran
– A bit hard to use
– Not that extensible to new systems
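The selection-and-ordering problem can be sketched as a search over permutations. The example below is a toy under stated assumptions, not the Fortran code or its Python replacement: the cost function (penalising the accumulated field error along the array), the random error sizes, and the greedy swap search are all illustrative choices. It is scaled down from ~800 magnets and ~600 slots so that it runs in a moment.

```python
import random


def make_magnets(n, rng):
    """Toy magnets: field vectors near the perfect (1, 0, 0), with random errors."""
    return [(1.0 + rng.gauss(0, 0.05), rng.gauss(0, 0.03), rng.gauss(0, 0.03))
            for _ in range(n)]


def objective(order, magnets):
    """Toy cost (assumption): squared running sum of field errors along the array."""
    acc = [0.0, 0.0, 0.0]
    cost = 0.0
    for idx in order:
        x, y, z = magnets[idx]
        err = (x - 1.0, y, z)          # deviation from the perfect magnet
        for k in range(3):
            acc[k] += err[k]
        cost += sum(a * a for a in acc)
    return cost


def random_swap_search(magnets, n_slots, steps, rng):
    """Greedy search: swap a slot with any magnet (used or spare), keep improvements."""
    pool = list(range(len(magnets)))
    rng.shuffle(pool)                  # first n_slots entries form the array
    initial = best = objective(pool[:n_slots], magnets)
    for _ in range(steps):
        i = rng.randrange(n_slots)     # a slot in the array
        j = rng.randrange(len(pool))   # any magnet, including spares
        pool[i], pool[j] = pool[j], pool[i]
        cost = objective(pool[:n_slots], magnets)
        if cost < best:
            best = cost
        else:
            pool[i], pool[j] = pool[j], pool[i]   # undo the swap
    return pool[:n_slots], initial, best


rng = random.Random(0)
magnets = make_magnets(80, rng)
order, initial, best = random_swap_search(magnets, n_slots=60, steps=2000, rng=rng)
print(f"initial cost {initial:.3f}, after search {best:.3f}")
```

A greedy search like this gets stuck in local minima, which is one reason a global optimiser is attractive for the real problem.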
Objective Functions
• Slower in Python than Fortran
– Original code ~1,000 times slower
– NumPy-array optimised code ~10 times slower
• Python improvements
– Caching ~matched the speed
– Clever updating ~100 times faster
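The slides do not say what the caching and clever updating consisted of. One common form, shown here purely as an assumption, is delta evaluation: keep the running total from the last evaluation and adjust it for the two slots that changed, rather than recomputing the whole array. The cost function and the alternating slot weights below are toy choices made so that the update is exact.

```python
import random


def full_cost(order, errors, signs):
    """Recompute the toy cost from scratch: O(n) work per evaluation."""
    total = [0.0, 0.0, 0.0]
    for pos, idx in enumerate(order):
        for k in range(3):
            total[k] += signs[pos] * errors[idx][k]
    return sum(t * t for t in total), total


def swapped_total(total, order, errors, signs, i, j):
    """Adjust the cached total for a swap of slots i and j: O(1) work."""
    ei, ej = errors[order[i]], errors[order[j]]
    ds = signs[i] - signs[j]
    return [total[k] + ds * (ej[k] - ei[k]) for k in range(3)]


rng = random.Random(1)
n = 600
errors = [(rng.gauss(0, 0.05), rng.gauss(0, 0.03), rng.gauss(0, 0.03))
          for _ in range(n)]
signs = [1 if pos % 2 == 0 else -1 for pos in range(n)]  # assumed slot weights
order = list(range(n))

cost, total = full_cost(order, errors, signs)   # cached once
i, j = 10, 37
fast_cost = sum(t * t for t in swapped_total(total, order, errors, signs, i, j))

# Cross-check against a full recomputation after really performing the swap.
order[i], order[j] = order[j], order[i]
slow_cost, _ = full_cost(order, errors, signs)
print("delta update matches full recomputation:", abs(fast_cost - slow_cost) < 1e-9)
```

With 600 slots the delta update touches two magnets instead of 600, which is the kind of saving that can outweigh the interpreter overhead.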
OptID
• Artificial immune systems
– Global optimiser
– Need more evaluations
• Parallelization
– Threading with NumPy to use processors
– mpi4py for data transfer and making use of the cluster
• Running on 25 machines, 200 CPUs
• First sort with the new code has been built.
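As a rough illustration of the artificial immune system idea, the sketch below implements a small clonal selection loop: each candidate ordering is cloned, better-ranked candidates receive gentler mutations, the best clone survives, and the worst member is replaced by a fresh random ordering. It is not OptID itself; the scalar errors, the cost function, and all parameter values are assumptions chosen to keep the example short. Each clone is evaluated independently, which is the part that would be spread across processes in a cluster setting.

```python
import random


def cost(order, errors):
    """Toy cost (assumption): squared alternating-sign sum of scalar errors."""
    total = 0.0
    for pos, idx in enumerate(order):
        total += errors[idx] if pos % 2 == 0 else -errors[idx]
    return total * total


def mutate(order, n_swaps, rng):
    """Return a copy of the ordering with n_swaps random pairwise swaps."""
    child = order[:]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(child)), rng.randrange(len(child))
        child[i], child[j] = child[j], child[i]
    return child


def clonal_selection(errors, pop_size=10, clones=5, generations=50, seed=0):
    rng = random.Random(seed)
    n = len(errors)
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    history = []                      # best cost seen at each generation
    for _ in range(generations):
        population.sort(key=lambda o: cost(o, errors))
        history.append(cost(population[0], errors))
        next_pop = []
        for rank, parent in enumerate(population):
            # Better-ranked parents get fewer swaps per clone.
            candidates = [parent] + [mutate(parent, rank + 1, rng)
                                     for _ in range(clones)]
            next_pop.append(min(candidates, key=lambda o: cost(o, errors)))
        # Replace the worst member with a newcomer to keep diversity.
        next_pop[-1] = rng.sample(range(n), n)
        population = next_pop
    population.sort(key=lambda o: cost(o, errors))
    history.append(cost(population[0], errors))
    return population[0], history


rng = random.Random(42)
errors = [rng.gauss(0, 0.05) for _ in range(40)]
best_order, history = clonal_selection(errors)
print(f"best cost: first generation {history[0]:.6f}, final {history[-1]:.6f}")
```

The many clone evaluations per generation are why a global optimiser of this kind needs more objective evaluations than a simple local search.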
High Data Rate Beamlines
Archiving (Veracity)
• Simple task of registering files and metadata with a remote service.
– XML parsing
– Contact web services
– File system interaction
• Nearly 1 PB of data and 200 million files archived through this system.
• Extended onto the cluster to deal with the additional load.
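The file system and XML parts of such a registration step can be sketched with the standard library alone. Everything specific here is hypothetical: the element and attribute names, the visit identifier, and the idea of sending one document per visit are assumptions for illustration, and the call to the remote web service is deliberately left out.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_registration(visit_dir, visit_id):
    """Walk a visit directory and describe every file in one XML document."""
    root = ET.Element("registration", visit=visit_id)  # hypothetical schema
    for dirpath, _dirnames, filenames in os.walk(visit_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            ET.SubElement(root, "file",
                          path=os.path.relpath(path, visit_dir),
                          size=str(info.st_size),
                          mtime=str(int(info.st_mtime)))
    return root


# Demonstration on a throwaway directory instead of a real visit.
with tempfile.TemporaryDirectory() as visit_dir:
    for i in range(3):
        with open(os.path.join(visit_dir, f"scan_{i}.dat"), "w") as handle:
            handle.write("x" * (i + 1))
    record = build_registration(visit_dir, visit_id="demo-visit")
    payload = ET.tostring(record, encoding="unicode")

# Parse the payload back, as a receiving service would have to.
parsed = ET.fromstring(payload)
print(len(parsed.findall("file")), "files registered")
```

Because each visit directory can be processed independently, a step like this is straightforward to fan out across cluster nodes when the load grows.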
MX Data Processing
(Volume and Velocity)
MX Data Reduction (Volume)
Fast DP - fast
Index
Integrate
Pointless
Scale, refine in P1
Scale, postrefine, merge in point group
Choose best point group
Integrate Integrate Integrate Integrate
Output MTZ File
xia2 – thorough
downstream processing...
Experimental Phasing (Velocity)
Fast EP
Prepare for Shelx - ShelxC
Phase - ShelxE
Solvent fraction
Original
Inverted
Find substructure - ShelxD
#sites
Spacegroups
0.25 0.75
Experimentally phased map
Fast DP MTZ file
Results location: (visitpath)/processed/(folder)/(prefix)
DIALS
• Full application being built in Python
– 4 full-time developers
• CCTBX
– Extending and working with this open source project
• Boost
– Optimization when required using Boost
Tomography Data Reconstruction
(Volume and Velocity)
Tomography Current Implementation
• Existing codes for reconstruction in C with CUDA
– Only runs on TIFFs
– Minimal data correction for experimental artefacts
– Only uses 1 GPU
• Python
– Splits data and manages cluster usage (2 GPUs per node)
– Extracts corrected data from HDF
– Builds input files from metadata
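The data-splitting role described for Python can be pictured as handing contiguous ranges of slices to each GPU on each node. The sketch below is only an illustration of that bookkeeping: the function name, the dictionary layout, and the example sizes (2160 slices, 4 nodes) are assumptions, and nothing is submitted to a real scheduler.

```python
def split_slices(n_slices, n_nodes, gpus_per_node=2):
    """Assign contiguous slice ranges to (node, gpu) workers as evenly as possible."""
    workers = [(node, gpu)
               for node in range(n_nodes)
               for gpu in range(gpus_per_node)]
    base, extra = divmod(n_slices, len(workers))
    plan, start = [], 0
    for k, (node, gpu) in enumerate(workers):
        count = base + (1 if k < extra else 0)   # first `extra` workers take one more
        plan.append({"node": node, "gpu": gpu,
                     "start": start, "stop": start + count})
        start += count
    return plan


plan = split_slices(n_slices=2160, n_nodes=4)
for job in plan:
    print(job)
```

Each entry could then drive one single-GPU reconstruction job, which is how a code limited to one GPU can still use a whole cluster.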
Tomography Next Gen
• mpi4py
– Cluster organisation
– Parallelism
– Queues using send buffers
• Transfer of data using ZeroMQ
– Using blosc for compression
• Processing in Python where possible
– But calls to external code will be used initially
Multiprocessor + MPI “profiling”
MPI “profiling”
Multiprocessor/MPI “profiling”
• JavaScript
var dataTable = new google.visualization.DataTable();
• Python
import logging
# machine_number_string, rank_names and machine_rank are assumed to be defined earlier
logging.basicConfig(level=0,
                    format='L %(asctime)s.%(msecs)03d M' + machine_number_string +
                           ' ' + rank_names[machine_rank] +
                           ' %(levelname)-6s %(message)s',
                    datefmt='%H:%M:%S')
• Jinja2 templating to tie the two together
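One way to get from log lines in the format shown above to rows for a chart is to parse them back into structured records. The sketch below is an assumption about how that could look, not the code used at Diamond: the logger factory, the regular expression, and the row layout are illustrative, and the rendering step (a template filling in the JavaScript DataTable) is not shown.

```python
import io
import json
import logging
import re


def make_rank_logger(machine_number, rank_name, stream):
    """Logger whose lines carry machine and rank, in the slide's log format."""
    fmt = ("L %(asctime)s.%(msecs)03d M" + str(machine_number) + " " + rank_name +
           " %(levelname)-6s %(message)s")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt, datefmt="%H:%M:%S"))
    logger = logging.getLogger(f"profile.m{machine_number}.{rank_name}")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.propagate = False          # keep the demo output out of the root logger
    return logger


# Matches: L HH:MM:SS.mmm M<machine> <rank> <LEVEL> <message>
LINE = re.compile(r"^L (\d\d:\d\d:\d\d\.\d{3}) M(\S+) (\S+) (\S+)\s+(.*)$")


def parse_log(text):
    """Turn log text into a list of dict rows, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            when, machine, rank, level, message = match.groups()
            rows.append({"time": when, "machine": machine, "rank": rank,
                         "level": level, "message": message})
    return rows


stream = io.StringIO()
log = make_rank_logger(0, "rank_0", stream)
log.debug("loading projections")
log.info("reconstruction started")

rows = parse_log(stream.getvalue())
print(json.dumps(rows, indent=2))
```

Rows in this shape can be serialised to JSON and passed to a template, so the same log files serve both debugging and the timeline view.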
Where are we going?
• Scientists are having to become developers
– We try to steer them in the right direction
– Python is a very good, if not the best, tool to do this
• Developers are having to work faster and be more reactive to new detectors, clusters, software, methods, ...
– Python allows this, and is being adopted almost as standard by new computational projects at Diamond
Acknowledgements
– Alun Ashton
– Graeme Winter
– Greg Matthews
– Tina Friedrich
– Frederik Ferner
– Jonah Graham
(Kichwa)
– Matthew Gerring
– Peter Chang
– Baha El Kassaby
– Jacob Filik
– Karl Levik
– Irakli Sikharulidze
– Olof Svensson
– Andy Gotz
– Gábor Náray
– Ed Rial
– Robert Oates
Thanks for Listening...
@basham_mark
www.dawnsci.org
www.diamond.ac.uk

More Related Content

What's hot

Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Naoki (Neo) SATO
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
Humoyun Ahmedov
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
viirya
 
Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-Data
DataWorks Summit
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
Jen Aman
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
inside-BigData.com
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Python for Earth
Python for EarthPython for Earth
Python for Earth
zakiakhmad
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
inside-BigData.com
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Data Con LA
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
NAVER D2
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Sonal Raj
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
Sri Ambati
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
P. Taylor Goetz
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Databricks
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
Sagar Dolas
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with Hadoop
Denis Shestakov
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
T Jake Luciani
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Databricks
 

What's hot (20)

Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-Data
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Python for Earth
Python for EarthPython for Earth
Python for Earth
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with Hadoop
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
 

Similar to Python for High Throughput Science by Mark Basham

Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
Putchong Uthayopas
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
DataStax Academy
 
From the Archives: Future of Supercomputing at Altparty 2009
From the Archives: Future of Supercomputing at Altparty 2009From the Archives: Future of Supercomputing at Altparty 2009
From the Archives: Future of Supercomputing at Altparty 2009
Olli-Pekka Lehto
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
Hitoshi Sato
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
Ferdinand Jamitzky
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
Chester Chen
 
IBM and ASTRON 64bit μServer for DOME
IBM and ASTRON 64bit μServer for DOMEIBM and ASTRON 64bit μServer for DOME
IBM and ASTRON 64bit μServer for DOME
IBM Research
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
Ganesan Narayanasamy
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
inside-BigData.com
 
Py tables
Py tablesPy tables
Py tables
Ali Hallaji
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
Innfinision Cloud and BigData Solutions
 
PyTables
PyTablesPyTables
PyTables
Ali Hallaji
 
PyTables
PyTablesPyTables
PyTables
Ali Hallaji
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PC Cluster Consortium
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
cursoNGS
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
Chips&toys
Chips&toysChips&toys
Chips&toys
Serendipity Seraph
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
HungWei Chiu
 
Apache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbApache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdb
ZhangZhengming
 

Similar to Python for High Throughput Science by Mark Basham (20)

Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
From the Archives: Future of Supercomputing at Altparty 2009
From the Archives: Future of Supercomputing at Altparty 2009From the Archives: Future of Supercomputing at Altparty 2009
From the Archives: Future of Supercomputing at Altparty 2009
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
IBM and ASTRON 64bit μServer for DOME
IBM and ASTRON 64bit μServer for DOMEIBM and ASTRON 64bit μServer for DOME
IBM and ASTRON 64bit μServer for DOME
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Py tables
Py tablesPy tables
Py tables
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
PyTables
PyTablesPyTables
PyTables
 
PyTables
PyTablesPyTables
PyTables
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Chips&toys
Chips&toysChips&toys
Chips&toys
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
Apache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbApache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdb
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
PyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
PyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
PyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
PyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
PyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
PyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
PyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
PyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
PyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
PyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
PyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Python for High Throughput Science by Mark Basham

  • 1. Python for High Throughput Science Mark Basham Scientific Software Group Diamond Light Source Ltd UK.
  • 2. Overview • What is Diamond Light Source • Big Data? • Python for scientists • Python for developers
  • 5. What do I do? • Provide data analysis for use during and after beamtime for users –Users may or may not have any prior experience. –~30 beamlines with over 100 techniques used. • With 12 other Full time developers
  • 6. Where it all started Client server technology Communication with EPICS and hardware Scan mechanism www.opengda.org Jython and Python Visualisation Communication with external analysis Analysis tools All core technologies open source Acquisition • 1.0 release 2002 • 3.0 release 2004 – Jython introduced as scripting language Beamline setup and data collection speed increased.
  • 8. Detector History at DLS • Early 2007: – Diamond first user. – No detector faster than ~10 MB/sec. • Early 2009: – first Lustre system (DDN S2A9900) – first Pilatus 6M system @ 60 MB/s. • Early 2011: – second Lustre system (DDN SFA10K) – first 25 Hz Pilatus 6M system @ 150 MB/s. • Early 2013: – first GPFS system (DDN SFA12K) – first 100 Hz Pilatus 6M system @ 600 MB/s – ~10 beamlines with 10 GbE detectors (mainly Pilatus and PCO Edge). • Early 2015: – delivery of Percival detector (6000 MB/s). [Chart: Peak Detector Performance (MB/s), 2007 to 2012]
  • 9. < 100GB/day < 1TB/day > 1TB/day Per Beamline Data Rates
  • 10. Data Storage ● ~1PB of Lustre ● ~1PB of GPFS ● ~0.5PB of on-line archive ● ~1PB near-line archive – >200M files High performance parallel file systems HATE lots of small files.
  • 12. < 100GB/day < 1TB/day > 1TB/day Small Data Rate Beamlines (Variety)
  • 13. “I have all the data I have ever collected on a floppy disk and process it by hand…” Principal beam-line scientist when asked about data volumes in 2005
  • 14. “I have all the data I have ever collected on a floppy disk and process it by hand…” ~1 TB so far this year
  • 15. Processing Data (Variety) • Experimental work requires exploring – Matlab – IDL – IgorPro – Excel – Origin – Mathematica
  • 16. Processing Playing with Data (Variety) • Experimental work requires exploring – Matlab – IDL – IgorPro – Excel – Origin – Mathematica • The issue is whether these tools scale at all, and whether they do so at a reasonable price
  • 17. Clusters (Velocity) ● 132 Intel-based nodes, 1280 Intel cores in service. ● 80 NVIDIA GPGPUs, 23328 GPU cores in service. ● Split across 6 clusters, with a range of capabilities. ● Mostly used by MX and tomography beamlines. ● All accessed via Sun Grid Engine interface.
  • 18. Python is the Obvious answer • Users have used it during their beam times. • Free and easily distributable. • ... • BUT – how to give it to them in a way they understand.
  • 19. Extending the Acquisition tools Client server technology Communication with EPICS and hardware Scan mechanism www.opengda.org Jython and Python Visualisation Communication with external analysis Analysis tools Data read, write, convert Metadata structure Workflows All core technologies open source www.dawnsci.org DAWN is a collection of generic and bespoke ‘views’ collated into ‘perspectives’. The perspectives and views can be used in part or whole in either the GDA or DAWN. Acquisition Analysis
  • 20. Main Dawn Elements for Python Python/Jython Data Exploring Workflow PyDev Scripting IPython Console Python Actor scisoftpy module HDF5 Visualisation www.dawnsci.org
  • 22. Interactive console Run on CMD Real-time variable view IPython interface Integrated Debugging
  • 23. Scripting tools Breakpoints and Step by step debugging Interact with the interpreter while paused
  • 24. Python @ Diamond • Anaconda –Numpy –Scipy –H5py –Mpi4py –Webservices • Astra (Tomography) • FFTW (Ptychography) • CCTBX (Crystallography)
  • 25. Processing Playing with Data (Variety) • Experimental work requires exploring – Python • Scientific Software team – Modules for easy access and common tasks – Repositories and Training
  • 26. Aside – Python for Optimization • We produce a very fast beam of electrons (99.999999% of the speed of light) • We oscillate this beam between magnet arrays called Insertion Devices (IDs) to make lots of light
  • 27. Insertion Devices (IDs, ~600 magnets)
  • 28. Individual Magnet (~800) • Unique Magnet in a Magnet Holder, axes x, y, z • Perfect magnet (X, Y, Z): 1.0, 0.0, 0.0 • Real magnet (X, Y, Z): 1.08, 0.02, -0.04
  • 29. Simple Optimisation Problem • From 800 magnets, pick 600 of them in the right order so that they appear to be a perfect array. • But we already have code in Fortran –Bit hard to use –Not that extensible to new systems
  • 30. Objective Functions • Slower in Python than Fortran – Original code ~ 1,000 times slower – Numpy array optimised ~ 10 times slower • Python improvements: – Caching ~ matched the speed – Clever updating ~ 100 times faster.
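The "clever updating" point on slide 30 can be illustrated with a minimal sketch. The slides do not give the real objective function, so this assumes a hypothetical one: the squared magnitude of an alternating-sign sum of the magnet vectors. Under that assumption, replacing one magnet only changes one term of the sum, so the cost can be updated in constant time from a cached total instead of being recomputed over all 600 slots. The names `full_cost`, `swapped_cost`, the weights, and the random magnet data are all illustrative.

```python
import random


def full_cost(slots, weights):
    """Recompute the (hypothetical) cost from scratch: O(N) per evaluation."""
    total = [0.0, 0.0, 0.0]
    for magnet, w in zip(slots, weights):
        for k in range(3):
            total[k] += w * magnet[k]
    return sum(t * t for t in total), total


def swapped_cost(total, old, new, w):
    """Cost after replacing one magnet, updated in O(1) from the cached total."""
    updated = [t + w * (n - o) for t, o, n in zip(total, old, new)]
    return sum(t * t for t in updated), updated


random.seed(0)
# Made-up magnets close to (1, 0, 0), loosely modelled on the "Real" row of slide 28.
pool = [
    (1 + random.gauss(0, 0.05), random.gauss(0, 0.02), random.gauss(0, 0.02))
    for _ in range(800)
]
slots, spares = pool[:600], pool[600:]
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(600)]  # assumed sign pattern

cost, total = full_cost(slots, weights)

# Try putting the first spare magnet into slot 17 without a full recompute.
i = 17
new_cost, new_total = swapped_cost(total, slots[i], spares[0], weights[i])

# Cross-check against the slow path.
slots_after = list(slots)
slots_after[i] = spares[0]
check_cost, _ = full_cost(slots_after, weights)
```

The cached `total` plays the role of the "caching" bullet, and `swapped_cost` the "clever updating" bullet; the real speed-ups quoted on the slide depend on the actual objective, which is not shown here.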
  • 31. OptID • Artificial immune systems – Global optimiser – Need more evaluations • Parallelization – Threading with np to use processors – Mpi4py for data transfer and making use of the cluster • Running on 25 machines, 200 CPUs • First sort with the new code has been built.
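As a rough illustration of the artificial immune system idea, here is a small clonal-selection-style optimiser written with the standard library only. The slides do not say which immune algorithm OptID uses, so the algorithm choice, the scalar objective (`cost`), and all parameter values are assumptions for this sketch. It picks and orders a subset of magnets from a larger pool, clones each candidate, mutates lower-ranked candidates more heavily, and keeps the best of parents and clones.

```python
import random


def cost(order, values):
    """Hypothetical objective: squared alternating sum of the chosen magnet values."""
    total = 0.0
    for slot, idx in enumerate(order):
        total += values[idx] if slot % 2 == 0 else -values[idx]
    return total * total


def mutate(order, n_pool, n_swaps, rng):
    """Copy an ordering and change n_swaps slots, keeping every magnet unique."""
    child = list(order)
    used = set(child)
    for _ in range(n_swaps):
        slot = rng.randrange(len(child))
        candidate = rng.randrange(n_pool)
        if candidate in used:
            # Magnet already in the array: exchange the two positions.
            other = child.index(candidate)
            child[slot], child[other] = child[other], child[slot]
        else:
            # Unused magnet: substitute it into the slot.
            used.discard(child[slot])
            used.add(candidate)
            child[slot] = candidate
    return child


def clonal_selection(values, n_slots, generations=30, pop=10, clones=5, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(range(len(values)), n_slots) for _ in range(pop)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda o: cost(o, values))
        history.append(cost(population[0], values))
        offspring = []
        for rank, parent in enumerate(population):
            # Better-ranked candidates receive milder mutations.
            for _ in range(clones):
                offspring.append(mutate(parent, len(values), 1 + rank, rng))
        # Keep the best of parents and clones, so the best cost never gets worse.
        merged = population + offspring
        population = sorted(merged, key=lambda o: cost(o, values))[:pop]
    best = population[0]
    return best, cost(best, values), history


data_rng = random.Random(1)
values = [1 + data_rng.gauss(0, 0.05) for _ in range(80)]  # made-up magnet strengths
best, best_cost, history = clonal_selection(values, n_slots=60)
```

Each generation evaluates many independent candidates, which is why this family of optimisers "needs more evaluations" and also why it parallelises well; distributing the evaluations (for example with mpi4py, as the slide mentions) is left out of this sketch.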
  • 32. < 100GB/day < 1TB/day > 1TB/day High Data Rate Beamlines
  • 33. Archiving (Veracity) • Simple task of registering files and metadata with a remote service. – XML parsing – Contact web services – File system interaction • Nearly 1PB of data and 200 million files archived through this system. • Extended onto the cluster to deal with the additional load.
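The XML parsing and file system steps of such a registration task might look like the sketch below. The slides do not show the real file format or service, so the element names (`dataset`, `visit`, `file`), the visit identifier, and the record layout are hypothetical, and the web service call is deliberately left out.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_registration(xml_text, root_dir):
    """Parse a (hypothetical) dataset description and gather per-file metadata."""
    tree = ET.fromstring(xml_text)
    visit = tree.findtext("visit")
    records = []
    for node in tree.iter("file"):
        path = os.path.join(root_dir, node.text.strip())
        # File system interaction: the size is read from disk, not trusted from the XML.
        records.append({"visit": visit, "path": path, "size": os.path.getsize(path)})
    return records


# Small self-contained demonstration with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "scan_0001.h5"), "wb") as fh:
        fh.write(b"12345")
    xml_text = "<dataset><visit>xx0000-1</visit><file>scan_0001.h5</file></dataset>"
    records = build_registration(xml_text, tmp)
```

In a real system each record would then be sent to the remote archive service; batching those calls is one place where moving the work onto a cluster, as the slide describes, could help.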
  • 34. < 100GB/day < 1TB/day > 1TB/day MX Data Processing (Volume and Velocity)
  • 35. MX Data Reduction (Volume) • Fast DP - fast: Index; Integrate; Pointless; Scale, refine in P1; Scale, postrefine, merge in point group; Choose best point group; Integrate (×4); Output MTZ File • xia2 – thorough downstream processing...
  • 36. Experimental Phasing (Velocity) Fast EP Prepare for Shelx - ShelxC Phase - ShelxE Solvent fraction Original Inverted Find substructure - ShelxD #sites Spacegroups 0.25 0.75 Experimentally phased map Fast DP MTZ file Results location: (visitpath)/processed/(folder)/(prefix)
  • 37. DIALS • Full application being built in Python – 4 full time developers • CCTBX – Extending and working with this open source project • Boost – Optimization when required using Boost
  • 38. < 100GB/day < 1TB/day > 1TB/day Tomography Data Reconstruction (Volume and Velocity)
  • 39. Tomography Current Implementation • Existing codes for reconstruction in C with CUDA – Only runs on TIFFs – Minimal data correction for experimental artefacts – Only uses 1 GPU • Python – Splits data and manages cluster usage (2 GPUs per node) – Extracts corrected data from HDF – Builds input files from metadata
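The "splits data and manages cluster usage" role of the Python layer can be sketched as a simple planning step. This is only an illustration under assumptions: the function name `plan_chunks`, the contiguous-slice strategy, and the dictionary layout are not from the slides; only the figure of 2 GPUs per node is.

```python
def plan_chunks(n_slices, n_nodes, gpus_per_node=2):
    """Divide n_slices into contiguous ranges, one per GPU, as evenly as possible."""
    workers = n_nodes * gpus_per_node
    base, extra = divmod(n_slices, workers)
    plan = []
    start = 0
    for w in range(workers):
        # The first `extra` workers take one additional slice.
        size = base + (1 if w < extra else 0)
        if size == 0:
            continue  # more GPUs than slices: leave this GPU idle
        plan.append({
            "node": w // gpus_per_node,
            "gpu": w % gpus_per_node,
            "start": start,
            "stop": start + size,  # exclusive upper bound
        })
        start += size
    return plan


plan = plan_chunks(n_slices=10, n_nodes=2)
```

Each entry could then be turned into an input file for the single-GPU reconstruction code and submitted as one cluster job.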
  • 40. Tomography Next Gen • Mpi4py – Cluster organisation – Parallelism – Queues using send buffers • Transfer of data using ZeroMQ – Using blosc for compression • Processing in Python where possible – But calls to external code will be used initially
  • 41. Multiprocessor + MPI “profiling”
  • 43. Multiprocessor/MPI “profiling” • Javascript var dataTable = new google.visualization.DataTable() • Python import logging logging.basicConfig(level=0,format='L %(asctime)s.%(msecs)03d M' + machine_number_string + ' ' + rank_names[machine_rank] + ' %(levelname)-6s %(message)s', datefmt='%H:%M:%S') • Jinja2 templating to tie the 2 together
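A runnable variant of the logging setup on slide 43 is sketched below, together with a parser that turns each log line back into a record suitable for a timeline. The values of `machine_number_string`, `rank_names`, and `machine_rank` are placeholders (the slide does not show how they are set), and the regular expression and `parse_line` helper are additions for illustration. The Google Charts and Jinja2 parts are not reproduced.

```python
import logging
import re

# Placeholder values; in a real MPI job these would come from the launcher.
machine_number_string = "03"
rank_names = ["master", "worker"]
machine_rank = 1

fmt = ('L %(asctime)s.%(msecs)03d M' + machine_number_string + ' '
       + rank_names[machine_rank] + ' %(levelname)-6s %(message)s')
formatter = logging.Formatter(fmt, datefmt='%H:%M:%S')

# Matches lines produced by the format above.
LINE = re.compile(r'^L (\d\d):(\d\d):(\d\d)\.(\d{3}) M(\S+) (\S+) (\S+)\s+(.*)$')


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    h, mi, s, ms, machine, rank, level, message = m.groups()
    seconds = int(h) * 3600 + int(mi) * 60 + int(s) + int(ms) / 1000.0
    return {"seconds": seconds, "machine": machine, "rank": rank,
            "level": level, "message": message}


# Format a single record by hand so the example does not touch global logging state.
record = logging.LogRecord("demo", logging.INFO, "demo.py", 0,
                           "start reconstruction", None, None)
line = formatter.format(record)
parsed = parse_line(line)
```

Collecting the parsed records from every machine and sorting them by `seconds` gives the rows that a templated chart page could display, one lane per machine and rank.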
  • 44. Where are we going? • Scientists are having to become developers – We try to steer them in the right direction – Python is a very good, if not the best, tool to do this • Developers are having to work faster and be more reactive to new detectors, clusters, software, methods, ... – Python allows this, and is being adopted almost as standard by new computational projects at Diamond
  • 45. Acknowledgements – Alun Ashton – Graeme Winter – Greg Matthews – Tina Friedrich – Frederik Ferner – Jonah Graham (Kichwa) – Matthew Gerring – Peter Chang – Baha El Kassaby – Jacob Filik – Karl Levik – Irakli Sikharulidze – Olof Svensson – Andy Gotz – Gábor Náray – Ed Rial – Robert Oates