Enabling Real Time Analysis & Decision Making - A Paradigm Shift for Experimental Science


By Kerstin Kleese van Dam
PyData New York City 2017

New instrument technologies are enabling a new generation of in-situ and in-operando experiments with extremely fine spatial and temporal resolution, allowing researchers to observe physics, chemistry and biology as they happen. These new methodologies go hand in hand with exponential growth in data volumes and rates: petabyte-scale data collections and terabyte-per-second streams. At the same time, scientists are pushing for a paradigm shift: now that they can observe processes in intricate detail, they want to analyze, interpret and control those processes. Given the multitude of voluminous, heterogeneous data streams involved in every single experiment, novel real-time, data-driven analysis and decision support approaches are needed to realize their vision. This talk will discuss state-of-the-art streaming analysis for experimental facilities, its challenges and early successes. It will present where commercial technologies can be leveraged and how many of the novel approaches differ from commonly available solutions.



  1. Enabling Real Time Analysis and Decision Making - Kerstin Kleese van Dam
  2. Advancing science and technology to fulfill the DOE mission
  3. BNL is a Data-Driven Science Laboratory - science today is data-driven discovery. DOE has 27 user facilities; 6 are operated by BNL. BNL provides data-rich experimental facilities:
     ▪ RHIC - Relativistic Heavy Ion Collider, supporting over 1,000 scientists worldwide
     ▪ NSLS II - newest and brightest synchrotron in the world, supporting a multitude of scientific research in academia, industry and national security
     ▪ CFN - Center for Functional Nanomaterials, combining theory and experiment to probe materials
     ▪ ATF - Accelerator Test Facility
     ▪ LHC ATLAS - largest Tier 1 center outside CERN
     ▪ ARM - Atmospheric Radiation Measurement program, partner in this multi-site facility, operating its external data center
     BNL also supports large-scale experimental facilities:
     ▪ Belle II - computing for the neutrino experiment
     ▪ QCD - computing facilities for the BNL, RIKEN & US QCD communities
  4. BNL Data Statistics - 2017 milestone: over 100 PB of catalogued data archived long term with frequent reuse.
     ● 2nd-largest scientific archive in the US, 4th-largest in the world (after ECMWF, UKMO, NOAA)
     ● 2016: 400 PB analyzed; 2017: 500 PB expected
     ● 2016: 37 PB exported
     ● BNL's PanDA software enabled processing of ~1.6 exabytes of data worldwide in 2016
  5. Where we need Real Time Analysis and Decision Making
  6. When do we need streaming decision support?
     • Data arrives at high velocity and volume
     • Critical decisions have to be taken while the data is still arriving
     • Events of interest are rare
     • Tacit knowledge is required for accurate interpretation
  7. What does streaming analysis include? Experimental control:
     • Analyze all the facts as the situation is evolving
     • Consider background information
     • Evaluate alternatives
     • Request more information if needed
     • Set priorities
     • Take decisions
  8. Example 1: CFN TEM - A Widely Used Research Tool
     • In Transmission Electron Microscopy (TEM), a beam of electrons is transmitted through an ultra-thin specimen, interacting with the specimen as it passes through.
     • ~500 aberration-corrected (S)TEMs worldwide, with ~20-30 in national laboratories
     • In-situ observations generate 10 GB to tens of TB of data per experiment (e.g., at BNL), and growing, at rates ranging from 100 images/sec for basic instruments to 400 images/sec (3 GB/s) for the state-of-the-art systems at BNL
     • Real-time analysis is needed to steer the experiment
  9. Nanoparticle detection in TEM images: 400 frames/s, 3 GB/s, high noise. A convolutional neural network (CNN) based deep learning framework - U-Net.
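The slides name only "CNN-based U-Net" for this step, so as a hedged illustration, the sketch below builds a deliberately tiny U-Net-style segmenter in PyTorch: one encoder/decoder level with a skip connection, emitting per-pixel particle logits. The depth, channel counts, and input shapes are assumptions, not BNL's actual framework.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style segmenter: one down/up level with a skip connection."""
    def __init__(self, in_ch=1, out_ch=1, base=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(
            nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Sequential(
            nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base, out_ch, 1))          # per-pixel particle logits

    def forward(self, x):
        e = self.enc(x)                          # full-resolution features
        m = self.mid(self.down(e))               # coarse features
        u = self.up(m)                           # back to full resolution
        return self.dec(torch.cat([u, e], dim=1))  # skip connection

# One segmentation pass over a batch of noisy stand-in frames.
frames = torch.randn(4, 1, 64, 64)               # stand-in for noisy TEM frames
mask_logits = TinyUNet()(frames)
print(mask_logits.shape)                         # torch.Size([4, 1, 64, 64])
```

A real deployment would train this on annotated frames with a per-pixel loss (e.g. binary cross-entropy) and threshold the logits to get particle masks.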
  10. TEM detection results. (Figure panels: original frames, detection results, size distribution.)
  11. Protein structure detection in cryo-EM images: image size 4096×4096, ~120 GB for one protein, high noise. (Figure panels: cryo-EM images, protein detection result.)
  12. Protein structure detection in cryo-EM images - recall, precision and F1 for five training setups (S1-S5) at detection thresholds 0.1-0.9:

      Recall:
      Thresh  S1    S2    S3    S4    S5
      0.1     0.78  1.00  1.00  1.00  1.00
      0.2     0.67  1.00  1.00  1.00  1.00
      0.3     0.47  1.00  1.00  1.00  1.00
      0.4     0.16  1.00  1.00  1.00  1.00
      0.5     0.03  1.00  1.00  1.00  1.00
      0.6     0.01  1.00  1.00  1.00  1.00
      0.7     0.01  0.99  0.98  1.00  1.00
      0.8     0.00  0.93  0.92  0.90  0.93
      0.9     0.00  0.58  0.57  0.41  0.58

      Precision:
      Thresh  S1    S2    S3    S4    S5
      0.1     0.89  0.86  0.86  0.83  0.82
      0.2     0.77  0.86  0.86  0.83  0.82
      0.3     0.55  0.86  0.86  0.83  0.82
      0.4     0.18  0.86  0.85  0.83  0.82
      0.5     0.03  0.86  0.85  0.82  0.82
      0.6     0.01  0.86  0.85  0.82  0.82
      0.7     0.00  0.85  0.84  0.82  0.81
      0.8     0.00  0.80  0.79  0.75  0.76
      0.9     0.00  0.50  0.49  0.34  0.48

      F1:
      Thresh  S1    S2    S3    S4    S5
      0.1     0.83  0.92  0.92  0.90  0.90
      0.2     0.72  0.92  0.92  0.90  0.90
      0.3     0.51  0.92  0.92  0.91  0.90
      0.4     0.17  0.92  0.92  0.90  0.90
      0.5     0.03  0.92  0.92  0.90  0.90
      0.6     0.01  0.92  0.92  0.90  0.90
      0.7     0.00  0.91  0.90  0.89  0.89
      0.8     0.00  0.86  0.85  0.82  0.84
      0.9     0.00  0.54  0.53  0.37  0.53

      S1: directly using the pre-trained model
      S2: trained with random initialization, 54,000 samples
      S3: trained with pre-trained model, 54,000 samples
      S4: trained with random initialization, 1,000 samples
      S5: trained with pre-trained model, 1,000 samples
  13. Example 2: NSLS II Light Source (1)
      • Brightest light source in the world - 10,000 times brighter than the original NSLS
      • 2,000 users per year (4,000 in FY17)
      • At full build-out, 60 unique beamlines, each with a variety of instrument configurations
      • Enabling solutions to pressing energy challenges, e.g.: advanced electrical storage; high-temperature superconductors for the electric grid; fuel cells based on nanocatalysts; plant/environment interactions
  14. Example 2: NSLS II Light Source (2)
      • Coherent Hard X-Ray (CHX) beamline: 4.5 GB/s sustained data rates
      • Hard X-Ray Nanoprobe (HXN): 50 GB/s sustained, 1-5 TB/s in bursts
      • In-situ and in-operando experiments: observing and guiding physical, chemical and biological transformations as they happen, in working objects such as batteries and microchips
      • Real-time analysis is needed to steer the experiments
  15. CHX - Healing X-ray Scattering Images
      Scientific Achievement: A "physics-aware" algorithm was developed to heal defects in x-ray scattering datasets.
      Significance and Impact: Healing x-ray data allows rapid automated analysis and is a useful pre-processing step for machine learning.
      Research Details: Modern x-ray synchrotrons generate data at a massive rate; however, defects in the data (gaps, artifacts) complicate automated analysis, and traditional 'inpainting' fails on scientific data. By exploiting the known physics of x-ray scattering, a tuned healing algorithm was developed; for instance, symmetry is recognized and exploited. Healed images can be more easily analyzed using existing data pipelines, including modern machine-learning methods, and the algorithm also outputs physically useful information (symmetry, type of ordering, etc.).
      Caption - Healing X-ray Images: X-ray diffraction/scattering images typically have defects, including missing data and artifacts, which complicate subsequent analysis. Traditional image healing algorithms do not interpolate scientific data in a physically meaningful way; the "physics-aware" inpainting method allows x-ray scattering images to be healed in a physically rigorous way.
      Jiliang Liu, et al., IUCrJ 4, 455 (2017).
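The published healing algorithm is considerably more sophisticated; the NumPy sketch below illustrates only the core "physics-aware" idea the slide mentions - exploiting the (approximate) 180-degree rotational symmetry common in scattering patterns to fill detector gaps from each pixel's symmetry partner. The toy ring pattern, image size, and beam-center placement are all assumptions.

```python
import numpy as np

def heal_by_symmetry(img, mask):
    """Fill masked pixels from their point-symmetric partners.

    Many scattering patterns are (approximately) symmetric under a
    180-degree rotation about the beam center, so a defect pixel can be
    copied from its symmetry partner when that partner holds valid data.
    `mask` is True where data is missing.
    """
    partner = img[::-1, ::-1]            # 180-degree rotation about center
    partner_ok = ~mask[::-1, ::-1]       # partner pixel is valid data
    healed = img.copy()
    fill = mask & partner_ok
    healed[fill] = partner[fill]
    return healed

# Toy example: a symmetric ring pattern with a simulated detector gap.
y, x = np.mgrid[0:64, 0:64] - 31.5       # beam center between pixels
ring = np.exp(-((np.hypot(x, y) - 20.0) ** 2) / 8.0)
mask = np.zeros(ring.shape, dtype=bool)
mask[5:15, 40:55] = True                 # simulated detector gap
damaged = np.where(mask, 0.0, ring)
healed = heal_by_symmetry(damaged, mask)
print(np.abs(healed - ring)[mask].max())  # gap error is ~0: recovered by symmetry
```

A production version would detect which symmetry actually holds, handle gaps that overlap their own partners, and fall back to physics-constrained interpolation there.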
  16. CHX - Deep Learning Classification of X-ray Scattering Data
      Scientific Achievement: Modern deep learning methods were applied to the automated recognition and classification of x-ray scattering data.
      Significance and Impact: Experimental scientific data can be automatically and accurately classified, allowing autonomous experimental decision-making.
      Research Details: Modern x-ray synchrotrons generate data at a massive rate; automated data classification is critically required to deal with the data rates. Modern deep learning methods were used to recognize and classify x-ray scattering data. Neural network training was augmented by inputting a large corpus of synthetic images; because the experimental physics is understood, these images are highly realistic and allow the network to implicitly learn the underlying physics.
      Caption - Deep learning for scientific data: (Top) A convolutional neural network converts input x-ray scattering images into a concise descriptor that can be used for classification. (Bottom) X-ray scattering images are complex and diverse; owing to the limited amount of tagged experimental data, highly realistic synthetic data is generated to train the CNN.
      Boyu Wang, et al., New York Scientific Data Summit, 1 (2016).
      Boyu Wang, et al., Applications of Computer Vision (WACV), 1, 697 (2017).
      Jiliang Liu, et al., IUCrJ 4, 455 (2017).
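The slide's key trick - training on a large corpus of synthetic images whose labels come for free from the simulator - can be sketched as follows. The two toy pattern classes ("ring" vs. "peaks"), the crude forward model, and the use of scikit-learn's LogisticRegression in place of a deep CNN are all illustrative assumptions; the published work used far richer physics simulations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def synth_image(kind, size=32):
    """Generate one synthetic 'scattering' image with a known label.

    Two toy pattern classes stand in for real scattering motifs:
    'ring'  - an isotropic powder-like ring,
    'peaks' - a few sharp Bragg-like spots.
    A real simulator would add detector noise, beamstop shadows, etc.
    """
    yy, xx = np.mgrid[0:size, 0:size] - size / 2
    r = np.hypot(xx, yy)
    if kind == "ring":
        img = np.exp(-((r - size / 4) ** 2) / 4.0)
    else:
        img = np.zeros((size, size))
        for _ in range(6):
            cy, cx = rng.integers(4, size - 4, size=2)
            img[cy, cx] = 1.0
    return img + 0.05 * rng.standard_normal((size, size))

# Build a synthetic corpus with free labels, then fit a simple classifier.
kinds = ["ring", "peaks"]
labels = np.array([i % 2 for i in range(200)])
X = np.array([synth_image(kinds[k]).ravel() for k in labels])
clf = LogisticRegression(max_iter=2000).fit(X, labels)
print(clf.score(X, labels))   # training accuracy on the synthetic corpus
```

The payoff is the same as on the slide: because labels are generated rather than hand-tagged, the training set can be made arbitrarily large.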
  17. CHX - Multi-level Visual Data Exploration and Analysis for Heterogeneous Scientific Images
      Scientific Achievement: A multi-level exploration and visualization tool for scientific image datasets.
      Significance and Impact: Addresses the challenge of exploring and visualizing an extremely large image set in which each image has a number of heterogeneous attributes.
      Research Details: Presents a level-of-detail data exploration mechanism to interactively zoom in and out across multiple layers of representations of the image set: attribute layer, raw image layer and pixel layer. Supports attribute comparison and correlation analysis. Enables attribute statistics display and cross-filter operations to interactively select the data of interest.
      Caption - Level-of-detail data exploration mechanism: a data exploration style similar to Google Maps lets users iteratively zoom across multiple layers of the information associated with the image set: (0) a scatterplot of the whole dataset with chosen X and Y axes and one attribute color-coded; (1) a zoomed view of the selected images with a pseudocolor plot showing one attribute; (2) a further zoom of (1); (3) the matrix of raw images from which the attributes in (2) are generated; (4) a single raw image; (5) the pixel values of a selected region in (4).
  18. XANES - Solving the Structure of Nanoparticles by Machine Learning
      Scientific Achievement: A new method was developed to decipher the structure of nanoparticle catalysts on the fly from their X-ray absorption spectra.
      Significance and Impact: Tracking the structure of catalysts under real working conditions is a challenge due to the paucity of experimental techniques that can measure the number of atoms and the distances between metal atoms with high precision. Accurate structural analysis at the nanoscale, now possible with the new method even in the harsh conditions of high temperature and pressure, will enable new possibilities for tuning the reactivity of catalysts at the high-flux, high-energy-resolution beamlines of NSLS-II and other advanced synchrotron facilities.
      Research Details: Using machine learning methods, previously "hidden" relationships between the features in the X-ray absorption spectra and the geometry of nanocatalysts were found.
      Caption: A sketch of the new method that enables fast, "on-the-fly" determination of the three-dimensional structure of nanocatalysts. The colored curves are synchrotron X-ray absorption spectra collected in real time, during catalytic reaction (in operando). The neural network (white circles and lines) converts the spectra into geometric information (such as nanoparticle sizes and shapes), and structural models are obtained for each spectrum.
      J. Timoshenko, D. Lu, Y. Lin, A. I. Frenkel, J. Phys. Chem. Lett., DOI: 10.1021/acs.jpcc.7b06270
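A hedged sketch of the spectrum-to-structure idea on this slide: generate training spectra from a forward model parameterized by a structural descriptor, then train a small neural network to invert that mapping. The toy forward model (an absorption-edge sigmoid plus a peak whose position and height shift with the descriptor) and the use of scikit-learn's MLPRegressor are assumptions; the published work maps real XANES spectra to coordination numbers.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
energies = np.linspace(0.0, 10.0, 100)   # toy energy grid

def toy_spectrum(coord_num):
    """Toy 'spectrum' whose edge position and peak height shift with one
    structural descriptor (a stand-in for the coordination numbers used
    in the published XANES work)."""
    edge = 1.0 / (1.0 + np.exp(-(energies - 3.0 - 0.1 * coord_num)))
    peak = 0.05 * coord_num * np.exp(-((energies - 5.0) ** 2) / 2.0)
    return edge + peak + 0.01 * rng.standard_normal(energies.size)

# Descriptors to learn, and the spectra they produce.
cn = rng.uniform(3.0, 12.0, size=500)
X = np.array([toy_spectrum(c) for c in cn])

# Invert the forward model with a small neural network.
net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=5000, random_state=0)
net.fit(X, cn)
err = float(np.abs(net.predict(X) - cn).mean())
print(f"mean |error| on training spectra: {err:.2f}")
```

The "on-the-fly" aspect then follows: once trained, a single forward pass per incoming spectrum is cheap enough to run during the experiment.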
  19. Computational Modeling Infrastructure - Multi-Source Data Analysis
      Scientific Goal: Operational integration of multi-modal experimental and numerical modeling results.
      Achievement: The DiffPy-CMI software determines stable structures for complex materials using results from different experimental modalities in combination with ab initio and atomistic-scale modeling (MatDeLab, NWChem).
      Impact: DiffPy-CMI has supported 20 materials studies since its first release; the 2014 release was downloaded ~1300 times (several hundred users), the 2016 release ~130 times (~50 users). In use at the NSLS-II XPD beamline; similar approaches are under development, e.g., at the NSLS-II ISS beamline to interpret in-operando device performance behavior, and are also of interest to the NSLS-II FMX, AMX, and NYX beamlines.
      www.diffpy.org - Results from a complex modeling study published in Nature Communications.
  20. XPCS - Python Multiprocessing for X-ray Photon Correlation Spectroscopy
      Scientific Achievement: Using the Python multiprocessing module, we have implemented a parallel version of the analysis code for X-ray Photon Correlation Spectroscopy (XPCS) at BNL's NSLS II. The implementation achieves up to 75% efficiency compared to ideal speedup for the test cases we studied.
      Significance and Impact: X-ray photon correlation spectroscopy is an important technique for studying the nano- and meso-scale dynamics of materials. The high resolutions and high incoming image rates expected at the NSLS II beamlines make it essential to have a software toolkit that can process the image data as fast as possible. The parallel implementation of the Python XPCS analysis code is the first step towards the efficient use of the available computing resources at the beamlines and other institutional computing resources.
      Research Details: The XPCS analysis code is implemented within the scikit-beam library being developed at BNL. The Python multiprocessing module is used, with different processes computing the time correlations of different regions of interest (ROIs); a first-in-first-out (FIFO) queue is used to consolidate results from the different processes.
      S. K. Abeykoon, M. Lin and K. K. van Dam, Proceedings of New York Scientific Data Summit 2017.
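The slide names the concrete ingredients - the multiprocessing module, one process per region of interest (ROI), and a FIFO queue to consolidate results - which can be sketched as below. The simplified g2 estimator, the Poisson stand-in data, and the array shapes are assumptions, not the scikit-beam implementation.

```python
import numpy as np
from multiprocessing import Process, Queue

def roi_autocorr(roi_id, series, out_queue):
    """Worker: intensity autocorrelation g2(tau) for one ROI's time series."""
    mean_sq = series.mean() ** 2
    g2 = [float(np.mean(series[:series.size - tau] * series[tau:]) / mean_sq)
          for tau in range(1, 11)]
    out_queue.put((roi_id, g2))            # FIFO queue consolidates results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.poisson(5.0, size=(1000, 4)).astype(float)  # 1000 frames, 4 ROIs
    q = Queue()
    workers = [Process(target=roi_autocorr, args=(i, frames[:, i], q))
               for i in range(4)]
    for w in workers:
        w.start()
    results = dict(q.get() for _ in workers)  # drain the queue before joining
    for w in workers:
        w.join()
    print(sorted(results))                    # [0, 1, 2, 3]
```

Draining the queue before joining matters: a worker blocked on a full queue never exits, so `join()` before `get()` can deadlock.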
  21. Who we are
  22. Computational Science Initiative
      • Established in 2015
      • An umbrella to bring together computing and data expertise across the lab
      • Aims to foster cross-disciplinary collaborations in areas of computational sciences, applied math, computer science and data analytics
      • Aims to drive developments in programming models, advanced algorithms and novel computer architectures to advance scientific computing and data analysis
      Director: Kerstin Kleese van Dam; Deputy Director: Francis Alexander
  23. Computational Science Initiative - groups including Computing for National Security. Directors: Barbara Chapman, Eric Lancon, Shantenu Jha, Nick D'Imperio, Adolfy Hoisie.
  24. Key research initiatives: Making Sense of Data at Exabyte Scale and Beyond
      • Real-Time Analysis of Ultra-High-Throughput Data Streams - integrated, extreme-scale machine learning and visual analytics methods for real-time analysis and interpretation of streaming data; new in-situ and in-operando experiments at large-scale facilities (e.g., NSLS-II, CFN and RHIC); data-intensive science workflows possible in the Exascale Computing Project; analysis on the wire
      • Autonomous Optimal Experimental Design - a goal-driven capability that optimally leverages theory, modeling/simulation and experiments for the autonomous design and execution of experiments; complex modeling infrastructure
      • Interactive Exploration of Multi-Petabyte Scientific Data Sets - common in nuclear physics, high energy physics, computational biology and climate science; integrated research into the required novel hardware, system software, programming models, analysis and visual analytics paradigms
  25. Thank You!
