High-performance computing resources are now widely used across science and engineering. Typical post-hoc approaches save simulation output to persistent storage, so analysis tasks must read the data from storage back into memory. For large-scale scientific simulations, this I/O produces significant overhead. In-situ/in-transit approaches bypass the I/O by accessing and processing in-memory simulation results directly, which requires simulations and analysis applications to be more closely coupled. This paper constructs a flexible and extensible framework that connects scientific simulations with multi-step machine learning processes and in-situ visualization tools, thus providing pluggable analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulence flows.
PEARC17: A real-time machine learning and visualization framework for scientific workflows
1. A real-time machine learning and
visualization framework for
scientific workflows
Feng Li, Fengguang Song
Indiana University-Purdue University Indianapolis
Email: lifen@iupui.edu, fgsong@iupui.edu
2. Outline
• Background
• A non-parametric anomaly detection method
• A scientific workflow & software framework
• Experiments
• Conclusion and future work
3. Background
• Recent advancements in parallel computing and HPC
• Intensive computation helps solve extremely large problems
• This also brings huge amounts of data.
• Much faster interconnects (InfiniBand -> RDMA)
• How do we deal with such large data?
• Post-hoc? (simulation results are flushed to disk)
• In-situ? (closely coupled)
• Something in between?
4. Tuple Space & Dataspaces
• Tuple Space
• Data is accessed by content and type, rather than by raw memory address.
• Ability to describe data without referring to any computer architecture!
• An ideal model for coupling different simulation and analysis applications.
• Dataspaces
• Developed at Rutgers University (Docan, 2012)
• Derived from the concept of tuple space
• Couples different applications together using RDMA
• Example interface:
• dspaces_put(var_name, version, elem_size, boundaries, pdata)
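As a rough illustration of the tuple-space idea (a toy in-memory sketch, not the real DataSpaces API or its RDMA transport), data can be indexed by name, version, and region bounds instead of memory addresses:

```python
# Toy tuple-space store: values are addressed by (name, version, bounds),
# never by raw memory address. Illustrative only; DataSpaces itself is a
# distributed C library that moves such objects over RDMA.
class TupleSpace:
    def __init__(self):
        self._store = {}

    def put(self, var_name, version, bounds, data):
        # bounds describes the spatial region, e.g. ((x0, y0), (x1, y1))
        self._store[(var_name, version, bounds)] = data

    def get(self, var_name, version, bounds):
        return self._store[(var_name, version, bounds)]


space = TupleSpace()
space.put("velocity", 0, ((0, 0), (63, 63)), [1.0, 2.0, 3.0])
print(space.get("velocity", 0, ((0, 0), (63, 63))))  # [1.0, 2.0, 3.0]
```

A simulator calls `put` on each timestep and an analysis application calls `get` with the same descriptor, so neither side needs to know where the other stores its data.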
5. Vortex detection in Turbulence flow
• Vortex?
• Swirling motion of fluid around a
center region.
• Precise definition?
• Why does it matter?
• How to identify it?
6. Vortex detection in Turbulence flow
• A region-based non-parametric anomaly detection method (Póczos, 2012)
• Simulation data is divided into "regions"
• Each region (a collection of data points) can be regarded as a random sample from a
distribution with density p.
• Features: (vx, vy, dc)
• v: velocity in both dimensions
• dc: distance to the vortex center
[Figure: example regions A, B, C, D]
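The slides do not show the feature-extraction step itself; a minimal numpy sketch (function and argument names are mine) of building the per-point (vx, vy, dc) vectors for one region might look like:

```python
import numpy as np

def region_features(vx, vy, cx, cy, xs, ys):
    """Build per-point feature vectors (vx, vy, dc) for one region.

    vx, vy : 2-D arrays of velocity components on the region's grid
    cx, cy : coordinates of the region center (dc = distance to it)
    xs, ys : 1-D coordinate arrays for the grid axes
    """
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    dc = np.sqrt((X - cx) ** 2 + (Y - cy) ** 2)
    # one row per grid point: (vx, vy, distance-to-center)
    return np.stack([vx.ravel(), vy.ravel(), dc.ravel()], axis=1)
```

Each region's rows are then treated as i.i.d. samples from that region's density p.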
7. Vortex detection in Turbulence flow,
Cont.
• The density at each data point can be estimated using a kNN method:
• p̂_k(X_i) = k / ((n − 1) · c · v_k(i)), where v_k(i) is the volume of the ball around X_i whose
radius is the distance to its k-th nearest neighbor, and c is a normalizing constant.
• Divergence (distance between regions)
• L2 divergence: L2(p ‖ q) = ( ∫ (p(x) − q(x))² dx )^(1/2)
• Once we know how to measure the 'difference' between two regions, we can use a
distance-based clustering method (e.g., k-medoids).
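A small numpy sketch of this kNN density estimator (function names are my own; c is taken as the volume of the d-dimensional unit ball):

```python
from math import pi, gamma

import numpy as np

def knn_density(sample, x, k=5):
    """kNN density estimate at x: p_hat = k / ((n - 1) * c * r_k^d),
    where r_k is the distance from x to its k-th nearest neighbor in
    `sample` and c is the volume of the unit ball in d dimensions."""
    n, d = sample.shape
    r_k = np.sort(np.linalg.norm(sample - x, axis=1))[k - 1]
    c = pi ** (d / 2) / gamma(d / 2 + 1)
    return k / ((n - 1) * c * r_k ** d)
```

Plugging such density estimates for two regions into the L2 integral gives a sample-based divergence estimate between the regions.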
8. Demonstration of the Workflow
• Divide → Divergence → Generate new medoids → Re-assign cluster IDs
• Different applications are coupled using Dataspaces.
• Components in the workflow:
• Simulators
• Data processing
• Data analysis (anomaly detection)
• Visualization tool (ParaView Catalyst)
9. Distributed Sampling
• The original k-medoids method does not scale well with large datasets
• O(n²), where n is the number of regions
• Data processing becomes the bottleneck
• CLARA (Clustering LARge Applications, Kaufman 2009)
• Such small-granularity operations are expensive in Dataspaces ("boundaries")
• Reorganize the data:
• A single Dataspaces operation then operates on a larger chunk of data.
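CLARA is not spelled out in the slides; under its usual description (run PAM on random subsamples, keep the medoids that cost least over the full dataset) a compact sketch, with my own function names, is:

```python
import random

def pam(points, k, dist, iters=20):
    """Naive PAM k-medoids over hashable items with distance function dist."""
    medoids = random.sample(points, k)
    for _ in range(iters):
        # assign every point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        # each cluster's new medoid minimizes total in-cluster distance
        new_medoids = [min(members, key=lambda c: sum(dist(c, q) for q in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

def clara(points, k, dist, sample_size=40, n_samples=5):
    """CLARA: PAM on subsamples, keep medoids with lowest full-dataset cost."""
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, dist)
        cost = sum(min(dist(p, m) for m in medoids) for p in points)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best
```

Because each PAM run touches only a subsample, the O(n²) cost applies to the sample size rather than the full region count, which is what makes the distributed variant tractable.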
10. Experiments
• Testbed: Karst at Indiana University.
• Other tools used:
• Johns Hopkins Turbulence Databases
• ParaView Catalyst (co-processing library, Fabian 2011)
• Originally designed for the 'in-situ' approach
• A generic visualization framework that supports "live visualization"
• OpenFOAM (open-source CFD toolbox)
• A DataSpaces adaptor replaces the default writer.
11. Experiment with JHTDB data
• Data description
• Forced isotropic dataset from the Johns Hopkins Turbulence Databases (JHTDB)
• 1024*1024*1024 grid, 5028 frames
• Well formatted in VTK/HDF5
• All regions are clustered into:
• Steady flow
• Unsteady flow
• Random flow
• Client side
• Paraview GUI
• Two views are linked together.
12. Execution Time Breakdown
• Analyze the communication overhead and efficiency of the framework
• How time is spent in each component of the workflow
• Average data transfer time and computation time for all four
applications (simulator, data processing, data analysis, Catalyst visualization)
13. Strong Scalability
• How well does the workflow scale with more computing resources?
• Fixed input size of 4096*4096 (the larger case) per timestep, which contains 1 GB of
velocity/pressure data
• More processes for all applications?
• Metric: Latency_produce / Latency_consume
14. Conclusion
• A flexible and extensible software framework
• A parallel non-parametric clustering method
• Using DataSpaces greatly reduces I/O time and scales well.
16. References
1. Ciprian Docan, Manish Parashar, and Scott Klasky. 2012. DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing 15, 2 (2012), 163–181.
2. Barnabás Póczos, Liang Xiong, and Jeff Schneider. 2012. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758 (2012).
3. Nathan Fabian, Kenneth Moreland, David Thompson, Andrew C. Bauer, Pat Marion, Berk Geveci, Michel Rasquin, and Kenneth E. Jansen. 2011. The ParaView coprocessing library: A scalable, general purpose in situ visualization library. In Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on. IEEE, 89–96.
4. Leonard Kaufman and Peter J. Rousseeuw. 2009. Finding groups in data: an introduction to cluster analysis. Vol. 344. John Wiley & Sons.