A real-time machine learning and
visualization framework for
scientific workflows
Feng Li, Fengguang Song
Indiana University-Purdue University Indianapolis
Email: lifen@iupui.edu, fgsong@iupui.edu
Outline
• Background
• A non-parametric anomaly detection method
• A scientific workflow & software framework
• Experiments
• Conclusion and future work
Background
• Recent advances in parallel computing and HPC
  • Intensive computation helps solve extremely large problems.
  • It also produces huge amounts of data.
  • Interconnects are much faster (InfiniBand with RDMA).
• How do we deal with such large data?
  • Post hoc? (simulation results are flushed to disk first)
  • In situ? (analysis closely coupled with the simulation)
  • Something in between?
Tuple Space & DataSpaces
• Tuple space
  • Data is accessed by content and type rather than by raw memory address.
  • Data can be described without referring to any computer architecture.
  • An ideal model for coupling different simulation and analysis applications.
• DataSpaces
  • Developed at Rutgers University (Docan, 2012)
  • Derived from the concept of a tuple space
  • Couples different applications together using RDMA
  • Example interface:
    • dspaces_put(var_name, version, elem_size, boundaries, pdata)
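The put/get idea behind the example interface can be sketched with a toy in-memory store. This is only an illustration of addressing data by name, version, and bounding box; `ToySpace`, its methods, and the box layout are hypothetical and are not the DataSpaces API, which is a C library that moves data between processes over RDMA.

```python
from typing import Dict, List, Tuple

# ((x_lo, y_lo), (x_hi, y_hi)), both corners inclusive
Box = Tuple[Tuple[int, int], Tuple[int, int]]


class ToySpace:
    """In-memory stand-in for a shared space (hypothetical, not DataSpaces)."""

    def __init__(self) -> None:
        # (var_name, version) -> list of (bounding box, row-major 2D block)
        self._store: Dict[Tuple[str, int], List[Tuple[Box, List[List[float]]]]] = {}

    def put(self, var_name: str, version: int, box: Box,
            block: List[List[float]]) -> None:
        """Publish a 2D block that covers `box` under a name and version."""
        self._store.setdefault((var_name, version), []).append((box, block))

    def get(self, var_name: str, version: int, box: Box) -> List[List[float]]:
        """Assemble the requested box from whichever stored blocks overlap it."""
        (qx0, qy0), (qx1, qy1) = box
        out = [[float("nan")] * (qy1 - qy0 + 1) for _ in range(qx1 - qx0 + 1)]
        for ((bx0, by0), (bx1, by1)), block in self._store.get((var_name, version), []):
            for x in range(max(qx0, bx0), min(qx1, bx1) + 1):
                for y in range(max(qy0, by0), min(qy1, by1) + 1):
                    out[x - qx0][y - qy0] = block[x - bx0][y - by0]
        return out


space = ToySpace()
# Two "simulation ranks" each publish half of a 2 x 4 field for version 0.
space.put("velocity_x", 0, ((0, 0), (1, 1)), [[1.0, 2.0], [5.0, 6.0]])
space.put("velocity_x", 0, ((0, 2), (1, 3)), [[3.0, 4.0], [7.0, 8.0]])
# An "analysis rank" asks for a box that straddles both published blocks.
middle = space.get("velocity_x", 0, ((0, 1), (1, 2)))
print(middle)
```

The consumer never needs to know which producer wrote which part of the field; it only names the variable, the version, and the region it wants.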
Vortex detection in turbulent flow
• What is a vortex?
  • A swirling motion of fluid around a central region.
  • Is there a precise definition?
• Why does it matter?
• How can it be identified?
Vortex detection in turbulent flow
• A region-based, non-parametric anomaly detection method (Póczos, 2012)
• Simulation data is divided into "regions".
• Each region (a collection of data points) can be regarded as a random sample from a distribution with density p.
• Features: (vx, vy, dc)
  • vx, vy: velocity in the two dimensions
  • dc: distance to the center
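A minimal sketch of the "divide into regions" step, assuming a 2D velocity field stored as nested lists, square regions, and dc measured from each grid point to the geometric center of its own region. The function name `region_features` and that reading of "distance to center" are assumptions, not details given in the slides.

```python
import math


def region_features(vx, vy, region_size):
    """Split a 2D velocity field into square regions.

    Returns {(region_row, region_col): [(vx, vy, dc), ...]}, where dc is the
    distance from a grid point to the center of its region (assumed meaning).
    The grid size is assumed to be a multiple of region_size.
    """
    rows, cols = len(vx), len(vx[0])
    half = (region_size - 1) / 2.0
    regions = {}
    for i in range(rows):
        for j in range(cols):
            r, c = i // region_size, j // region_size
            center_i = r * region_size + half
            center_j = c * region_size + half
            dc = math.hypot(i - center_i, j - center_j)
            regions.setdefault((r, c), []).append((vx[i][j], vy[i][j], dc))
    return regions


# A 4 x 4 rigid rotation about the grid center, used only as sample input.
n = 4
center = (n - 1) / 2.0
vx = [[-(i - center) for j in range(n)] for i in range(n)]
vy = [[(j - center) for j in range(n)] for i in range(n)]

regions = region_features(vx, vy, region_size=2)
print(len(regions), "regions,", len(regions[(0, 0)]), "points each")
```

Each region then becomes one sample set of three-dimensional feature vectors, which is the form the density and divergence estimates operate on.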
Vortex detection in turbulent flow, cont.
• The density at each data point can be estimated with kNN methods:
  $\hat{p}_k(X_i) = \dfrac{k}{(n - 1)\, c\, v_k(i)}$
• Divergence (the distance between regions)
  • L2 divergence: $L(p \,\|\, q) = \left( \int \bigl(p(x) - q(x)\bigr)^2 \, dx \right)^{1/2}$
• Once we know how to measure the "difference" between two regions, we can use a distance-based clustering method (e.g., k-medoids).
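The two estimates on this slide can be sketched in plain Python. This is a simplified plug-in version, not the exact estimator from Póczos et al.: it assumes c is the volume of the d-dimensional unit ball, takes the volume term as the k-th nearest-neighbor distance raised to the power d, and uses the identities ∫p² = E_p[p], ∫pq = E_p[q], ∫q² = E_q[q]. All function names are hypothetical.

```python
import math
import random


def unit_ball_volume(d):
    """Volume of the d-dimensional unit ball (assumed meaning of c)."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)


def knn_density(x, sample, k, exclude_self=False):
    """kNN density estimate at x: k / (n_eff * c * rho_k(x)^d)."""
    d = len(x)
    dists = sorted(math.dist(x, s) for s in sample)
    if exclude_self:
        dists = dists[1:]           # x is in the sample: drop its zero distance
    rho = max(dists[k - 1], 1e-12)  # distance to the k-th nearest neighbor
    return k / (len(dists) * unit_ball_volume(d) * rho ** d)


def l2_divergence(xs, ys, k=5):
    """Plug-in estimate of sqrt(integral of (p - q)^2) from two samples."""
    p_on_x = sum(knn_density(x, xs, k, exclude_self=True) for x in xs) / len(xs)
    q_on_x = sum(knn_density(x, ys, k) for x in xs) / len(xs)
    q_on_y = sum(knn_density(y, ys, k, exclude_self=True) for y in ys) / len(ys)
    # The estimate of the squared divergence can dip below zero; clamp it.
    return math.sqrt(max(p_on_x - 2.0 * q_on_x + q_on_y, 0.0))


def gaussian_region(cx, cy, n=200):
    """Synthetic 2D 'region' used only to exercise the estimator."""
    return [(random.gauss(cx, 1.0), random.gauss(cy, 1.0)) for _ in range(n)]


random.seed(0)
a = gaussian_region(0.0, 0.0)
b = gaussian_region(0.0, 0.0)   # drawn from the same distribution as a
c = gaussian_region(5.0, 5.0)   # drawn from a clearly different distribution
near = l2_divergence(a, b)
far = l2_divergence(a, c)
print(near, far)
```

Regions drawn from the same distribution should come out much closer to each other than regions drawn from different ones, which is what lets a distance-based clustering method separate them.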
Demonstration of the Workflow
• Divide → Divergence → Generate new medoids → Re-assign cluster IDs
• Different applications are coupled using DataSpaces.
• Components in the workflow:
  • Simulators
  • Data processing
  • Data analysis (anomaly detection)
  • Visualization tool (ParaView Catalyst)
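The "Generate new medoids → Re-assign cluster IDs" loop can be sketched as a basic alternating k-medoids over a precomputed divergence matrix. This is a generic single-process sketch, not the framework's parallel implementation; the function names are hypothetical, and the matrix below is built from made-up 1D positions standing in for region-to-region divergences.

```python
def assign(dist, medoids):
    """Re-assign cluster IDs: each region goes to its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def update_medoids(dist, labels, medoids):
    """Generate new medoids: per cluster, the member with the smallest
    total divergence to the other members."""
    new = []
    for c in range(len(medoids)):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:             # keep the old medoid for an empty cluster
            new.append(medoids[c])
            continue
        new.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
    return new


def k_medoids(dist, medoids, max_iter=100):
    """Alternate the two steps until the medoids stop changing."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new = update_medoids(dist, labels, medoids)
        if new == medoids:
            break
        medoids = new
    return medoids, assign(dist, medoids)


# Stand-in divergence matrix: five "regions" placed on a line.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in positions] for a in positions]

# Start from a deliberately poor choice of medoids.
medoids, labels = k_medoids(dist, medoids=[0, 1])
print(medoids, labels)
```

Only the pairwise divergences are needed, so the clustering step is independent of how the features or densities were computed upstream.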
Distributed Sampling
• The original k-medoids method does not scale well to large datasets.
  • $O(n^2)$, where n is the number of regions
  • Data processing becomes the bottleneck.
• CLARA (Clustering LARge Applications; Kaufman, 2009)
  • Such fine-grained operations are expensive in DataSpaces ("boundaries").
• Reorganize the data:
  • A single DataSpaces operation then works on a larger chunk of data.
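The sampling idea behind CLARA can be sketched as follows: choose medoids on a few small random samples and keep the set that is cheapest on the full data, so the quadratic work is confined to the samples. In this sketch an exhaustive search over medoid sets stands in for the PAM step CLARA normally runs on each sample, and the function names, sample sizes, and 1D data are all illustrative assumptions.

```python
import itertools
import random


def total_cost(dist_fn, points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist_fn(p, m) for m in medoids) for p in points)


def clara(points, k, dist_fn, sample_size, n_samples=5, seed=0):
    """CLARA-style clustering on samples, evaluated on the full data."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        # Exhaustive search replaces PAM here; viable only for small samples.
        medoids = min(itertools.combinations(sample, k),
                      key=lambda ms: total_cost(dist_fn, sample, ms))
        cost = total_cost(dist_fn, points, medoids)  # linear in len(points)
        if cost < best_cost:
            best, best_cost = list(medoids), cost
    labels = [min(range(k), key=lambda c: dist_fn(p, best[c])) for p in points]
    return best, labels


# Two well-separated groups of 1D "regions" as stand-in data.
points = [float(i) for i in range(20)] + [100.0 + i for i in range(20)]
best, labels = clara(points, k=2, dist_fn=lambda a, b: abs(a - b), sample_size=8)
print(best, labels)
```

The full pairwise matrix is never built: each sample costs a small fixed amount, and checking a candidate medoid set against all n regions takes n·k distance evaluations.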
Experiments
• Testbed: Karst, Indiana University
• Other tools used:
  • Johns Hopkins Turbulence Databases
  • ParaView Catalyst (co-processing library; Fabian, 2011)
    • Originally designed for the "in situ" approach
    • Generic visualization framework that supports "live visualization"
  • OpenFOAM (open-source CFD toolbox)
    • DataSpaces adaptor for the default writer
Experiment with JHTDB data
• Data description
  • Forced isotropic turbulence dataset from the Johns Hopkins Turbulence Databases (JHTDB)
  • 1024 × 1024 × 1024 grid, 5028 frames
  • Well formatted in VTK/HDF5
• All regions are clustered into:
  • Steady flow
  • Unsteady flow
  • Random flow
• Client side
  • ParaView GUI
  • Two views are linked together.
Execution Time Breakdown
• Analyzes the communication overhead and efficiency of the framework
• Shows how time is spent in each component of the workflow
• Average data transfer time and computation time for all four applications (simulator, data processing, data analysis, Catalyst visualization)
Strong Scalability
• How well does the workflow scale with more computing resources?
• Fixed input size of 4096 × 4096 (larger) in each time step, which contains 1 GB of velocity/pressure data
• More processes for all applications?
• Latency_produce / Latency_consume
Conclusion
• A flexible and extensible software framework
• A parallel non-parametric clustering method
• Using DataSpaces greatly reduces I/O time and scales well.
Thanks
• Questions?
References
1. Ciprian Docan, Manish Parashar, and Scott Klasky. 2012. DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing 15, 2 (2012), 163–181.
2. Barnabás Póczos, Liang Xiong, and Jeff Schneider. 2012. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758 (2012).
3. Nathan Fabian, Kenneth Moreland, David Thompson, Andrew C. Bauer, Pat Marion, Berk Geveci, Michel Rasquin, and Kenneth E. Jansen. 2011. The ParaView coprocessing library: A scalable, general purpose in situ visualization library. In Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on. IEEE, 89–96.
4. Leonard Kaufman and Peter J. Rousseeuw. 2009. Finding groups in data: an introduction to cluster analysis. Vol. 344. John Wiley & Sons.
Special Thanks go to

PEARC17: A real-time machine learning and visualization framework for scientific workflows
