A real-time machine learning and
visualization framework for
scientific workflows
Feng Li, Fengguang Song
Indiana University-Purdue University Indianapolis
Email: lifen@iupui.edu, fgsong@iupui.edu
Outline
• Background
• A non-parametric anomaly detection method
• A scientific workflow & software framework
• Experiments
• Conclusion and future work
Background
• Recent advances in parallel computing and HPC
  • Intensive computation helps solve extremely large problems.
  • It also produces huge amounts of data.
  • Interconnects are much faster (InfiniBand → RDMA).
• How do we deal with such large data?
  • Post-hoc? (simulation results are flushed to disk first)
  • In-situ? (analysis is closely coupled with the simulation)
  • Something in between?
Tuple Space & DataSpaces
• Tuple Space
  • Data is accessed by content and type rather than by raw memory address.
  • Data can be described without referring to any particular computer architecture.
  • An ideal model for coupling different simulation and analysis applications.
• DataSpaces
  • Developed at Rutgers University (Docan, 2012)
  • Derived from the concept of a tuple space
  • Couples different applications together using RDMA
  • Example interface:
    • dspaces_put(var_name, version, elem_size, boundaries, pdata)
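The idea of addressing data by name, version, and bounding box rather than by memory address can be illustrated with a small in-memory toy. This is only a sketch: the class `ToyTupleSpace` and its `put`/`get` methods are hypothetical names invented for this illustration, they are not the DataSpaces API, and nothing here involves RDMA or multiple processes.

```python
from itertools import product


class ToyTupleSpace:
    """Hypothetical in-memory stand-in for a tuple space; NOT the DataSpaces API."""

    def __init__(self):
        # Each entry: (var_name, version, lower corner, upper corner, cells)
        self._blocks = []

    def put(self, var_name, version, lower, upper, values):
        """Publish a block; `values` lists the cells of the box in row-major order."""
        coords = product(*(range(lo, hi + 1) for lo, hi in zip(lower, upper)))
        cells = dict(zip(coords, values))
        self._blocks.append((var_name, version, tuple(lower), tuple(upper), cells))

    def get(self, var_name, version, lower, upper):
        """Return {coordinate: value} for the requested box, whoever wrote the cells."""
        found = {}
        for coord in product(*(range(lo, hi + 1) for lo, hi in zip(lower, upper))):
            for name, ver, lb, ub, cells in self._blocks:
                inside = all(l <= c <= u for l, c, u in zip(lb, coord, ub))
                if name == var_name and ver == version and inside:
                    found[coord] = cells[coord]
        return found


# Two "producers" each publish half of a 4x4 field for version 0.
space = ToyTupleSpace()
space.put("velocity", 0, (0, 0), (1, 3), list(range(0, 8)))    # rows 0-1
space.put("velocity", 0, (2, 0), (3, 3), list(range(8, 16)))   # rows 2-3

# A "consumer" asks for a window that straddles both blocks, by content only.
window = space.get("velocity", 0, (1, 1), (2, 2))
```

The consumer never learns which producer wrote which cell, which is the property that makes this model convenient for coupling independently written applications.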
Vortex detection in turbulent flow
• What is a vortex?
  • A swirling motion of fluid around a central region.
  • Is there a precise definition?
• Why does it matter?
• How can it be identified?
Vortex detection in turbulent flow
• A region-based non-parametric anomaly detection method (Póczos, 2012)
  • Simulation data is divided into "regions".
  • Each region (a collection of data points) can be regarded as a random sample from a distribution with density p.
• Features: (vx, vy, dc)
  • vx, vy: velocity in each of the two dimensions
  • dc: distance to the center
Vortex detection in turbulent flow, cont.
• The density at each data point can be estimated using k-nearest-neighbor (kNN) methods:
  $p_k(X_i) = \dfrac{k}{(n-1)\, c\, v_k(i)}$
• Divergence (distance between regions)
  • L2 divergence: $L(p \,\|\, q) = \left( \int \big(p(x) - q(x)\big)^2 \, dx \right)^{1/2}$
• Once we know how to measure the "difference" between two regions, we can use a distance-based clustering method (e.g., k-medoids).
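The two quantities on this slide can be sketched in a few lines. The sketch assumes the usual kNN density form, in which c is the volume of the d-dimensional unit ball and the last factor is the distance to the k-th nearest neighbour raised to the power d; the slide does not spell these terms out, so that reading is an assumption. The function names are illustrative, and the L2 divergence here is a plain Riemann sum over densities already evaluated on a common grid, not the sample-based estimator of the cited paper.

```python
import math


def unit_ball_volume(d):
    """Volume of the d-dimensional unit ball (the constant c, by assumption)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def knn_density(points, k):
    """kNN density estimate at every sample point of one region.

    points: list of equal-length coordinate tuples; duplicates would give a
    zero neighbour distance and are not handled in this sketch.
    """
    n, d = len(points), len(points[0])
    c = unit_ball_volume(d)
    estimates = []
    for i, x in enumerate(points):
        # Distances to all other points, ascending; entry k-1 is the k-th neighbour.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        rho_k = dists[k - 1]
        estimates.append(k / ((n - 1) * c * rho_k ** d))
    return estimates


def l2_divergence_on_grid(p_vals, q_vals, cell_volume):
    """sqrt of the integral of (p - q)^2, approximated cell by cell."""
    total = sum((p - q) ** 2 for p, q in zip(p_vals, q_vals))
    return math.sqrt(total * cell_volume)


# Four equally spaced 1-D points: every nearest neighbour is at distance 1.
densities = knn_density([(0.0,), (1.0,), (2.0,), (3.0,)], k=1)

# Uniform density on [0, 1] versus an all-zero density, on 100 cells.
gap = l2_divergence_on_grid([1.0] * 100, [0.0] * 100, cell_volume=0.01)
```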
Demonstration of the Workflow
• Divide → Divergence → Generate new medoids → Re-assign cluster ID
• Different applications are coupled using DataSpaces.
• Components in the workflow:
  • Simulators
  • Data processing
  • Data analysis (anomaly detection)
  • Visualization tool (ParaView Catalyst)
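The "generate new medoids, re-assign cluster ID" loop named on this slide can be sketched on a precomputed divergence matrix. This is a minimal single-process illustration using the simple alternating variant of k-medoids; whether the framework uses this variant or a swap-based one is not stated in the slides, and the function names are hypothetical.

```python
import random


def assign(dist, medoids):
    """Cluster ID of every item: the index of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, n_iter=50, seed=0):
    """Alternating k-medoids on an n x n distance (divergence) matrix."""
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(n_iter):
        labels = assign(dist, medoids)                      # re-assign cluster IDs
        updated = list(medoids)
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                # New medoid: the member with the smallest total distance to the rest.
                updated[c] = min(members,
                                 key=lambda m: sum(dist[m][j] for j in members))
        if updated == medoids:                              # converged
            break
        medoids = updated
    return medoids, assign(dist, medoids)


# Six "regions" summarised by one number each; absolute difference stands in
# for the divergence between regions.
values = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = k_medoids(dist, k=2)
```

Building the full matrix needs a divergence for every pair of regions, which is the quadratic cost discussed on the next slide.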
Distributed Sampling
• The original k-medoids method does not scale well to large datasets
  • $O(n^2)$, where n is the number of regions
  • Data processing becomes the bottleneck
• CLARA (Clustering LARge Applications, Kaufman 2009)
• Such fine-grained operations are expensive in DataSpaces ("boundaries")
• Reorganize the data:
  • A single DataSpaces operation then works on a larger chunk of data.
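The sampling idea behind CLARA can be sketched as follows: cluster several small random samples, then score each candidate set of medoids against the full data and keep the cheapest. This is a single-process illustration only; Euclidean distance stands in for the region-to-region divergence, the sample count and sample size are arbitrary choices, and none of the names below come from the framework itself.

```python
import math
import random


def nearest(item, items, medoids):
    """(cluster index, distance) of the medoid closest to item."""
    return min(((c, math.dist(item, items[m])) for c, m in enumerate(medoids)),
               key=lambda pair: pair[1])


def k_medoids(items, candidates, k, rng, n_iter=50):
    """Alternating k-medoids restricted to the sampled indices in candidates."""
    medoids = rng.sample(candidates, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for i in candidates:
            clusters[nearest(items[i], items, medoids)[0]].append(i)
        updated = list(medoids)
        for c, members in enumerate(clusters):
            if members:
                updated[c] = min(
                    members,
                    key=lambda m: sum(math.dist(items[m], items[j]) for j in members))
        if updated == medoids:
            break
        medoids = updated
    return medoids


def clara(items, k, n_samples=5, sample_size=40, seed=0):
    """Keep the sample-derived medoids with the lowest cost on the full data."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    best_cost, best_medoids = math.inf, None
    for _ in range(n_samples):
        sample = rng.sample(indices, min(sample_size, len(items)))
        medoids = k_medoids(items, sample, k, rng)
        # Scoring needs only n * k distances, not n * n.
        cost = sum(nearest(x, items, medoids)[1] for x in items)
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = [nearest(x, items, best_medoids)[0] for x in items]
    return best_medoids, labels, best_cost


# Synthetic 2-D points around two centres, generated only for this example.
data_rng = random.Random(1)
items = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(200)]
items += [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(200)]
medoids, labels, cost = clara(items, k=2)
```

Each sample touches scattered individual items, which is the fine-grained access pattern the slide identifies as expensive and the reason for reorganizing data into larger chunks.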
Experiments
• Testbed: Karst, Indiana University.
• Other tools used:
  • Johns Hopkins Turbulence Databases
  • ParaView Catalyst (co-processing library, Fabian 2011)
    • Originally designed for the "in-situ" approach
    • Generic visualization framework; supports "live visualization"
  • OpenFOAM (open-source CFD toolbox)
    • DataSpaces adaptor for the default writer
Experiment with JHTDB data
• Data description
  • Forced isotropic turbulence dataset from the Johns Hopkins Turbulence Databases (JHTDB)
  • 1024*1024*1024 grid, 5028 frames
  • Well formatted in VTK/HDF5
• All regions are clustered into:
  • Steady flow
  • Unsteady flow
  • Random flow
• Client side
  • ParaView GUI
  • Two views are linked together.
Execution Time Breakdown
• Analyzes the communication overhead and efficiency of the framework
• Shows how time is spent in each component of the workflow
• Average data transfer time and computation time for all four applications (simulator, data processing, data analysis, Catalyst visualization)
Strong Scalability
• How well does the workflow scale with more computing resources?
• Fixed input size of 4096*4096 (larger) in each timestep, containing 1 GB of velocity/pressure data
• More processes for all applications?
• Latency_produce/Latency_consume
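Strong scaling is usually summarised as speedup and parallel efficiency relative to the smallest run at a fixed input size. The helper below is a generic sketch of that arithmetic; the function name is hypothetical and the timings are invented purely to show the calculation. They are not measurements from this work.

```python
def strong_scaling(times_by_procs):
    """Speedup and efficiency for a fixed input size.

    times_by_procs: {process count: seconds per timestep}.
    Returns a list of (process count, speedup, efficiency) relative to the
    smallest process count present.
    """
    base_p = min(times_by_procs)
    base_t = times_by_procs[base_p]
    rows = []
    for p in sorted(times_by_procs):
        speedup = base_t / times_by_procs[p]
        efficiency = speedup / (p / base_p)   # 1.0 means ideal scaling
        rows.append((p, speedup, efficiency))
    return rows


# Made-up timings for illustration only; NOT results from the experiments.
rows = strong_scaling({32: 8.0, 64: 4.0, 128: 2.5})
```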
Conclusion
• A flexible and extensible software framework
• A parallel non-parametric clustering method
• Using DataSpaces greatly reduces I/O time and scales well.
Thanks
• Questions?
References
1. Ciprian Docan, Manish Parashar, and Scott Klasky. 2012. DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing 15, 2 (2012), 163–181.
2. Barnabás Póczos, Liang Xiong, and Jeff Schneider. 2012. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758 (2012).
3. Nathan Fabian, Kenneth Moreland, David Thompson, Andrew C. Bauer, Pat Marion, Berk Geveci, Michel Rasquin, and Kenneth E. Jansen. 2011. The ParaView coprocessing library: A scalable, general purpose in situ visualization library. In Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on. IEEE, 89–96.
4. Leonard Kaufman and Peter J. Rousseeuw. 2009. Finding groups in data: an introduction to cluster analysis. Vol. 344. John Wiley & Sons.
Special thanks go to

PEARC17: A real-time machine learning and visualization framework for scientific workflows
