HPC I/O for Computational Scientists:
Understanding I/O
Presented to
ATPESC 2017 Participants
Rob Latham and Phil Carns
Mathematics and Computer Science Division
Argonne National Laboratory
Q Center, St. Charles, IL (USA)
8/4/2017
ATPESC 2017, July 30 – August 11, 2017
Motivation for characterizing parallel I/O
• Most scientific domains are
increasingly data intensive:
climate, physics, biology and
much more
• Upcoming platforms include
complex hierarchical
storage systems
How can we
maximize productivity
in this environment?
Times are changing in HPC storage!
Example visualizations from
the Human Connectome
Project, CERN/LHC, and the
Parallel Ocean Program
The NERSC burst buffer roadmap and architecture, including solid
state burst buffers that can be used in a variety of ways
Key challenges
• Instrumentation:
– What do we measure?
– How much overhead is acceptable and when?
• Analysis:
– How do we correlate data and extract actionable information?
– Can we identify the root cause of performance problems?
• Impact:
– Develop best practices and tune applications
– Improve system software
– Design and procure better systems
CHARACTERIZING APPLICATION I/O
WITH DARSHAN
What is Darshan?
Darshan is a scalable HPC I/O characterization tool. It captures an
accurate but concise picture of application I/O behavior with
minimum overhead.
• No code changes, easy to use
– Negligible performance impact: just “leave it on”
– Enabled by default at ALCF, NERSC, NCSA, and KAUST
– Installed and available for case by case use at many other sites
• Produces a summary of I/O activity for each job, including:
– Counters for file access operations
– Time stamps and cumulative timers for key operations
– Histograms of access, stride, datatype, and extent sizes
Project began in 2008, first public software
release and deployment in 2009
Darshan design principles
• The Darshan runtime library is inserted at link time (for static
executables) or at run time (for dynamic executables)
• Transparent wrappers for I/O functions collect per-file statistics
• Statistics are stored in bounded memory at each rank
• At shutdown time:
– Collective reduction to merge shared file records
– Parallel compression
– Collective write to a single log file
• No communication or storage operations until shutdown
• Command-line tools are used to post-process log files
JOB analysis example
Example: darshan-job-summary.pl produces a 3-page PDF file
summarizing various aspects of I/O performance, including
estimated performance, percentage of runtime spent in I/O, an
access size histogram, access type histograms, and file usage
SYSTEM analysis example
• With a sufficient archive of
performance statistics, we can
develop heuristics to detect
anomalous behavior
This example highlights large jobs that spent a
disproportionate amount of time managing file
metadata rather than performing raw data transfer
Worst offender spent 99% of I/O time in
open/close/stat/seek
This identification process is not yet automated;
alerts/triggers are needed in future work for greater
impact
Example of heuristics applied to a population of
production jobs on the Hopper system in 2013:
Carns et al., “Production I/O Characterization on the Cray XE6,” In
Proceedings of the Cray User Group meeting 2013 (CUG 2013).
Performance: function wrapping overhead
What is the cost of interposing Darshan I/O instrumentation wrappers?
• To test, we compare observed I/O time of an IOR configuration
linked against different Darshan versions on Edison
• File-per-process workload, 6,000 processes, over 12 million
instrumented calls
Type of Darshan builds now
deployed on Theta and Cori
Why the box plots? Recall the
observation from this morning that
variability is a constant theme in
HPC I/O today.
(note that the Y axis labels start at 40)
Snyder et al. Modular HPC I/O Characterization with Darshan. In Proceedings
of 5th Workshop on Extreme-scale Programming Tools (ESPT 2016), 2016.
Performance: shutdown overhead
• Involves aggregating, compressing, and collectively writing I/O
data records
• To test, synthetic workloads are injected into Darshan and the
resulting shutdown time is measured on Edison
Single shared file: near-constant shutdown time of ~100 ms in all cases.
File-per-process: shutdown time scales linearly with job size, with 5–6 s
of extra shutdown time at 12,000 files.
USING DARSHAN IN PRACTICE
Typical deployment and usage
• Darshan usage on Mira, Cetus, Vesta, Theta,
Cori, or Edison, abridged:
– Run your job
– If the job calls MPI_Finalize(), the log will be stored in
DARSHAN_LOG_DIR/month/day/
– On Theta: /lus/theta-fs0/logs/darshan/theta
– Use tools (next slides) to interpret log
• On Titan: “module load darshan” first
• Links to documentation with details will be
given at the end of this presentation
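As a concrete sketch, locating a log on Theta might look like the following. The job id, run date, and filename pattern here are hypothetical illustrations; only the log directory comes from the slide above.

```shell
# Darshan names logs after the user, executable, and job id; the pattern
# below and job id 1234567 are illustrative, not an exact filename.
ls /lus/theta-fs0/logs/darshan/theta/2017/8/4/*id1234567*.darshan
```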
Generating job summaries
• Run job and find its log file:
• Copy log files to save, generate PDF summaries:
(Screenshot steps: note the job id; find the corresponding log file in
today's directory; copy out the logs and list them; load the "latex"
module if needed; generate the PDF.)
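The commands captured in the screenshots might be reconstructed roughly as follows. The log file name is hypothetical, and the exact LaTeX module name varies by site.

```shell
# Copy the day's log out of the system log directory and list it
cp /lus/theta-fs0/logs/darshan/theta/2017/8/4/user_app_id1234567_*.darshan .
ls *.darshan

# Load a LaTeX module if pdflatex is not already available (name varies by site)
module load texlive

# Generate the 3-page PDF summary
darshan-job-summary.pl user_app_id1234567_*.darshan --output summary.pdf
```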
First page of summary
Common questions:
• Did I spend much time performing I/O?
• What were the access sizes?
• How many files were opened, and
how big were they?
Second page of summary (excerpt)
Common questions:
• Where in the timeline of the execution did each
rank do I/O?
There are additional graphs in the PDF file with increasingly detailed information.
You can also dump all data from the log in text format using “darshan-parser”.
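For example, a minimal darshan-parser invocation might look like this sketch (the log file name is hypothetical):

```shell
# Dump all counters from the log in text form
darshan-parser user_app_id1234567_8-4-40253.darshan > app_io.txt

# Or print aggregate totals summed across files and ranks
darshan-parser --total user_app_id1234567_8-4-40253.darshan
```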
TIPS AND TRICKS: ENABLING ADDITIONAL DATA
CAPTURE
What if you are doing shared-file I/O?
Your timeline might look like this
No per-process information available
because the data was aggregated by
Darshan to save space/overhead
Is that important? It depends on what
you need to learn about your
application.
– It may be interesting for applications
that access the same file in distinct
phases over time
What if you are doing shared-file I/O?
Set environment variable to disable shared file
reductions
Increases overhead and log file size, but provides
per-rank info even on shared files
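Concretely, this is a single environment variable set before the job launches; a minimal sketch:

```shell
# Keep a separate record for every rank, even for files opened by all ranks.
# Trade-off: larger logs and higher shutdown cost, but per-rank detail survives.
export DARSHAN_DISABLE_SHARED_REDUCTION=1
```

With this set, the next instrumented run's timeline shows each rank's activity on shared files.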
Detailed trace data
Set environment variable to enable “DXT” tracing
This causes additional overhead and larger files, but
captures precise access data
Parse trace with “darshan-dxt-parser”
Feature contributed by
Cong Xu and Intel’s High
Performance Data Division
Cong Xu et al., "DXT: Darshan eXtended Tracing,"
Cray User Group Conference 2017
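Putting the two steps together, a sketch (the log file name in the comment is hypothetical):

```shell
# Enable Darshan eXtended Tracing for subsequent runs (more overhead, larger logs)
export DXT_ENABLE_IO_TRACE=1

# After the job completes, extract the per-operation trace as text:
#   darshan-dxt-parser user_app_id1234567_8-4-40253.darshan > app_dxt.txt
```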
DARSHAN FUTURE WORK
What’s new?
Modularized instrumentation
• Frequently asked question:
Can I add instrumentation for X?
• Darshan has been re-architected as a
modular framework to help facilitate this,
starting in v3.0
Snyder et al. Modular HPC I/O Characterization with
Darshan. In Proceedings of 5th Workshop on Extreme-
scale Programming Tools (ESPT 2016), 2016.
Self-describing log format
Darshan Module example
• We are using the modular
framework to integrate more data
sources and simplify the
connections between various
components in the stack
• This is a good way for
collaborators to get involved in
Darshan development
The need for HOLISTIC characterization
• We’ve used Darshan to improve application productivity with case
studies, application tuning, and user education
• ... But challenges remain:
– What other factors influence performance?
– What if the problem is beyond a user’s control?
– The user population evolves over time; how do we stay engaged?
“I observed performance XYZ. Now what?”
• A climate vs. weather analogy: It is snowing in Atlanta, Georgia.
Is that normal?
• You need context to know:
– Does it ever snow there?
– What time of year is it?
– What was the temperature yesterday?
– Do your neighbors see snow too?
– Should you look at it first hand?
• It is similarly difficult to understand a single application performance
measurement without broader context. How do we differentiate
typical I/O climate from extreme I/O weather events?
Characterizing the I/O system
• We need a big picture view
• No lack of instrumentation
methods for system
components…
– but with divergent data formats,
resolutions, and scope
• This is the motivation for the
TOKIO (TOtal Knowledge of
I/O) project:
– Integrate, correlate, and analyze
I/O behavior from the system as a
whole for holistic understanding
Holistic I/O characterization
https://www.nersc.gov/research-and-development/tokio/
TOKIO Strategy
• Integrate existing best-in-class instrumentation tools with help from
vendors
• Index and query data sources in their native format
– Infrastructure to align and link data sets
– Adapters/parsers to produce coherent views on demand
• Develop integration and analysis methods
• Produce tools that share a common interface and data format
– Correlation, data mining, dashboards, etc.
The TOKIO project is a collaboration between LBL and ANL
PI: Nick Wright (LBL), Collaborators: Suren Byna, Glenn Lockwood,
William Yoo, Prabhat, Jialin Liu (LBL) Phil Carns, Shane Snyder, Kevin
Harms, Zach Nault, Matthieu Dorier, Rob Ross (ANL)
UMAMI example
TOKIO Unified Measurements And Metrics Interface
UMAMI is a pluggable dashboard that displays the
I/O performance of an application in context with
system telemetry and historical records
Each metric is shown
in a separate row
Historical samples (for a
given application) are
plotted over time
Box plots relate current
values to overall
variance
(figures courtesy of Glenn Lockwood, NERSC)
UMAMI example
TOKIO Unified Measurements And Metrics Interface
System background
load is typical
Performance for this job
is higher than usual
Server CPU load is low
after a long-term steady
climb
Corresponds to data
purge that freed up disk
blocks
Broader contextual clues simplify interpretation of
unusual performance measurements
Hands on exercises
https://xgitlab.cels.anl.gov/ATPESC-IO/hands-on-2017
• There are hands-on exercises available for you to try out during the
day or in tonight’s session
– Demonstrates running applications and analyzing I/O on Theta
– Try some examples and see if you can find the I/O problem!
• We can also answer questions about your own applications
– Try it on Theta, Mira, Cetus, Vesta, Cori, Edison, or Titan
– (note: the Mira, Vesta, and Cetus Darshan versions are a little
older and will differ slightly in details from this presentation)
Next up!
• This presentation covered how to evaluate I/O and tune your
application.
• The next presentation will walk through the HDF5 data management
library.

Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 

Recently uploaded (20)

"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 

HPC I/O for Computational Scientists

  • 6. ATPESC 2017, July 30 – August 11, 20176 Darshan design principles • The Darshan run time library is inserted at link time (for static executables) or at run time (for dynamic executables) • Transparent wrappers for I/O functions collect per-file statistics • Statistics are stored in bounded memory at each rank • At shutdown time: – Collective reduction to merge shared file records – Parallel compression – Collective write to a single log file • No communication or storage operations until shutdown • Command-line tools are used to post-process log files 6
  • 7. ATPESC 2017, July 30 – August 11, 20177 JOB analysis example Example: darshan-job-summary.pl produces a 3-page PDF file summarizing various aspects of I/O performance: estimated performance, percentage of runtime spent in I/O, access size histogram, access type histograms, and file usage
  • 8. ATPESC 2017, July 30 – August 11, 20178 SYSTEM analysis example • With a sufficient archive of performance statistics, we can develop heuristics to detect anomalous behavior 8 This example highlights large jobs that spent a disproportionate amount of time managing file metadata rather than performing raw data transfer Worst offender spent 99% of I/O time in open/close/stat/seek This identification process is not yet automated; alerts/triggers are needed in future work for greater impact Example of heuristics applied to a population of production jobs on the Hopper system in 2013: Carns et al., “Production I/O Characterization on the Cray XE6,” In Proceedings of the Cray User Group meeting 2013 (CUG 2013).
  • 9. ATPESC 2017, July 30 – August 11, 20179 Performance: function wrapping overhead What is the cost of interposing Darshan I/O instrumentation wrappers? • To test, we compare observed I/O time of an IOR configuration linked against different Darshan versions on Edison • File-per-process workload, 6,000 processes, over 12 million instrumented calls Type of Darshan builds now deployed on Theta and Cori Why the box plots? Recall observation from this morning that variability is a constant theme in HPC I/O today. (note that the Y axis labels start at 40) Snyder et al. Modular HPC I/O Characterization with Darshan. In Proceedings of 5th Workshop on Extreme-scale Programming Tools (ESPT 2016), 2016.
  • 10. ATPESC 2017, July 30 – August 11, 201710 Performance: shutdown overhead • Involves aggregating, compressing, and collectively writing I/O data records • To test, synthetic workloads are injected into Darshan and the resulting shutdown time is measured on Edison • Single shared file: near-constant shutdown time of ~100 ms in all cases • File-per-process: shutdown time scales linearly with job size, adding 5-6 s of shutdown time at 12,000 files
  • 11. USING DARSHAN IN PRACTICE
  • 12. ATPESC 2017, July 30 – August 11, 201712 Typical deployment and usage • Darshan usage on Mira, Cetus, Vesta, Theta, Cori, or Edison, abridged: – Run your job – If the job calls MPI_Finalize(), the log will be stored in DARSHAN_LOG_DIR/month/day/ – Theta: /lus/theta-fs0/logs/darshan/theta – Use the tools (next slides) to interpret the log • On Titan: “module load darshan” first • Links to documentation with details will be given at the end of this presentation 12
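A minimal sketch of that log lookup; the Theta base path is from the slide, while the month/day layout and the username-prefixed filename pattern are assumptions about the deployment — check your site's documentation:

```shell
# Base directory for Darshan logs on Theta (from the slide); other sites
# substitute their own DARSHAN_LOG_DIR.
LOGDIR=/lus/theta-fs0/logs/darshan/theta
# Logs are filed under month/day; filenames typically embed the username.
# (Date layout and filename pattern are assumptions, not guaranteed.)
find "$LOGDIR/$(date +%-m/%-d)" -name "${USER}_*" 2>/dev/null || true
```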
  • 13. ATPESC 2017, July 30 – August 11, 201713 Generating job summaries • Run the job and find its log file: the job id identifies the corresponding log file in today's directory • Copy the log files out to save them, list them, load the “latex” module (if needed), and generate the PDF summary 13
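On the command line those steps might look like the following; the job id, log directory, and filename pattern are illustrative placeholders, and darshan-job-summary.pl is the tool named on the slide:

```shell
# Copy the log out of the system log directory to somewhere writable.
# (LOGDIR, JOBID, and the filename pattern are hypothetical examples.)
LOGDIR=/lus/theta-fs0/logs/darshan/theta
JOBID=1234567
cp "$LOGDIR"/*id${JOBID}*.darshan . 2>/dev/null || true

# Load LaTeX if the site requires it for PDF generation, then summarize.
# Guarded so the sketch degrades gracefully where the tools are absent.
command -v module >/dev/null 2>&1 && module load latex || true
if command -v darshan-job-summary.pl >/dev/null 2>&1; then
    darshan-job-summary.pl ./*.darshan    # emits the 3-page PDF summary
fi
```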
  • 14. ATPESC 2017, July 30 – August 11, 201714 First page of summary 14 Common questions: • Did I spend much time performing IO? • What were the access sizes? • How many files were opened, and how big were they?
  • 15. ATPESC 2017, July 30 – August 11, 201715 Second page of summary (excerpt) 15 Common questions: • Where in the timeline of the execution did each rank do I/O? There are additional graphs in the PDF file with increasingly detailed information. You can also dump all data from the log in text format using “darshan-parser”.
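A hedged sketch of that text dump; the log filename is a placeholder, and the counter name shown is one example of the POSIX counters Darshan records:

```shell
# darshan-parser prints every counter in the log as text (module, rank,
# record, counter name, value, file name). Guarded in case the tool is
# absent on this machine; the log filename is illustrative.
LOG=mylog.darshan
if command -v darshan-parser >/dev/null 2>&1; then
    darshan-parser "$LOG" | head -n 20
    # Filter for a single counter of interest, e.g. bytes written:
    darshan-parser "$LOG" | grep POSIX_BYTES_WRITTEN
fi
```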
  • 16. TIPS AND TRICKS: ENABLING ADDITIONAL DATA CAPTURE
  • 17. ATPESC 2017, July 30 – August 11, 201717 What if you are doing shared-file IO? 17 Your timeline might look like this No per-process information available because the data was aggregated by Darshan to save space/overhead Is that important? It depends on what you need to learn about your application. – It may be interesting for applications that access the same file in distinct phases over time
  • 18. ATPESC 2017, July 30 – August 11, 201718 What if you are doing shared-file IO? 18 Set environment variable to disable shared file reductions Increases overhead and log file size, but provides per-rank info even on shared files
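The environment variable in question is DARSHAN_DISABLE_SHARED_REDUCTION; set it in the job environment before launch (the launcher line below is illustrative, not site-specific guidance):

```shell
# Disable Darshan's collective reduction of shared-file records so that
# per-rank counters are retained even for files opened by every process.
# This grows the log and adds shutdown overhead, as the slide notes.
export DARSHAN_DISABLE_SHARED_REDUCTION=1
# Then launch as usual, e.g. (hypothetical launcher invocation):
#   aprun -n 512 ./my_app
```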
  • 19. ATPESC 2017, July 30 – August 11, 201719 Detailed trace data 19 Set environment variable to enable “DXT” tracing This causes additional overhead and larger files, but captures precise access data Parse trace with “darshan-dxt-parser” Feature contributed by Cong Xu and Intel’s High Performance Data Division Cong Xu et. al, "DXT: Darshan eXtended Tracing", Cray User Group Conference 2017
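The DXT tracing described above is enabled the same way, via an environment variable; the parse step uses darshan-dxt-parser as named on the slide (log filename illustrative):

```shell
# Enable Darshan eXtended Tracing (DXT): individual read/write operations
# are recorded with rank, offset, size, and timestamps, at the cost of
# extra overhead and larger log files.
export DXT_ENABLE_IO_TRACE=1
# After the job completes, extract the trace as text:
#   darshan-dxt-parser mylog.darshan > trace.txt   # filename illustrative
```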
  • 21. ATPESC 2017, July 30 – August 11, 201721 What’s new? Modularized instrumentation • Frequently asked question: Can I add instrumentation for X? • Darshan has been re-architected as a modular framework to help facilitate this, starting in v3.0 21 Snyder et al. Modular HPC I/O Characterization with Darshan. In Proceedings of 5th Workshop on Extreme- scale Programming Tools (ESPT 2016), 2016. Self-describing log format
  • 22. ATPESC 2017, July 30 – August 11, 201722 Darshan Module example • We are using the modular framework to integrate more data sources and simplify the connections between various components in the stack • This is a good way for collaborators to get involved in Darshan development 22
  • 23. ATPESC 2017, July 30 – August 11, 201723 The need for HOLISTIC characterization • We’ve used Darshan to improve application productivity through case studies, application tuning, and user education • ... But challenges remain: – What other factors influence performance? – What if the problem is beyond a user’s control? – The user population evolves over time; how do we stay engaged? 23
  • 24. ATPESC 2017, July 30 – August 11, 201724 “I observed performance XYZ. Now what?” • A climate vs. weather analogy: It is snowing in Atlanta, Georgia. Is that normal? • You need context to know: – Does it ever snow there? – What time of year is it? – What was the temperature yesterday? – Do your neighbors see snow too? – Should you look at it first hand? • It is similarly difficult to understand a single application performance measurement without broader context. How do we differentiate typical I/O climate from extreme I/O weather events? 24 + = ?
  • 25. ATPESC 2017, July 30 – August 11, 201725 Characterizing the I/O system • We need a big picture view • No lack of instrumentation methods for system components… – but with divergent data formats, resolutions, and scope 25
  • 26. ATPESC 2017, July 30 – August 11, 201726 Characterizing the I/O system • We need a big picture view • No lack of instrumentation methods for system components… – but with wildly divergent data formats, resolutions, and scope • This is the motivation for the TOKIO (TOtal Knowledge of I/O) project: – Integrate, correlate, and analyze I/O behavior from the system as a whole for holistic understanding 26 Holistic I/O characterization https://www.nersc.gov/research-and-development/tokio/
  • 27. ATPESC 2017, July 30 – August 11, 201727 TOKIO Strategy • Integrate existing best-in-class instrumentation tools with help from vendors • Index and query data sources in their native format – Infrastructure to align and link data sets – Adapters/parsers to produce coherent views on demand • Develop integration and analysis methods • Produce tools that share a common interface and data format – Correlation, data mining, dashboards, etc. 27 The TOKIO project is a collaboration between LBL and ANL PI: Nick Wright (LBL), Collaborators: Suren Byna, Glenn Lockwood, William Yoo, Prabhat, Jialin Liu (LBL) Phil Carns, Shane Snyder, Kevin Harms, Zach Nault, Matthieu Dorier, Rob Ross (ANL)
  • 28. ATPESC 2017, July 30 – August 11, 201728 UMAMI example TOKIO Unified Measurements And Metrics Interface 28 UMAMI is a pluggable dashboard that displays the I/O performance of an application in context with system telemetry and historical records Each metric is shown in a separate row Historical samples (for a given application) are plotted over time Box plots relate current values to overall variance (figures courtesy of Glenn Lockwood, NERSC)
  • 29. ATPESC 2017, July 30 – August 11, 201729 UMAMI example TOKIO Unified Measurements And Metrics Interface 29 System background load is typical Performance for this job is higher than usual Server CPU load is low after a long-term steady climb Corresponds to data purge that freed up disk blocks Broader contextual clues simplify interpretation of unusual performance measurements
  • 30. ATPESC 2017, July 30 – August 11, 201730 Hands on exercises https://xgitlab.cels.anl.gov/ATPESC-IO/hands-on-2017 • There are hands-on exercises available for you to try out during the day or in tonight’s session – Demonstrates running applications and analyzing I/O on Theta – Try some examples and see if you can find the I/O problem! • We can also answer questions about your own applications – Try it on Theta, Mira, Cetus, Vesta, Cori, Edison, or Titan – (note: the Mira, Vesta, and Cetus Darshan versions are a little older and will differ slightly in details from this presentation) 30
  • 31. ATPESC 2017, July 30 – August 11, 201731 Next up! • This presentation covered how to evaluate I/O and tune your application. • The next presentation will walk through the HDF5 data management library.