SlideShare a Scribd company logo
Sameer Shende
University of Oregon
Friday, March 15, 2019, 4pm – 4:45pm
Xilinx
2100 All Programmable Dr., San Jose, CA
Accelerating AI Applications
From data to
information
Time to
Analyse and
prepare
Time to
learn
From knowledge
to action/reaction
Time to
reason & conclude
Image
&Video
Voice&
Sound
Text
Sensor
The five phases of AI/DL
- From data collection to action -
From information
to knowledge
Time to
Collect
and filter
Big Data, PB Large Data, TB Large Data, TB Small Data, GB, MB
Slide credit: Ulrich Walter, IBM Germany
Attributes and Considerations
Volume
Velocity
xB/s IO/s
Veracity
Variability
Value
Interval
103
106
109
Challenges on storage for analytics and AI/DL data
Phases
TRUST
TB PB EB
Data
Slide credit: Ulrich Walter, IBM Germany
IBM Power 9 AC922
Inferenc
e
The five phases of DL/AI in the data Supply Chain – technical view
Image&
Video
Voice&S
ound
Detect and
Collect
Data
preproce
ssing
Compress/Map
Reduce
Tag/Aggregate
Knowledge Base
Learn
Distributed Deep
Learning
Comparison and
intrepretation
Combine
Conclude/Reaso
n
ComInt, ELInt,
SigInt
Text
Sensor
Platforms
FPGA
Applications
Appliances
CAPI Coherent
Accellerator
Processor
Interface for Data
Disk Storage Rich ServersFlash + NVMe
NVME Flash
High I/O
Storage
Data locality and
low latency
IBM Spectrum Family and other storage options
Applied Knowledge
Object Store Tape & LTFS
Data Creation
Data
preparation
Transformation
and validation
Compute
Storage intensive Compute intensive
Bandwidth Storage-RAM-CPU-GPU
Compute Performance
Latency
I/O throughput
Data Locality
Memory size
Device type
Latency
Volume
Velocity
Intervall
Interface
Data source
Data model
Data Type
API
Originator
Veracity
Storage
Bandwidth
Latency
Capacity
I/O Blocksize
Filesystem
Storage Type
Database
Toolchain
Analytics
Frameworks
Languages
Verified data
Volume
Storage Type
Filesystem
Toolchain
CNN
Languages
Database
Database
Toolchain
Languages
Value
Shared PCIe-4
For IB
P9 P9
CPU-GPU
NVLINK 2.0
150 GB
Data selection
Control,
loopback and
update
Analyse
Slide credit: Ulrich Walter, IBM Germany
TAU Performance System
®
http://tau.uoregon.edu
• Tuning and Analysis Utilities (20+ year project)
• Comprehensive performance profiling and tracing
• Integrated, scalable, flexible, portable
• Targets all parallel programming/execution paradigms
• Integrated performance toolkit
• Instrumentation, measurement, analysis, visualization
• Widely-ported performance profiling / tracing system
• Performance data management and data mining
• Open source (BSD-style license)
• Integrates with application frameworks
TAU Performance System®
Performance Engineering using TAU
• How much time is spent in each application routine and outer loops? Within loops, what
is the contribution of each statement? What is the time spent in OpenMP loops?
Kernels on the GPU?
• How many instructions are executed in these code regions?
Floating point, Level 1 and 2 data cache misses, hits, branches taken?
• What is the memory usage of the code? When and where is memory allocated/de-
allocated? Are there any memory leaks? What is the memory footprint of the
application? What is the memory high water mark?
• How much energy does the application use in Joules? What is the peak power usage?
• What are the I/O characteristics of the code? What is the peak read and write
bandwidth of individual calls, total volume?
• How does the application scale? What is the efficiency, runtime breakdown of
performance across different core counts?
Instrumentation
• Source instrumentation using a preprocessor
• Add timer start/stop calls in a copy of the source code.
• Use Program Database Toolkit (PDT) for parsing source code.
• Requires recompiling the code using TAU shell scripts (tau_cc.sh, tau_f90.sh)
• Selective instrumentation (filter file) can reduce runtime overhead and narrow
instrumentation focus.
• Compiler-based instrumentation
• Use system compiler to add a special flag to insert hooks at routine entry/exit.
• Requires recompiling using TAU compiler scripts (tau_cc.sh, tau_f90.sh…)
• Runtime preloading of TAU’s Dynamic Shared Object (DSO)
• No need to recompile code! Use tau_exec ./app with options.
Profiling and Tracing
• Tracing shows you when the events
take place on a timeline
Profiling Tracing
• Profiling shows you how much
(total) time was spent in each routine
• Profiling and tracing
Profiling shows you how much (total) time was spent in each routine
Tracing shows you when the events take place on a timeline
Inclusive vs. Exclusive values■ Inclusive
■ Information of all sub-elements aggregated into single value
■ Exclusive
■ Information cannot be subdivided further
Inclusive Exclusive
int foo()
{
int a;
a = 1 + 1;
bar();
a = a + 1;
return a;
}
Performance Data Measurement
Direct via Probes Indirect via Sampling
• Exact measurement
• Fine-grain control
• Calls inserted into
code
• No code modification
• Minimal effort
• Relies on debug
symbols (-g)
Call START(‘potential’)
// code
Call STOP(‘potential’)
Sampling
• Running program is periodically interrupted to take
measurement
• Timer interrupt, OS signal, or HWC overflow
• Service routine examines return-address stack
• Addresses are mapped to routines using symbol table
information
• Statistical inference of program behavior
• Not very detailed information on highly volatile metrics
• Requires long-running applications
• Works with unmodified executables
Time
main foo(0) foo(1) foo(2) int main()
{
int i;
for (i=0; i < 3; i++)
foo(i);
return 0;
}
void foo(int i)
{
if (i > 0)
foo(i – 1);
}
Measurement
t9
t7
t6
t5
t4
t1
t2
t3
t8
Instrumentation
• Measurement code is inserted such that every
event of interest is captured directly
• Can be done in various ways
• Advantage:
• Much more detailed information
• Disadvantage:
• Processing of source-code / executable
necessary
• Large relative overheads for small functions
Time
Measurement int main()
{
int i;
for (i=0; i < 3; i++)
foo(i);
return 0;
}
void foo(int i)
{
if (i > 0)
foo(i – 1);
}
Time
t1
t2
t3
t4
t5
t6
t7
t8
t9
t10
t11
t12
t13
t14
main foo(0) foo(1) foo(2)
Start(“main”);
Stop (“main”);
Start(“foo”);
Stop (“foo”);
Using TAU’s Runtime Preloading
Tool: tau_exec
• Preload a wrapper that intercepts the runtime system call and
substitutes with another
•MPI, CUDA, OpenACC, pthread, OpenCL
•OpenMP
•POSIX I/O
•Memory allocation/deallocation routines
•Wrapper library for an external package
• No modification to the binary executable!
• Enable other TAU options (communication matrix, OTF2, event-
based sampling)
• For Python: tau_python replaces python. Similar to tau_exec.
TAU Execution Command (tau_exec)
•Uninstrumented execution
• % ./a.out
•Track GPU operations
• % tau_exec –cupti ./a.out
• % tau_exec –cupti -um ./a.out (for Unified Memory)
• % tau_exec –opencl ./a.out
• % tau_exec –openacc ./a.out
• % tau_exec –rocm ./a.out
•Track MPI performance
• % tau_exec ./a.out
•Track OpenMP, and MPI performance (MPI enabled by default)
• % tau_exec –T ompt,tr6,mpi –ompt ./a.out
•Track I/O operations
• % tau_exec –io –T pthread ./a.out
•Use event based sampling (compile with –g)
• % tau_exec –ebs ./a.out
• Also –ebs_source=<PAPI_COUNTER> -ebs_period=<overflow_count>
• Also –ebs_resolution=[file | function | line ]
Configuration tags for tau_exec
% ./configure –pdt=<dir> -mpi –papi=<dir>; make install
Creates in $TAU:
Makefile.tau-papi-mpi-pdt(Configuration parameters in stub makefile)
shared-papi-mpi-pdt/libTAU.so
% ./configure –pdt=<dir> -mpi; make install creates
Makefile.tau-mpi-pdt
shared-mpi-pdt/libTAU.so
To explicitly choose preloading of shared-<options>/libTAU.so change:
% mpirun -np 256 ./a.out to
% mpirun -np 256 tau_exec –T <comma_separated_options> ./a.out
% mpirun -np 256 tau_exec –T papi,mpi,pdt ./a.out
Preloads $TAU/shared-papi-mpi-pdt/libTAU.so
% mpirun -np 256 tau_exec –T papi ./a.out
Preloads $TAU/shared-papi-mpi-pdt/libTAU.so by matching.
% aprun –n 256 tau_exec –T papi,mpi,pdt –s ./a.out
Does not execute the program. Just displays the library that it will preload if executed without the –s option.
NOTE: -mpi configuration is selected by default. Use –T serial for
Sequential programs.
ParaProf Profile Browser
% paraprof
ParaProf Profile Browser
ParaProf 3D Window
Profiling MNIST dataset with IBM
PowerAI on CTE-POWER with TAU
• AI program to recognize digits
% ssh –Y plogin1.bsc.es
% source /home/nct00/nct00005/tau.bashrc
% cp /home/nct00/nct00005/apps/tf_mnist.tgz . ; tar zxf tf_mninst.tgz; cd tf_mnist
% module load powerAI
% source /opt/DL/tensorflow/bin/tensorflow-activate
% cat r
python ./classify_mnist.py sample-images/img_8.jpg
% cat rt
/bin/rm -rf prof*
export TAU_CALLPATH_DEPTH=100
tau_python -T cupti,serial -cupti -ebs ./classify_mnist.py sample-images/img_8.jpg
% ./r
% ./rt
% paraprof
Mnist PowerAI example
TAU ParaProf Manager Window
ParaProf context event window
Context Event Window
ParaProf thread statistics window for
Mnist using IBM PowerAI
Tensorflow mnist example with
tau_python
TAU ParaProf
ParaProf: Thread Statistics Window
TAU: Tracking Data Transfers
Thread Profile on Thread 341
ParaProf Manager Window
ParaProf Node Window
Thread Statistics Window: Caffe
Googlenet
MPI_T: TAU and MVAPICH2 (OSU)
integration
TAU: Autotuning MVAPICH2 MPI_T
CVARs
Vampir [TU Dresden, Germany] showing
improvement: Before and After
TAU’s Static Analysis System:
Program Database Toolkit (PDT)
Application
/ Library
C / C++
parser
Fortran parser
F77/90/95
C / C++
IL analyzer
Fortran
IL analyzer
Program
Database
Files
IL IL
DUCTAPE
TAU
instrumentor
Automatic source
instrumentation
.
.
.
tau_instrumentor
Parsed
program
Instrumentation
specification file
Instrumented
copy of source
TAU source
analyzer
Application
source
PDT: automatic source instrumentation
FUN3D: CFD
TAU
Environment Variable Default Description
TAU_TRACE 0 Setting to 1 turns on tracing
TAU_CALLPATH 0 Setting to 1 turns on callpath profiling
TAU_TRACK_MEMORY_FOOTPRINT 0 Setting to 1 turns on tracking memory usage by sampling periodically the resident set size and high water mark of memory
usage
TAU_TRACK_POWER 0 Tracks power usage by sampling periodically.
TAU_CALLPATH_DEPTH 2 Specifies depth of callpath. Setting to 0 generates no callpath or routine information, setting to 1 generates flat profile and
context events have just parent information (e.g., Heap Entry: foo)
TAU_SAMPLING 1 Setting to 1 enables event-based sampling.
TAU_TRACK_SIGNALS 0 Setting to 1 generate debugging callstack info when a program crashes
TAU_COMM_MATRIX 0 Setting to 1 generates communication matrix display using context events
TAU_THROTTLE 1 Setting to 0 turns off throttling. Throttles instrumentation in lightweight routines that are called frequently
TAU_THROTTLE_NUMCALLS 100000 Specifies the number of calls before testing for throttling
TAU_THROTTLE_PERCALL 10 Specifies value in microseconds. Throttle a routine if it is called over 100000 times and takes less than 10 usec of inclusive
time per call
TAU_CALLSITE 0 Setting to 1 enables callsite profiling that shows where an instrumented function was called. Also compatible with tracing.
TAU_PROFILE_FORMAT Profile Setting to “merged” generates a single file. “snapshot” generates xml format
TAU_METRICS TIME Setting to a comma separated list generates other metrics. (e.g.,
ENERGY,TIME,P_VIRTUAL_TIME,PAPI_FP_INS,PAPI_NATIVE_<event>:<subevent>)
Runtime Environment Variables
Environment Variable Default Description
TAU_TRACE 0 Setting to 1 turns on tracing
TAU_TRACE_FORMAT Default Setting to “otf2” turns on TAU’s native OTF2 trace generation (configure with –otf=download)
TAU_EBS_UNWIND 0 Setting to 1 turns on unwinding the callstack during sampling (use with tau_exec –ebs or TAU_SAMPLING=1)
TAU_EBS_RESOLUTION line Setting to “function” or “file” changes the sampling resolution to function or file level respectively.
TAU_TRACK_LOAD 0 Setting to 1 tracks system load on the node
TAU_SELECT_FILE Default Setting to a file name, enables selective instrumentation based on exclude/include lists specified in the file.
TAU_OMPT_SUPPORT_LEVEL basic Setting to “full” improves resolution of OMPT TR6 regions on threads 1.. N-1. Also, “lowoverhead” option is available.
TAU_OMPT_RESOLVE_ADDRESS_EAGERLY 0 Setting to 1 is necessary for event based sampling to resolve addresses with OMPT TR6 (-ompt=download-tr6)
Runtime Environment Variables
Environment Variable Default Description
TAU_TRACK_MEMORY_LEAKS 0 Tracks allocates that were not de-allocated (needs –optMemDbg or tau_exec –memory)
TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g.,
TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)
TAU_EBS_PERIOD 100000 Specifies the overflow count for interrupts
TAU_MEMDBG_ALLOC_MIN/MAX 0 Byte size minimum and maximum subject to bounds checking (used with TAU_MEMDBG_PROTECT_*)
TAU_MEMDBG_OVERHEAD 0 Specifies the number of bytes for TAU’s memory overhead for memory debugging.
TAU_MEMDBG_PROTECT_BELOW/ABOVE 0 Setting to 1 enables tracking runtime bounds checking below or above the array bounds (requires –
optMemDbg while building or tau_exec –memory)
TAU_MEMDBG_ZERO_MALLOC 0 Setting to 1 enables tracking zero byte allocations as invalid memory allocations.
TAU_MEMDBG_PROTECT_FREE 0 Setting to 1 detects invalid accesses to deallocated memory that should not be referenced until it is
reallocated (requires –optMemDbg or tau_exec –memory)
TAU_MEMDBG_ATTEMPT_CONTINUE 0 Setting to 1 allows TAU to record and continue execution when a memory error occurs at runtime.
TAU_MEMDBG_FILL_GAP Undefined Initial value for gap bytes
TAU_MEMDBG_ALINGMENT Sizeof(int) Byte alignment for memory allocations
TAU_EVENT_THRESHOLD 0.5 Define a threshold value (e.g., .25 is 25%) to trigger marker events for min/max
Runtime Environment Variables (contd.)
Download TAU from U. Oregon
http://www.hpclinux.com [OVA file]
http://tau.uoregon.edu
for more information
Free download, open source, BSD license
Performance Research Lab, UO, Eugene
www.uoregon.edu
Support Acknowledgments
• US Department of Energy (DOE)
• ANL
• Office of Science contracts, ECP
• SciDAC, LBL contracts
• LLNL-LANL-SNL ASC/NNSA contract
• Battelle, PNNL and ORNL contract
• Department of Defense (DoD)
• PETTT, HPCMP
• National Science Foundation (NSF)
• SI2-SSI, Glassbox
• NASA
• CEA, France
• IBM
• Partners:
• The Ohio State University
• ParaTools, Inc.
• University of Tennessee, Knoxville
• T.U. Dresden, GWT
• Jülich Supercomputing Center

More Related Content

What's hot

Multicore
MulticoreMulticore
Accelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the uglyAccelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the ugly
Intel IT Center
 
1 Vampir Overview
1 Vampir Overview1 Vampir Overview
1 Vampir Overview
PTIHPA
 
Survey of Program Transformation Technologies
Survey of Program Transformation TechnologiesSurvey of Program Transformation Technologies
Survey of Program Transformation Technologies
Chunhua Liao
 
A Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesA Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC Challenges
Chunhua Liao
 
Move Message Passing Interface Applications to the Next Level
Move Message Passing Interface Applications to the Next LevelMove Message Passing Interface Applications to the Next Level
Move Message Passing Interface Applications to the Next Level
Intel® Software
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
imec.archive
 
Return oriented programming
Return oriented programmingReturn oriented programming
Return oriented programming
hybr1s
 
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons LearnedOptimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Intel IT Center
 
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
MLconf
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Intel® Software
 
Cs4hs2008 track a-programming
Cs4hs2008 track a-programmingCs4hs2008 track a-programming
Cs4hs2008 track a-programming
Rashi Agarwal
 
es_hardware_handout
es_hardware_handoutes_hardware_handout
es_hardware_handout
Mohammad Ranjbar
 
TAU Performance tool using OpenPOWER
TAU Performance tool using OpenPOWERTAU Performance tool using OpenPOWER
TAU Performance tool using OpenPOWER
Ganesan Narayanasamy
 
fg.workshop: Software vulnerability
fg.workshop: Software vulnerabilityfg.workshop: Software vulnerability
fg.workshop: Software vulnerability
fg.informatik Universität Basel
 

What's hot (15)

Multicore
MulticoreMulticore
Multicore
 
Accelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the uglyAccelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the ugly
 
1 Vampir Overview
1 Vampir Overview1 Vampir Overview
1 Vampir Overview
 
Survey of Program Transformation Technologies
Survey of Program Transformation TechnologiesSurvey of Program Transformation Technologies
Survey of Program Transformation Technologies
 
A Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesA Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC Challenges
 
Move Message Passing Interface Applications to the Next Level
Move Message Passing Interface Applications to the Next LevelMove Message Passing Interface Applications to the Next Level
Move Message Passing Interface Applications to the Next Level
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Return oriented programming
Return oriented programmingReturn oriented programming
Return oriented programming
 
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons LearnedOptimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
Optimizing Commercial Software for Intel Xeon Coprocessors: Lessons Learned
 
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Cs4hs2008 track a-programming
Cs4hs2008 track a-programmingCs4hs2008 track a-programming
Cs4hs2008 track a-programming
 
es_hardware_handout
es_hardware_handoutes_hardware_handout
es_hardware_handout
 
TAU Performance tool using OpenPOWER
TAU Performance tool using OpenPOWERTAU Performance tool using OpenPOWER
TAU Performance tool using OpenPOWER
 
fg.workshop: Software vulnerability
fg.workshop: Software vulnerabilityfg.workshop: Software vulnerability
fg.workshop: Software vulnerability
 

Similar to TAU on Power 9

TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Ganesan Narayanasamy
 
Performance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SPerformance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4S
Ganesan Narayanasamy
 
Module 3-cpu-scheduling
Module 3-cpu-schedulingModule 3-cpu-scheduling
Module 3-cpu-scheduling
Hesham Elmasry
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
Steve Caron
 
Write a program in C or C++ which simulates CPU scheduling in an opera.pdf
Write a program in C or C++ which simulates CPU scheduling in an opera.pdfWrite a program in C or C++ which simulates CPU scheduling in an opera.pdf
Write a program in C or C++ which simulates CPU scheduling in an opera.pdf
sravi07
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
CODE BLUE
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
AppDynamics
 
Runtimeenvironment
RuntimeenvironmentRuntimeenvironment
Runtimeenvironment
Anusuya123
 
Introduction to InfluxDB and TICK Stack
Introduction to InfluxDB and TICK StackIntroduction to InfluxDB and TICK Stack
Introduction to InfluxDB and TICK Stack
Ahmed AbouZaid
 
Systems Programming Assignment Help - Processes
Systems Programming Assignment Help - ProcessesSystems Programming Assignment Help - Processes
Systems Programming Assignment Help - Processes
HelpWithAssignment.com
 
Tuga it 2017 - Event processing with Apache Storm
Tuga it 2017 - Event processing with Apache StormTuga it 2017 - Event processing with Apache Storm
Tuga it 2017 - Event processing with Apache Storm
Nuno Caneco
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
Jeff Larkin
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
Splunk
 
Monitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance AnalysisMonitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance Analysis
Brendan Gregg
 
1230 Rtf Final
1230 Rtf Final1230 Rtf Final
1230 Rtf Final
luisotaviomedici
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
George Markomanolis
 
SplunkLive! Presentation - Data Onboarding with Splunk
SplunkLive! Presentation - Data Onboarding with SplunkSplunkLive! Presentation - Data Onboarding with Splunk
SplunkLive! Presentation - Data Onboarding with Splunk
Splunk
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
Brendan Gregg
 
SDN Controller - Programming Challenges
SDN Controller - Programming ChallengesSDN Controller - Programming Challenges
SDN Controller - Programming Challenges
snrism
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CanSecWest
 

Similar to TAU on Power 9 (20)

TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Performance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SPerformance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4S
 
Module 3-cpu-scheduling
Module 3-cpu-schedulingModule 3-cpu-scheduling
Module 3-cpu-scheduling
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Write a program in C or C++ which simulates CPU scheduling in an opera.pdf
Write a program in C or C++ which simulates CPU scheduling in an opera.pdfWrite a program in C or C++ which simulates CPU scheduling in an opera.pdf
Write a program in C or C++ which simulates CPU scheduling in an opera.pdf
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
 
Runtimeenvironment
RuntimeenvironmentRuntimeenvironment
Runtimeenvironment
 
Introduction to InfluxDB and TICK Stack
Introduction to InfluxDB and TICK StackIntroduction to InfluxDB and TICK Stack
Introduction to InfluxDB and TICK Stack
 
Systems Programming Assignment Help - Processes
Systems Programming Assignment Help - ProcessesSystems Programming Assignment Help - Processes
Systems Programming Assignment Help - Processes
 
Tuga it 2017 - Event processing with Apache Storm
Tuga it 2017 - Event processing with Apache StormTuga it 2017 - Event processing with Apache Storm
Tuga it 2017 - Event processing with Apache Storm
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
 
Monitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance AnalysisMonitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance Analysis
 
1230 Rtf Final
1230 Rtf Final1230 Rtf Final
1230 Rtf Final
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
SplunkLive! Presentation - Data Onboarding with Splunk
SplunkLive! Presentation - Data Onboarding with SplunkSplunkLive! Presentation - Data Onboarding with Splunk
SplunkLive! Presentation - Data Onboarding with Splunk
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
SDN Controller - Programming Challenges
SDN Controller - Programming ChallengesSDN Controller - Programming Challenges
SDN Controller - Programming Challenges
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
Ganesan Narayanasamy
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
Ganesan Narayanasamy
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
Ganesan Narayanasamy
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
Ganesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
Ganesan Narayanasamy
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 

Recently uploaded

GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
Federico Razzoli
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 

Recently uploaded (20)

GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 

TAU on Power 9

  • 1. Sameer Shende University of Oregon Friday, March 15, 2019, 4pm – 4:45pm Xilinx 2100 All Programmable Dr., San Jose, CA Accelerating AI Applications
  • 2. From data to information Time to Analyse and prepare Time to learn From knowledge to action/reaction Time to reason & conclude Image &Video Voice& Sound Text Sensor The five phases of AI/DL - From data collection to action - From information to knowledge Time to Collect and filter Big Data, PB Large Data, TB Large Data, TB Small Data, GB, MB Slide credit: Ulrich Walter, IBM Germany
  • 3. Attributes and Considerations Volume Velocity xB/s IO/s Veracity Variability Value Interval 103 106 109 Challenges on storage for analytics and AI/DL data Phases TRUST TB PB EB Data Slide credit: Ulrich Walter, IBM Germany
  • 4. IBM Power 9 AC922 Inferenc e The five phases of DL/AI in the data Supply Chain – technical view Image& Video Voice&S ound Detect and Collect Data preproce ssing Compress/Map Reduce Tag/Aggregate Knowledge Base Learn Distributed Deep Learning Comparison and intrepretation Combine Conclude/Reaso n ComInt, ELInt, SigInt Text Sensor Platforms FPGA Applications Appliances CAPI Coherent Accellerator Processor Interface for Data Disk Storage Rich ServersFlash + NVMe NVME Flash High I/O Storage Data locality and low latency IBM Spectrum Family and other storage options Applied Knowledge Object Store Tape & LTFS Data Creation Data preparation Transformation and validation Compute Storage intensive Compute intensive Bandwidth Storage-RAM-CPU-GPU Compute Performance Latency I/O throughput Data Locality Memory size Device type Latency Volume Velocity Intervall Interface Data source Data model Data Type API Originator Veracity Storage Bandwidth Latency Capacity I/O Blocksize Filesystem Storage Type Database Toolchain Analytics Frameworks Languages Verified data Volume Storage Type Filesystem Toolchain CNN Languages Database Database Toolchain Languages Value Shared PCIe-4 For IB P9 P9 CPU-GPU NVLINK 2.0 150 GB Data selection Control, loopback and update Analyse Slide credit: Ulrich Walter, IBM Germany
  • 5. TAU Performance System ® http://tau.uoregon.edu • Tuning and Analysis Utilities (20+ year project) • Comprehensive performance profiling and tracing • Integrated, scalable, flexible, portable • Targets all parallel programming/execution paradigms • Integrated performance toolkit • Instrumentation, measurement, analysis, visualization • Widely-ported performance profiling / tracing system • Performance data management and data mining • Open source (BSD-style license) • Integrates with application frameworks
  • 7. Performance Engineering using TAU • How much time is spent in each application routine and outer loops? Within loops, what is the contribution of each statement? What is the time spent in OpenMP loops? Kernels on the GPU? • How many instructions are executed in these code regions? Floating point, Level 1 and 2 data cache misses, hits, branches taken? • What is the memory usage of the code? When and where is memory allocated/de- allocated? Are there any memory leaks? What is the memory footprint of the application? What is the memory high water mark? • How much energy does the application use in Joules? What is the peak power usage? • What are the I/O characteristics of the code? What is the peak read and write bandwidth of individual calls, total volume? • How does the application scale? What is the efficiency, runtime breakdown of performance across different core counts?
  • 8. Instrumentation • Source instrumentation using a preprocessor • Add timer start/stop calls in a copy of the source code. • Use Program Database Toolkit (PDT) for parsing source code. • Requires recompiling the code using TAU shell scripts (tau_cc.sh, tau_f90.sh) • Selective instrumentation (filter file) can reduce runtime overhead and narrow instrumentation focus. • Compiler-based instrumentation • Use system compiler to add a special flag to insert hooks at routine entry/exit. • Requires recompiling using TAU compiler scripts (tau_cc.sh, tau_f90.sh…) • Runtime preloading of TAU’s Dynamic Shared Object (DSO) • No need to recompile code! Use tau_exec ./app with options.
  • 9. Profiling and Tracing • Tracing shows you when the events take place on a timeline Profiling Tracing • Profiling shows you how much (total) time was spent in each routine • Profiling and tracing Profiling shows you how much (total) time was spent in each routine Tracing shows you when the events take place on a timeline
  • 10. Inclusive vs. Exclusive values■ Inclusive ■ Information of all sub-elements aggregated into single value ■ Exclusive ■ Information cannot be subdivided further Inclusive Exclusive int foo() { int a; a = 1 + 1; bar(); a = a + 1; return a; }
  • 11. Performance Data Measurement Direct via Probes Indirect via Sampling • Exact measurement • Fine-grain control • Calls inserted into code • No code modification • Minimal effort • Relies on debug symbols (-g) Call START(‘potential’) // code Call STOP(‘potential’)
  • 12. Sampling • Running program is periodically interrupted to take measurement • Timer interrupt, OS signal, or HWC overflow • Service routine examines return-address stack • Addresses are mapped to routines using symbol table information • Statistical inference of program behavior • Not very detailed information on highly volatile metrics • Requires long-running applications • Works with unmodified executables Time main foo(0) foo(1) foo(2) int main() { int i; for (i=0; i < 3; i++) foo(i); return 0; } void foo(int i) { if (i > 0) foo(i – 1); } Measurement t9 t7 t6 t5 t4 t1 t2 t3 t8
  • 13. Instrumentation • Measurement code is inserted such that every event of interest is captured directly • Can be done in various ways • Advantage: • Much more detailed information • Disadvantage: • Processing of source-code / executable necessary • Large relative overheads for small functions Time Measurement int main() { int i; for (i=0; i < 3; i++) foo(i); return 0; } void foo(int i) { if (i > 0) foo(i – 1); } Time t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 main foo(0) foo(1) foo(2) Start(“main”); Stop (“main”); Start(“foo”); Stop (“foo”);
  • 14. Using TAU’s Runtime Preloading Tool: tau_exec • Preload a wrapper that intercepts the runtime system call and substitutes with another •MPI, CUDA, OpenACC, pthread, OpenCL •OpenMP •POSIX I/O •Memory allocation/deallocation routines •Wrapper library for an external package • No modification to the binary executable! • Enable other TAU options (communication matrix, OTF2, event- based sampling) • For Python: tau_python replaces python. Similar to tau_exec.
  • 15. TAU Execution Command (tau_exec) •Uninstrumented execution • % ./a.out •Track GPU operations • % tau_exec –cupti ./a.out • % tau_exec –cupti -um ./a.out (for Unified Memory) • % tau_exec –opencl ./a.out • % tau_exec –openacc ./a.out • % tau_exec –rocm ./a.out •Track MPI performance • % tau_exec ./a.out •Track OpenMP, and MPI performance (MPI enabled by default) • % tau_exec –T ompt,tr6,mpi –ompt ./a.out •Track I/O operations • % tau_exec –io –T pthread ./a.out •Use event based sampling (compile with –g) • % tau_exec –ebs ./a.out • Also –ebs_source=<PAPI_COUNTER> -ebs_period=<overflow_count> • Also –ebs_resolution=[file | function | line ]
  • 16. Configuration tags for tau_exec % ./configure –pdt=<dir> -mpi –papi=<dir>; make install Creates in $TAU: Makefile.tau-papi-mpi-pdt(Configuration parameters in stub makefile) shared-papi-mpi-pdt/libTAU.so % ./configure –pdt=<dir> -mpi; make install creates Makefile.tau-mpi-pdt shared-mpi-pdt/libTAU.so To explicitly choose preloading of shared-<options>/libTAU.so change: % mpirun -np 256 ./a.out to % mpirun -np 256 tau_exec –T <comma_separated_options> ./a.out % mpirun -np 256 tau_exec –T papi,mpi,pdt ./a.out Preloads $TAU/shared-papi-mpi-pdt/libTAU.so % mpirun -np 256 tau_exec –T papi ./a.out Preloads $TAU/shared-papi-mpi-pdt/libTAU.so by matching. % aprun –n 256 tau_exec –T papi,mpi,pdt –s ./a.out Does not execute the program. Just displays the library that it will preload if executed without the –s option. NOTE: -mpi configuration is selected by default. Use –T serial for Sequential programs.
  • 20. Profiling MNIST dataset with IBM PowerAI on CTE-POWER with TAU • AI program to recognize digits % ssh –Y plogin1.bsc.es % source /home/nct00/nct00005/tau.bashrc % cp /home/nct00/nct00005/apps/tf_mnist.tgz . ; tar zxf tf_mninst.tgz; cd tf_mnist % module load powerAI % source /opt/DL/tensorflow/bin/tensorflow-activate % cat r python ./classify_mnist.py sample-images/img_8.jpg % cat rt /bin/rm -rf prof* export TAU_CALLPATH_DEPTH=100 tau_python -T cupti,serial -cupti -ebs ./classify_mnist.py sample-images/img_8.jpg % ./r % ./rt % paraprof
  • 21. Mnist PowerAI example TAU ParaProf Manager Window
  • 22. ParaProf context event window Context Event Window
  • 23. ParaProf thread statistics window for Mnist using IBM PowerAI
  • 24. Tensorflow mnist example with tau_python
  • 27. TAU: Tracking Data Transfers
  • 28. Thread Profile on Thread 341
  • 31. Thread Statistics Window: Caffe Googlenet
  • 32. MPI_T: TAU and MVAPICH2 (OSU) integration
  • 34. Vampir [TU Dresden, Germany] showing improvement: Before and After
  • 35. TAU’s Static Analysis System: Program Database Toolkit (PDT) Application / Library C / C++ parser Fortran parser F77/90/95 C / C++ IL analyzer Fortran IL analyzer Program Database Files IL IL DUCTAPE TAU instrumentor Automatic source instrumentation . . .
  • 36. tau_instrumentor Parsed program Instrumentation specification file Instrumented copy of source TAU source analyzer Application source PDT: automatic source instrumentation
  • 38. TAU
  • 39. Environment Variable Default Description TAU_TRACE 0 Setting to 1 turns on tracing TAU_CALLPATH 0 Setting to 1 turns on callpath profiling TAU_TRACK_MEMORY_FOOTPRINT 0 Setting to 1 turns on tracking memory usage by sampling periodically the resident set size and high water mark of memory usage TAU_TRACK_POWER 0 Tracks power usage by sampling periodically. TAU_CALLPATH_DEPTH 2 Specifies depth of callpath. Setting to 0 generates no callpath or routine information, setting to 1 generates flat profile and context events have just parent information (e.g., Heap Entry: foo) TAU_SAMPLING 1 Setting to 1 enables event-based sampling. TAU_TRACK_SIGNALS 0 Setting to 1 generate debugging callstack info when a program crashes TAU_COMM_MATRIX 0 Setting to 1 generates communication matrix display using context events TAU_THROTTLE 1 Setting to 0 turns off throttling. Throttles instrumentation in lightweight routines that are called frequently TAU_THROTTLE_NUMCALLS 100000 Specifies the number of calls before testing for throttling TAU_THROTTLE_PERCALL 10 Specifies value in microseconds. Throttle a routine if it is called over 100000 times and takes less than 10 usec of inclusive time per call TAU_CALLSITE 0 Setting to 1 enables callsite profiling that shows where an instrumented function was called. Also compatible with tracing. TAU_PROFILE_FORMAT Profile Setting to “merged” generates a single file. “snapshot” generates xml format TAU_METRICS TIME Setting to a comma separated list generates other metrics. (e.g., ENERGY,TIME,P_VIRTUAL_TIME,PAPI_FP_INS,PAPI_NATIVE_<event>:<subevent>) Runtime Environment Variables
  • 40. Environment Variable Default Description TAU_TRACE 0 Setting to 1 turns on tracing TAU_TRACE_FORMAT Default Setting to “otf2” turns on TAU’s native OTF2 trace generation (configure with –otf=download) TAU_EBS_UNWIND 0 Setting to 1 turns on unwinding the callstack during sampling (use with tau_exec –ebs or TAU_SAMPLING=1) TAU_EBS_RESOLUTION line Setting to “function” or “file” changes the sampling resolution to function or file level respectively. TAU_TRACK_LOAD 0 Setting to 1 tracks system load on the node TAU_SELECT_FILE Default Setting to a file name, enables selective instrumentation based on exclude/include lists specified in the file. TAU_OMPT_SUPPORT_LEVEL basic Setting to “full” improves resolution of OMPT TR6 regions on threads 1.. N-1. Also, “lowoverhead” option is available. TAU_OMPT_RESOLVE_ADDRESS_EAGERLY 0 Setting to 1 is necessary for event based sampling to resolve addresses with OMPT TR6 (-ompt=download-tr6) Runtime Environment Variables
  • 41. Environment Variable Default Description TAU_TRACK_MEMORY_LEAKS 0 Tracks allocates that were not de-allocated (needs –optMemDbg or tau_exec –memory) TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1) TAU_EBS_PERIOD 100000 Specifies the overflow count for interrupts TAU_MEMDBG_ALLOC_MIN/MAX 0 Byte size minimum and maximum subject to bounds checking (used with TAU_MEMDBG_PROTECT_*) TAU_MEMDBG_OVERHEAD 0 Specifies the number of bytes for TAU’s memory overhead for memory debugging. TAU_MEMDBG_PROTECT_BELOW/ABOVE 0 Setting to 1 enables tracking runtime bounds checking below or above the array bounds (requires – optMemDbg while building or tau_exec –memory) TAU_MEMDBG_ZERO_MALLOC 0 Setting to 1 enables tracking zero byte allocations as invalid memory allocations. TAU_MEMDBG_PROTECT_FREE 0 Setting to 1 detects invalid accesses to deallocated memory that should not be referenced until it is reallocated (requires –optMemDbg or tau_exec –memory) TAU_MEMDBG_ATTEMPT_CONTINUE 0 Setting to 1 allows TAU to record and continue execution when a memory error occurs at runtime. TAU_MEMDBG_FILL_GAP Undefined Initial value for gap bytes TAU_MEMDBG_ALINGMENT Sizeof(int) Byte alignment for memory allocations TAU_EVENT_THRESHOLD 0.5 Define a threshold value (e.g., .25 is 25%) to trigger marker events for min/max Runtime Environment Variables (contd.)
  • 42. Download TAU from U. Oregon http://www.hpclinux.com [OVA file] http://tau.uoregon.edu for more information Free download, open source, BSD license
  • 43. Performance Research Lab, UO, Eugene www.uoregon.edu
  • 44. Support Acknowledgments • US Department of Energy (DOE) • ANL • Office of Science contracts, ECP • SciDAC, LBL contracts • LLNL-LANL-SNL ASC/NNSA contract • Battelle, PNNL and ORNL contract • Department of Defense (DoD) • PETTT, HPCMP • National Science Foundation (NSF) • SI2-SSI, Glassbox • NASA • CEA, France • IBM • Partners: • The Ohio State University • ParaTools, Inc. • University of Tennessee, Knoxville • T.U. Dresden, GWT • Jülich Supercomputing Center