Instrumentation and
  analysis of NPB
         Zafar Gilani
         EMDC 2012
Measurement Tools and Techniques
             UPC
Outline
●   Introduction to benchmark app
●   Testbeds
●   Instrumentation
●   Traces
●   Measurement criterion
●   Evaluation
●   Anomalies
●   Conclusions

    Introduction to benchmark app
    ● NPB = NAS Parallel Benchmarks.
    ● A small set of programs designed to
      evaluate performance of parallel
      supercomputers.
    ● 5 kernels, 3 pseudo applications.
    ● 3 versions: Serial, OpenMP, MPI.
    ● 8 problem classes:
      ○   S - small, for quick tests
      ○   W - workstation size
      ○   A, B, C - standard tests, ~4x size increase from one class to the next
      ○   D, E, F - large tests, ~16x size increase from each previous class

    Testbeds
                    Local                          Remote
     Machine type   Laptop                         Server
     Processor      Intel Core i3-330M, 2.13 GHz   Intel Xeon E5645, 2.40 GHz
     Cores          2                              6
     Cache (MB)     3                              12
     Memory (GB)    3                              24

    Instrumentation
    ● Preload Extrae's MPI trace library
      "libmpitrace.so".
    ● The library intercepts all the MPI calls and
      traces all the MPI events.
    ● Instrumented and executed:
       ○ NPB version 3.3 stable
       ○ NPB3.3-MPI
       ○ IS (Integer Sort) kernel with 2, 4, 8, 16 and 32 procs
    ● Per experiment:
       ○ Problem size: Class C, approx. 134 million keys (2^27)
       ○ Iterations: 10
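
    The preload step above could be scripted roughly as follows. This is a minimal sketch, not the setup actually used for these runs: the install prefix /opt/extrae, the config file name extrae.xml, the mpirun launcher and the helper name extrae_launch are all assumptions. EXTRAE_CONFIG_FILE is the environment variable Extrae reads for its configuration, and is.C.<nprocs> follows the NPB3.3-MPI binary naming.

```python
import os


def extrae_launch(nprocs, extrae_home, config_file, npb_bin="bin"):
    """Return (command, environment) for one traced run of the IS kernel.

    extrae_home, config_file and npb_bin are hypothetical paths; adjust
    them to the local installation.
    """
    env = dict(os.environ)
    # Extrae reads its XML configuration from this variable.
    env["EXTRAE_CONFIG_FILE"] = config_file
    # Preloading the MPI trace library lets it intercept the MPI calls.
    env["LD_PRELOAD"] = os.path.join(extrae_home, "lib", "libmpitrace.so")
    # NPB3.3-MPI names its binaries <benchmark>.<class>.<nprocs>.
    binary = os.path.join(npb_bin, f"is.C.{nprocs}")
    command = ["mpirun", "-np", str(nprocs), binary]
    return command, env


if __name__ == "__main__":
    # Process counts used in the experiments.
    for n in (2, 4, 8, 16, 32):
        cmd, env = extrae_launch(n, "/opt/extrae", "extrae.xml")
        print(" ".join(cmd), "| LD_PRELOAD =", env["LD_PRELOAD"])
```

    The returned pair can be handed to subprocess.run(cmd, env=env). Depending on the MPI launcher, a small wrapper script that sets LD_PRELOAD only for the ranks may be preferable, since setting it globally also preloads the library into the launcher itself.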
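
    Local traces
    [Figure: trace timeline on the local machine; labelled states: Exec, Comm]

    Remote traces
    [Figure: trace timeline on the remote machine]
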
Evaluation & Comparative
         Analysis

    Measurement criterion
    Metric               Relevance to NPB-MPI Integer Sort
    Computation time     General idea of speed-up.
    Communication time   Impact of increasing the number of processes
                         on communication.
    Load imbalance       Which processes or threads do less work
                         than the others.
    Bottlenecks          Where performance is held back.
    L1 cache misses      How often the CPU had to go beyond the L1
                         cache (to other memory) to find data.

    Computation time
    ● Measured: thread processing time.
    ● Local:
      ○ time increases as nprocs increases
      ○ up to 32 processes
      ○ poor scalability
    ● Remote:
      ○ time decreases as nprocs increases
      ○ up to 32 processes
      ○ good scalability
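
    To put "good" and "poor" scalability into numbers, relative speed-up and parallel efficiency can be computed from the measured times. The sketch below is illustrative only: the function name speedup_table and the timings are hypothetical, not measurements from these runs, and the smallest process count is taken as the baseline because no serial run is reported here.

```python
def speedup_table(times):
    """Map nprocs -> (speed-up, efficiency) relative to the smallest run.

    times: dict of {nprocs: computation time}, any consistent unit.
    """
    base = min(times)            # smallest process count is the baseline
    t_base = times[base]
    table = {}
    for n in sorted(times):
        speedup = t_base / times[n]
        # Ideal speed-up relative to the baseline is n / base.
        efficiency = speedup * base / n
        table[n] = (speedup, efficiency)
    return table


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    sample = {2: 40.0, 4: 20.0, 8: 16.0}
    for n, (s, e) in speedup_table(sample).items():
        print(f"nprocs={n:2d}  speed-up={s:.2f}  efficiency={e:.2f}")
```

    An efficiency that stays near 1 as nprocs grows corresponds to the remote behaviour described above; an efficiency that collapses corresponds to the local one.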

    Communication time
    ● Overall communication time is determined
      by the process taking maximum time.
    ● Local:
      ○ rapid increase in time as the number of processes
          increases
    ● Remote:
       ○ nominal increase in time as the number of processes
         increases
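
    The rule that the slowest process sets the overall communication time can be sketched as below. The helper name slowest_communicator and the sample values are hypothetical; in practice the per-process times would come from the trace.

```python
def slowest_communicator(comm_ms):
    """Return (rank, time) of the process with the longest communication time.

    comm_ms: list of per-process communication times, indexed by rank.
    The returned time is the overall communication time under the
    "maximum over processes" rule.
    """
    rank = max(range(len(comm_ms)), key=lambda r: comm_ms[r])
    return rank, comm_ms[rank]


if __name__ == "__main__":
    # Hypothetical per-rank communication times in milliseconds.
    rank, overall = slowest_communicator([120.0, 95.5, 310.2, 101.0])
    print(f"overall communication time: {overall} ms (set by rank {rank})")
```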

     Load Imbalance
     ● On boada (the remote server):
       ○ For nprocs = 4, threads 2 and 3 are lazy (they do less work).
       ○ For nprocs = 16, threads 5, 6, 7, 8 and 12 are lazy.
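
     One way to flag "lazy" threads from per-thread execution times is sketched here. It is an assumption-laden illustration, not the procedure used for these slides: the function name load_balance, the 0.8 threshold and the sample times are hypothetical. The balance figure is the common average-over-maximum ratio, where 1.0 means perfectly balanced.

```python
def load_balance(exec_ms, lazy_fraction=0.8):
    """Return (balance, lazy_threads) for a list of per-thread exec times.

    balance: mean time divided by maximum time (1.0 = perfectly balanced).
    lazy_threads: 1-based thread numbers whose time is below
    lazy_fraction of the maximum (the threshold is an arbitrary choice).
    """
    peak = max(exec_ms)
    balance = sum(exec_ms) / (len(exec_ms) * peak)
    lazy = [
        thread
        for thread, t in enumerate(exec_ms, start=1)
        if t < lazy_fraction * peak
    ]
    return balance, lazy


if __name__ == "__main__":
    # Hypothetical execution times (ms) for a 4-process run.
    balance, lazy = load_balance([100.0, 50.0, 50.0, 100.0])
    print(f"balance={balance:.2f}, lazy threads={lazy}")
```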
     [Figure: trace timeline on boada; labelled states: Exec, Wait, Comm]

     Bottlenecks
     ● For nprocs = {8, 16, 32}, one or more
       processes take more time than the rest.
       ○ Wait/Waitall (MPI wait calls).
       ○ Typical wait time on the local machine is around 1000 ms.
       ○ Typical wait time on the remote machine is around 250 ms.
         ■ 4x difference (threads on the remote machine have
            shorter wait times).

     [Figure: trace detail; labelled states: Wait, I/O]

     L1 cache misses
     ● Cache misses on the local machine are more
       expensive: typically costing 5x more time.
       ○ Cache size difference? Local has to "look"
          elsewhere more often.
          ■ i3 has 3 MB of last-level (L3) cache.
          ■ Xeon has 12 MB of last-level (L3) cache.

     Anomalies
     ● For 32 threads:
       ○ Time taken to spawn threads varies.
       ○ Remote takes less time to spawn 32 threads.
       ○ Possible reasons:
         ■ Acquiring locks and switching between resource
              acquisition and release is costly.
     ● Time taken by "other" jobs also varies:
       ○ But these generally vary from system to system.

     [Figure: trace detail; labelled regions: Spawning, Others ??]

     Conclusions
     ● Instrumentation is necessary to gain
       insight into the performance of parallel code.
     ● Extrae offers a convenient procedure for
       automatic instrumentation (library preloading).
     ● Some interesting observations:
       ○ IS does not properly scale on low-end machines
         beyond 16 procs.
       ○ Scales nicely on a server such as boada.
       ○ IS code becomes communication intensive when
         nprocs is increased.
       ○ Some bottlenecks deteriorate performance.