Parallel and Distributed Computing on Low Latency Clusters

                Vittorio Giovara
    M. S. Electrical Engineering and Computer Science
              University of Illinois at Chicago
                          May 2009
Contents
• Motivation
• Strategy
• Technologies
      • OpenMP
      • MPI
      • Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
Motivation

• The scaling trend of CMOS technology
  has to stop:
    ✓ Direct-tunneling limit in SiO2 ~3 nm
    ✓ Distance between Si atoms ~0.3 nm
    ✓ Variability

• Fundamental reason: rising fab cost
Motivation

• Multi-core processors are easy to build
• Software must be modified and adapted by hand
  to run concurrently
• New classification for computer
  architectures
Classification
[Figure: four architecture classes, SISD, SIMD, MISD and MIMD, each drawn as one or more CPUs placed between an instruction pool and a data pool]
Levels
[Figure: parallelization levels arranged by abstraction level (algorithm, loop level, process management), with an arrow labeled "easier to parallelize"]
• algorithm: recursion, memory management, profiling
• loop level: data dependency, branching overhead, control flow
• process management: SMP multiprogramming, multithreading and scheduling
Backfire

• Difficult to fully exploit the parallelism
  offered
• Automatic tools required to adapt software
  to parallelism
• Compiler support for manual or
  semi-automatic enhancement
Applications
• OpenMP and MPI are two popular tools
  used to simplify the parallelization of
  both new and old software
• Mathematics and Physics
• Computer Science
• Biomedics
Specific Problem and
     Background
• Sally3D is a micromagnetism program suite
  for field analysis and modeling developed at
  Politecnico di Torino (Department of
  Electrical Engineering)
• Computationally intensive (runs can take days
  of CPU time); speedup required
• Previous work does not fully cover the problem
  (no Infiniband or OpenMP+MPI solutions)
Strategy
• Install a Linux kernel with an ad-hoc
  configuration for scientific computation
• Compile an OpenMP-enabled GCC
  (supported from 4.3.1 onwards)
• Add the Infiniband link to the cluster, with
  proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify Infiniband network through some
  MPI test examples
• Install the target software
• Proceed to include OpenMP and MPI
  directives in the code
• Run test cases
OpenMP

• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple construction methods
OpenMP - example
[Figure: four parallel tasks (Parallel Task 1 to 4); the master thread forks into Thread A and Thread B, which execute the tasks, then a join returns control to the master thread]
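The fork-join pattern in the figure can be mimicked in a few lines. This is only an analogy written with a Python thread pool, not OpenMP itself; the task function `parallel_task` and the pool size of two (standing in for Thread A and Thread B) are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_task(task_id):
    # Hypothetical workload: each task simply squares its own id.
    return task_id * task_id


def fork_join(task_ids, num_threads=2):
    # Fork: the master thread hands the tasks to a pool of worker threads.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(parallel_task, task_ids))
    # Join: leaving the "with" block waits for every worker to finish,
    # so only the master thread continues from this point.
    return results


if __name__ == "__main__":
    print(fork_join([1, 2, 3, 4]))
```

In OpenMP the same structure is expressed with directives around the parallel region, and the runtime decides which thread runs which task.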
OpenMP Scheduler

• Which of the available schedulers suits the hardware?
   - Static
   - Dynamic
   - Guided
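As a rough illustration of how the three schedule kinds differ, the sketch below models how loop iterations are cut into chunks. It is a simplified model, not the OpenMP runtime: the function names are hypothetical, and the guided rule used here (each chunk is about the remaining iterations divided by the number of threads) is an assumption, since the exact sizes are implementation-defined. Dynamic scheduling uses the same fixed-size chunks as static, but hands them out at run time to whichever thread is free, so its assignment is not deterministic and is not modeled.

```python
def static_schedule(num_iterations, num_threads, chunk):
    """Assign fixed-size chunks to threads in round-robin order."""
    assignment = {t: [] for t in range(num_threads)}
    for index, start in enumerate(range(0, num_iterations, chunk)):
        end = min(start + chunk, num_iterations)
        assignment[index % num_threads].append((start, end))
    return assignment


def guided_chunk_sizes(num_iterations, num_threads, min_chunk=1):
    """Return chunk sizes that shrink as the remaining work runs out."""
    sizes = []
    remaining = num_iterations
    while remaining > 0:
        # Ceiling division: roughly remaining / num_threads, never below min_chunk.
        size = max(min_chunk, -(-remaining // num_threads))
        size = min(size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    print(static_schedule(10, 2, 3))
    print(guided_chunk_sizes(16, 4))
```

Small chunks balance the load better but cost more scheduling overhead, which is the trade-off the chunk-size measurements explore.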
OpenMP Scheduler
[Chart: OpenMP Static Scheduler. Execution time in microseconds (0 to 80000) versus number of threads (1 to 16), one series per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler
[Chart: OpenMP Dynamic Scheduler. Execution time in microseconds (0 to 117000) versus number of threads (1 to 16), one series per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler
[Chart: OpenMP Guided Scheduler. Execution time in microseconds (0 to 80000) versus number of threads (1 to 16), one series per chunk size: 1, 10, 100, 1000, 10000]
OpenMP Scheduler
[Figure: side-by-side comparison of the static scheduler, dynamic scheduler and guided scheduler]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
      - OpenMPI
      - MVAPICH
Infiniband

• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: transfer time in µs (logarithmic scale, 1 µs to 10000000 µs) versus message size (from 1 kB up to the GB range), two series: OpenMPI and Mvapich2]
MPI over Infiniband
[Chart: transfer time in µs (logarithmic scale, 1 µs to 10000000 µs) versus message size (from 1 kB up to the MB range), two series: OpenMPI and Mvapich2]
Optimizations

• Applied at compile time
• Available only after porting the software to
  standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations

• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software

• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• uses a linear formulation of the
  mathematical models
Implementation Scheme
[Figure: three loop models. Sequential loop: standard programming model. Parallel loop: OpenMP threads. Distributed loop: OpenMP threads on Host 1 and on Host 2, connected through MPI]
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays: synchronization
  objects required
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
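One way to picture the distributed loop is as a two-level split of the outer loop range: first across hosts (the MPI level), then across threads inside each host (the OpenMP level). The sketch below only computes such an index plan. It is an assumption made for illustration and not the decomposition actually used in Sally3D; `split_range` and `two_level_partition` are hypothetical names, and no MPI or OpenMP call is involved.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous blocks of near-equal size."""
    base, extra = divmod(stop - start, parts)
    blocks = []
    lo = start
    for p in range(parts):
        # The first `extra` blocks take one additional element each.
        hi = lo + base + (1 if p < extra else 0)
        blocks.append((lo, hi))
        lo = hi
    return blocks


def two_level_partition(num_slices, num_hosts, threads_per_host):
    """Map each (host, thread) pair to the slice range it would process."""
    plan = {}
    for host, (h_lo, h_hi) in enumerate(split_range(0, num_slices, num_hosts)):
        for thread, block in enumerate(split_range(h_lo, h_hi, threads_per_host)):
            plan[(host, thread)] = block
    return plan


if __name__ == "__main__":
    for owner, block in two_level_partition(10, 2, 2).items():
        print(owner, block)
```

Because the data is not embarrassingly parallel, values on the boundary between two blocks would still have to be exchanged between hosts or protected between threads, which is where the send() and recv() mechanism and the critical regions come in.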
Results
  OMP    MPI    OPT   seconds
   *      *      *       133
   *      *      -       400
   *      -      *       186
   *      -      -       487
   -      *      *       200
   -      *      -       792
   -      -      *       246
   -      -      -      1062




Total Speed Increase: 87.52%
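The arithmetic behind a figure like this can be checked directly from the table. The sketch below takes the slowest row (no OpenMP, no MPI, no compiler optimizations: 1062 s) and the fastest row (all three enabled: 133 s); reading the percentage as a reduction in run time is an assumption about how the slide computed it. Computed this way the reduction comes out at roughly 87.5%, which corresponds to a speedup of about 8x.

```python
def time_reduction_percent(baseline_seconds, optimized_seconds):
    """Percentage of the baseline run time that was saved."""
    return 100.0 * (baseline_seconds - optimized_seconds) / baseline_seconds


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


if __name__ == "__main__":
    baseline = 1062  # seconds: OMP off, MPI off, OPT off
    best = 133       # seconds: OMP on, MPI on, OPT on
    print(f"reduction: {time_reduction_percent(baseline, best):.2f}%")
    print(f"speedup:   {speedup(baseline, best):.2f}x")
```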
Actual Results
                  OMP       MPI      seconds
                   *         *          59
                   *         -         129
                   -         *         174
                   -         -         249




Function Name     Normal    OpenMP      MPI     OpenMP+MPI
calc_intmudua      24.5 s    4.7 s     14.4 s      2.8 s
calc_hdmg_tet      16.9 s    3.0 s     10.8 s      1.7 s
calc_mudua         12.1 s    1.9 s      7.0 s      1.1 s
campo_effettivo    17.7 s    4.5 s      9.9 s      2.3 s
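Per-function speedups follow from the table by dividing the "Normal" time by each parallel time. The sketch below does only that division on the values copied from the table; the dictionary layout and the helper name `speedups` are choices made for illustration.

```python
# Timings in seconds, copied from the table: (Normal, OpenMP, MPI, OpenMP+MPI)
timings = {
    "calc_intmudua": (24.5, 4.7, 14.4, 2.8),
    "calc_hdmg_tet": (16.9, 3.0, 10.8, 1.7),
    "calc_mudua": (12.1, 1.9, 7.0, 1.1),
    "campo_effettivo": (17.7, 4.5, 9.9, 2.3),
}


def speedups(row):
    """Return (OpenMP, MPI, OpenMP+MPI) speedups relative to the Normal time."""
    normal = row[0]
    return tuple(round(normal / t, 2) for t in row[1:])


if __name__ == "__main__":
    for name, row in timings.items():
        omp, mpi, both = speedups(row)
        print(f"{name:16s} OpenMP {omp}x  MPI {mpi}x  OpenMP+MPI {both}x")
```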
Actual Results

• OpenMP – 6-8x
• MPI – 2x
• OpenMP + MPI – 14 - 16x

     Total Raw Speed Increment: 76%
Conclusions
Conclusions and Future Work
• Computational time has been significantly
  decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size

More Related Content

What's hot

Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
Cots moves to multicore: AMD
Cots moves to multicore: AMDCots moves to multicore: AMD
Cots moves to multicore: AMDKonrad Witte
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsMário Almeida
 
Presentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturnePresentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturneRenuda SARL
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsShinya Takamaeda-Y
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004xlight
 
Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Gaurav Raina
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balasValentina Emilia Balas
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
Lect.10.arm soc.4 neon
Lect.10.arm soc.4 neonLect.10.arm soc.4 neon
Lect.10.arm soc.4 neonsean chen
 
stream processing engine
stream processing enginestream processing engine
stream processing enginetiana528
 

What's hot (15)

ISBI MPI Tutorial
ISBI MPI TutorialISBI MPI Tutorial
ISBI MPI Tutorial
 
2011.jtr.pbasanta.
2011.jtr.pbasanta.2011.jtr.pbasanta.
2011.jtr.pbasanta.
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
Cots moves to multicore: AMD
Cots moves to multicore: AMDCots moves to multicore: AMD
Cots moves to multicore: AMD
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache Simulations
 
Presentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturnePresentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_Saturne
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004
 
Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balas
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 
Lect.10.arm soc.4 neon
Lect.10.arm soc.4 neonLect.10.arm soc.4 neon
Lect.10.arm soc.4 neon
 
stream processing engine
stream processing enginestream processing engine
stream processing engine
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
 

Viewers also liked

Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Evaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing InfrastructureEvaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing InfrastructureEuroCloud
 
Cloud Computing Clusters for Dummies
Cloud Computing Clusters for DummiesCloud Computing Clusters for Dummies
Cloud Computing Clusters for DummiesLiberteks
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)Sri Prasanna
 
Clusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and CollaborationClusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and CollaborationWolfgang Gentzsch
 
SLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo ClusterSLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo ClusterSUSE
 
Grid computing ppt 2003(done)
Grid computing ppt 2003(done)Grid computing ppt 2003(done)
Grid computing ppt 2003(done)TASNEEM88
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
Grid computing notes
Grid computing notesGrid computing notes
Grid computing notesSyed Mustafa
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Krishna Petrochemicals
 

Viewers also liked (16)

Cloud Computing
Cloud Computing Cloud Computing
Cloud Computing
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Grid
GridGrid
Grid
 
Evaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing InfrastructureEvaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
Evaluation of Virtual Clusters Performance on a Cloud Computing Infrastructure
 
Cloud Computing Clusters for Dummies
Cloud Computing Clusters for DummiesCloud Computing Clusters for Dummies
Cloud Computing Clusters for Dummies
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)
 
Chapter16 new
Chapter16 newChapter16 new
Chapter16 new
 
Clusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and CollaborationClusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
Clusters, Grids & Clouds for Engineering Design, Simulation, and Collaboration
 
SLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo ClusterSLE12 SP2 : High Availability et Geo Cluster
SLE12 SP2 : High Availability et Geo Cluster
 
Grid computing ppt 2003(done)
Grid computing ppt 2003(done)Grid computing ppt 2003(done)
Grid computing ppt 2003(done)
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
Grid computing notes
Grid computing notesGrid computing notes
Grid computing notes
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 

Similar to Parallel and Distributed Computing on Low Latency Clusters

Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...Xavier Llorà
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesMiguel Araújo
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Accenture
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Accenture
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented DesignRodrigo Campos
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scalexcbsmith
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Betting On Data Grids
Betting On Data GridsBetting On Data Grids
Betting On Data Gridsgojkoadzic
 
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioFast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioOPNFV
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...Ryousei Takano
 
Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2Obsidian Software
 
PerfUG 3 - perfs système
PerfUG 3 - perfs systèmePerfUG 3 - perfs système
PerfUG 3 - perfs systèmeLudovic Piot
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Gaurav Raina
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelKoichi Shirahata
 
OpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosOpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosBrent Salisbury
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsGanesan Narayanasamy
 

Similar to Parallel and Distributed Computing on Low Latency Clusters (20)

Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated Databases
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库
 
Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库Acceleration for big data, hadoop and memcached it168文库
Acceleration for big data, hadoop and memcached it168文库
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scale
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Betting On Data Grids
Betting On Data GridsBetting On Data Grids
Betting On Data Grids
 
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioFast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2Kersten eder q4_2008_bristol_2
Kersten eder q4_2008_bristol_2
 
Dasia 2022
Dasia 2022Dasia 2022
Dasia 2022
 
PerfUG 3 - perfs système
PerfUG 3 - perfs systèmePerfUG 3 - perfs système
PerfUG 3 - perfs système
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming Model
 
OpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosOpenStack and OpenFlow Demos
OpenStack and OpenFlow Demos
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systems
 

Parallel and Distributed Computing on Low Latency Clusters

  • 1. Parallel and Distributed Computing on Low Latency Clusters
      Vittorio Giovara
      M. S. Electrical Engineering and Computer Science
      University of Illinois at Chicago
      May 2009
  • 2. Contents
      • Motivation
      • Strategy
      • Technologies
          • OpenMP
          • MPI
          • Infiniband
      • Application
          • Compiler Optimizations
          • OpenMP and MPI over Infiniband
      • Results
      • Conclusions
  • 4. Motivation
      • Scaling trend has to stop for CMOS technology:
          ✓ Direct-tunneling limit in SiO2 ~3 nm
          ✓ Distance between Si atoms ~0.3 nm
          ✓ Variability
      • Fundamental reason: rising fab cost
  • 5. Motivation
      • Easy to build multi-core processors
      • Human effort is required to modify and adapt software for concurrency
      • New classification for computer architectures
  • 6. Classification (diagram): SISD, SIMD, MISD, MIMD, each drawn as one or more CPUs fed by an instruction pool and a data pool
  • 7. (Diagram) Axis labels: easier to parallelize, abstraction level. Levels shown: algorithm, loop level, process management
  • 8. Levels (diagram labels): recursion, memory management, profiling, data dependency, branching overhead, control flow; algorithm, loop level, process management; SMP Multiprogramming, Multithreading and Scheduling
  • 9. Backfire
      • Difficult to fully exploit the parallelism offered
      • Automatic tools required to adapt software to parallelism
      • Compiler support for manual or semi-automatic enhancement
  • 10. Applications
      • OpenMP and MPI are two popular tools that simplify parallelizing both new and existing software
      • Mathematics and Physics
      • Computer Science
      • Biomedics
  • 11. Specific Problem and Background
      • Sally3D is a micromagnetism program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering)
      • Computationally intensive (up to days of CPU time); speedup required
      • Previous work does not fully cover the problem (no Infiniband or OpenMP+MPI solutions)
  • 13. Strategy
      • Install a Linux kernel with an ad-hoc configuration for scientific computation
      • Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
      • Add the Infiniband link among clusters with proper drivers in kernel and user space
      • Select an MPI implementation library
  • 14. Strategy
      • Verify the Infiniband network through some MPI test examples
      • Install the target software
      • Proceed to include OpenMP and MPI directives in the code
      • Run test cases
  • 15. OpenMP
      • standard
      • supported by most modern compilers
      • requires little knowledge of the software
      • very simple construction methods
  • 17. OpenMP - example (diagram): Parallel Task 1, Parallel Task 2, Parallel Task 3, Parallel Task 4
  • 18. (Diagram labels) Master Thread; Thread A; Thread B; Parallel Task 1, Parallel Task 2, Parallel Task 3, Parallel Task 4; Join
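
The fork/join pattern in the two diagrams above can be sketched with OpenMP `sections`. This is a minimal illustration, not code from Sally3D: the function names `partial_sum` and `sum_in_four_tasks` and the choice of summing an array are assumptions made for the example. Compiled without OpenMP support, the pragmas are ignored and the four tasks simply run in order.

```c
#include <stddef.h>

/* Hypothetical task body: sums the elements in [begin, end). */
static double partial_sum(const double *data, size_t begin, size_t end)
{
    double s = 0.0;
    for (size_t i = begin; i < end; i++)
        s += data[i];
    return s;
}

/* Fork: four independent tasks that may run on different threads.
   Join: the implicit barrier at the end of the sections construct. */
double sum_in_four_tasks(const double *data, size_t n)
{
    double part[4] = {0.0, 0.0, 0.0, 0.0};
    size_t q = n / 4;

    #pragma omp parallel sections
    {
        #pragma omp section
        part[0] = partial_sum(data, 0, q);          /* Parallel Task 1 */
        #pragma omp section
        part[1] = partial_sum(data, q, 2 * q);      /* Parallel Task 2 */
        #pragma omp section
        part[2] = partial_sum(data, 2 * q, 3 * q);  /* Parallel Task 3 */
        #pragma omp section
        part[3] = partial_sum(data, 3 * q, n);      /* Parallel Task 4 */
    }

    /* After the join, only the master thread continues. */
    return part[0] + part[1] + part[2] + part[3];
}
```

Each task writes to its own element of `part`, so no synchronization is needed until the join.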
  • 19. OpenMP Scheduler
      • Which schedulers are available for the hardware?
          - Static
          - Dynamic
          - Guided
  • 20. OpenMP Scheduler (chart: OpenMP Static Scheduler Chart; time in microseconds vs. number of threads, 1 to 16; series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000)
  • 21. OpenMP Scheduler (chart: OpenMP Dynamic Scheduler Chart; time in microseconds vs. number of threads, 1 to 16; series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000)
  • 22. OpenMP Scheduler (chart: OpenMP Guided Scheduler Chart; time in microseconds vs. number of threads, 1 to 16; series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000)
  • 24. OpenMP Scheduler (charts: static scheduler, dynamic scheduler, guided scheduler)
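
As a rough sketch of how the three schedulers compared above are selected in code, the same loop can be annotated with a different `schedule` clause. The function names and the scaling operation are assumptions for illustration; the chunk size of 100 is one of the values shown in the charts, not a recommendation.

```c
#include <stddef.h>

/* Static: iterations are split into chunks of 100 and assigned to
   threads in round-robin order before the loop starts. */
void scale_static(double *v, size_t n, double k)
{
    #pragma omp parallel for schedule(static, 100)
    for (long i = 0; i < (long)n; i++)
        v[i] *= k;
}

/* Dynamic: each thread requests the next chunk of 100 iterations
   whenever it finishes its current one. */
void scale_dynamic(double *v, size_t n, double k)
{
    #pragma omp parallel for schedule(dynamic, 100)
    for (long i = 0; i < (long)n; i++)
        v[i] *= k;
}

/* Guided: chunks start large and shrink as work runs out,
   never going below 100 iterations (except possibly the last chunk). */
void scale_guided(double *v, size_t n, double k)
{
    #pragma omp parallel for schedule(guided, 100)
    for (long i = 0; i < (long)n; i++)
        v[i] *= k;
}
```

All three variants compute the same result; only the assignment of iterations to threads, and therefore the run time, differs.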
  • 25. MPI
      • standard
      • widely used in cluster environments
      • many transport links supported
      • different implementations available
          - OpenMPI
          - MVAPICH
  • 26. Infiniband
      • standard
      • widely used in cluster environments
      • very low latency for small packets
      • up to 16 Gb/s transfer speed
  • 27. MPI over Infiniband (chart: time in µs, axis ticks from 1 µs to 10,000,000 µs, vs. message size from bytes to gigabytes; series: OpenMPI, Mvapich2)
  • 28. MPI over Infiniband (chart: time in µs, axis ticks from 1 µs to 10,000,000 µs, vs. message size from bytes to megabytes; series: OpenMPI, Mvapich2)
  • 29. Optimizations
      • Active at compile time
      • Available only after porting the software to standard FORTRAN
      • Consistent documentation available
      • Unexpected positive results
  • 32. Target Software
      • Sally3D
      • micromagnetic equation solver
      • written in FORTRAN with some C libraries
      • uses a linear formulation of the mathematical models
  • 33. Implementation Scheme (diagram)
      • sequential loop: standard programming model
      • parallel loop: OpenMP Threads
      • distributed loop: OpenMP Threads on Host 1 and Host 2, connected by MPI
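
The distributed-loop idea in the scheme above can be sketched as a block decomposition: each host takes a contiguous slice of the iterations and shares that slice among its OpenMP threads. Everything named here (`range_t`, `host_range`, `local_sum_of_squares`, the sum-of-squares workload) is hypothetical and chosen for illustration. The host index is passed in as a plain argument so the sketch needs no MPI library; in an actual MPI program it would typically come from `MPI_Comm_rank` and `MPI_Comm_size`, and the partial results would be combined with a call such as `MPI_Reduce`.

```c
#include <stddef.h>

/* Half-open range [begin, end) of loop iterations owned by one host. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Block decomposition: splits n iterations among nhosts hosts.
   The first (n % nhosts) hosts receive one extra iteration. */
range_t host_range(size_t n, int host, int nhosts)
{
    size_t base = n / (size_t)nhosts;
    size_t extra = n % (size_t)nhosts;
    size_t h = (size_t)host;
    range_t r;
    r.begin = h * base + (h < extra ? h : extra);
    r.end = r.begin + base + (h < extra ? 1 : 0);
    return r;
}

/* Work done on one host: its slice of the loop, shared among
   OpenMP threads, with a reduction to combine per-thread sums. */
double local_sum_of_squares(const double *x, size_t n, int host, int nhosts)
{
    range_t r = host_range(n, host, nhosts);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = (long)r.begin; i < (long)r.end; i++)
        sum += x[i] * x[i];

    return sum;
}
```

Adding the values returned for every host index gives the same result as the sequential loop over all `n` elements.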
  • 34. Implementation Scheme
      • Data Structure: not embarrassingly parallel
      • Three-dimensional matrix
      • Several temporary arrays; synchronization objects required
          ➡ send() and recv() mechanism
          ➡ critical regions using OpenMP directives
          ➡ function merging
          ➡ matrix conversion
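
A critical region of the kind listed above protects a shared temporary array when several threads may update the same element. The sketch below is an assumed example, not taken from Sally3D: `histogram` and its binning rule are hypothetical, and the input values are assumed to lie in [0, 1).

```c
#include <stddef.h>

/* Counts how many values fall into each of nbins equal-width bins
   on [0, 1). hist is the shared array that needs protection. */
void histogram(const double *values, size_t n, int *hist, int nbins)
{
    for (int b = 0; b < nbins; b++)
        hist[b] = 0;

    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        int b = (int)(values[i] * nbins);
        if (b >= nbins)
            b = nbins - 1;  /* guard against rounding at the upper edge */

        /* Only one thread at a time may execute the update below. */
        #pragma omp critical
        hist[b]++;
    }
}
```

For a single increment like this one, `#pragma omp atomic` is usually a cheaper alternative; `critical` is shown because it also covers larger multi-statement updates.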
  • 36. Results (* = enabled, - = disabled)
      OMP  MPI  OPT  seconds
       *    *    *       133
       *    *    -       400
       *    -    *       186
       *    -    -       487
       -    *    *       200
       -    *    -       792
       -    -    *       246
       -    -    -      1062
      Total Speed Increase: 87.52%
  • 37. Actual Results (* = enabled, - = disabled)
      OMP  MPI  seconds
       *    *        59
       *    -       129
       -    *       174
       -    -       249

      Function Name    Normal  OpenMP  MPI     OpenMP+MPI
      calc_intmudua    24.5 s  4.7 s   14.4 s  2.8 s
      calc_hdmg_tet    16.9 s  3.0 s   10.8 s  1.7 s
      calc_mudua       12.1 s  1.9 s    7.0 s  1.1 s
      campo_effettivo  17.7 s  4.5 s    9.9 s  2.3 s
  • 38. Actual Results
      • OpenMP: 6-8x
      • MPI: 2x
      • OpenMP + MPI: 14-16x
      Total Raw Speed Increment: 76%
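
The 76% figure is consistent with a relative reduction in total run time, computed from the 249 s baseline and the 59 s OpenMP+MPI timing reported under Actual Results: 1 - 59/249 ≈ 76.3%. The helper names below are hypothetical, and reading the figure this way is an assumption, since the slides do not state the formula.

```c
/* Speedup factor: how many times faster the parallel run is. */
double speedup(double t_base, double t_parallel)
{
    return t_base / t_parallel;
}

/* Relative reduction in run time, expressed in percent. */
double time_reduction_percent(double t_base, double t_parallel)
{
    return 100.0 * (1.0 - t_parallel / t_base);
}
```

With these definitions, `time_reduction_percent(249.0, 59.0)` is about 76.3 and `speedup(249.0, 59.0)` is about 4.2.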
  • 40. Conclusions and Future Work
      • Computational time has been significantly decreased
      • Speedup is consistent with expected results
      • Submitted to COMPUMAG '09
      • Continue inserting OpenMP and MPI directives
      • Perform algorithm optimizations
      • Increase cluster size