On the memory behavior of a
multifrontal QR software
for multicore systems




Alfredo Buttari, CNRS-IRIT

Toulouse, September 6-7, 2011
The multifrontal QR method
The Multifrontal QR for newbies


  The multifrontal QR factorization is
  guided by a graph called the elimination
  tree:
  • at each node of the tree, k pivots are
    eliminated
  • each node of the tree is associated
    with a relatively small dense matrix,
    called the frontal matrix (or, simply, front),
    which contains the k columns related
    to the pivots and all the other
    coefficients involved in their
    elimination



3/25                                               Toulouse, September 6-7, 2011
The Multifrontal QR for newbies

   The tree is traversed in topological order (i.e., bottom-up) and, at
   each node, two operations are performed:
  • assembly: a set of coefficients from the
    original matrix associated with the pivots
    and a number of contribution blocks
    produced by the treatment of the child
    nodes are assembled together to form
    the frontal matrix
  • factorization: the k pivots are eliminated
    through a complete QR factorization of
    the frontal matrix. As a result we get:
       ◦ k rows of the global R factor
       ◦ a set of Householder vectors
       ◦ a triangular contribution block that will
         be assembled into the parent's front
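The factorization step above can be sketched with a toy dense Householder QR in plain Python. This is only an illustration of the idea (the matrix, the function name and the pivot count k are made up; the real code works on staircase-structured, blocked fronts): after a full QR of a front with k pivot columns, the first k rows of R belong to the global R factor and the trailing triangle is the contribution block passed to the parent.

```python
import math

def householder_qr(A):
    """Dense Householder QR on an m x n matrix (list of rows), in place.

    On exit the upper triangle of A holds R; the returned list contains the
    normalized Householder vectors, one per eliminated column.
    """
    m, n = len(A), len(A[0])
    vs = []
    for k in range(min(m, n)):
        # build the Householder vector that annihilates A[k+1:m, k]
        x = [A[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)),
                               x[0] if x[0] else 1.0)
        v = x[:]
        v[0] -= alpha
        vnorm = math.sqrt(sum(t * t for t in v))
        if vnorm == 0.0:          # column already zero below the diagonal
            vs.append(v)
            continue
        v = [t / vnorm for t in v]
        vs.append(v)
        # apply the reflector I - 2 v v^T to the trailing columns
        for j in range(k, n):
            s = sum(v[i] * A[k + i][j] for i in range(m - k))
            for i in range(m - k):
                A[k + i][j] -= 2.0 * s * v[i]
    return vs

# A 5x4 toy "front" with k = 2 pivot columns: after the full QR, rows 0-1
# of R go to the global R factor and rows 2-3 (columns 2-3) form the
# triangular contribution block assembled into the parent's front.
front = [[4.0, 1.0, 0.0, 2.0],
         [1.0, 3.0, 1.0, 0.0],
         [0.0, 1.0, 2.0, 1.0],
         [2.0, 0.0, 1.0, 3.0],
         [1.0, 1.0, 0.0, 1.0]]
householder_qr(front)
k = 2
r_rows = [row[:] for row in front[:k]]          # -> global R factor
contribution = [row[k:] for row in front[k:4]]  # -> parent's front
```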

Multifrontal QR, parallelism
The Multifrontal QR: parallelism


   Two sources of parallelism are available in any multifrontal method:


   Tree parallelism

       • fronts associated with nodes in different branches are
         independent and can, thus, be factorized in parallel



   Front parallelism

       • if the size of a front is big enough, multiple processes may be
         used to factorize it



Parallelism: classical approach

   The classical approach (Puglisi, Matstoms, Davis)
       • Tree parallelism:
         ◦ a front assembly+factorization corresponds to a task
         ◦ computational tasks are added to a task pool
         ◦ threads fetch tasks from the pool repeatedly until all the fronts are
           done
       • Front parallelism:
         ◦ Multithreaded BLAS for the front factorization

   What’s wrong with this approach? A complete separation of the
   two levels of parallelism which causes
       • potentially strong load unbalance
       • heavy synchronizations due to the sequential nature of some
         operations (assembly)
       • sub-optimal exploitation of the concurrency in the multifrontal
7/25     method                                            Toulouse, September 6-7, 2011
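The classical scheme can be sketched as a shared task pool: one task per front, and worker threads that fetch from the pool until all fronts are done. All names here are illustrative, and the sketch deliberately omits what a real code must enforce, namely that a front enters the pool only once all of its children have been processed:

```python
import queue
import threading

def classical_pool(fronts, n_threads, factorize):
    """Toy version of the classical scheme: one coarse task per front,
    threads repeatedly fetch tasks from a shared pool until it is empty.
    NOTE: tree-order constraints (children before parent) are ignored."""
    pool = queue.Queue()
    for f in fronts:
        pool.put(f)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                front = pool.get_nowait()
            except queue.Empty:
                return
            result = factorize(front)   # assembly + front factorization
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Front parallelism would then come only from a multithreaded BLAS inside `factorize`, which is exactly the separation of the two parallelism levels criticized above.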
Parallelism: a new approach


   A fine-grained, data-flow parallel approach:

       • fine granularity: tasks are not defined as
         operations on fronts but as operations on
         portions of fronts defined by a 1-D partitioning
       • data-flow parallelism: tasks are scheduled
         dynamically based on the dependencies
         between them
   Both node and tree parallelism are handled the same way at any
   level of the tree.


Fine grained, asynchronous,
parallel QR
Parallelism: a new approach

   Fine-granularity is achieved through a 1-D block partitioning of
   fronts and the definition of five elementary operations:
  1. activate(front): the activation of a front
     corresponds to a full determination of its
     (staircase) structure and the allocation of
     the needed memory areas
  2. panel(bcol): QR factorization (Level-2
     BLAS) of a block-column
  3. update(bcol): update of a block-column in
     the trailing submatrix with respect to a panel
  4. assemble(bcol): assembly of a block-column
     of the contribution block into the parent
  5. clean(front): cleanup of the front in order
     to release all the memory areas that are no
     longer needed
Parallelism: a new approach

    If a task is defined as the execution of one elementary operation
    on a block-column or a front, then the entire multifrontal
    factorization can be represented as a Directed Acyclic Graph
    (DAG)




    where the nodes represent tasks and edges the dependencies
    among them
Parallelism: a new approach
    • d1: no other elementary operation can be executed on a front or on one of
        its block-columns until the front has been activated;
    • d2: a block-column can be updated with respect to a panel only if the
        corresponding panel factorization is completed;
    • d3: the panel operation can be executed on block-column i only if it is
        up-to-date with respect to panel i − 1;
    • d4: a block-column can be updated with respect to a panel i in its front only
        if it is up-to-date with respect to the previous panel i − 1 in the same front;
    • d5: a block-column can be assembled into the parent (if it exists) when it is
        up-to-date with respect to the last panel factorization to be performed on the
        front it belongs to (in this case it is assumed that block-column i is
        up-to-date with respect to panel i when the corresponding panel operation is
        executed);
    • d6: no other elementary operation can be executed on a block-column until
        all the corresponding portions of the contribution blocks from the child
        nodes have been assembled into it, in which case the block-column is said to
        be assembled;
    • d7: since the structure of a frontal matrix depends on the structure of its
        children, a front can be activated only if all of its children are already
        active.
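The data-flow execution implied by these dependencies can be sketched with a small DAG executor: a task becomes ready as soon as all the tasks it depends on have completed. The executor below is single-threaded for clarity (the real code hands ready tasks to threads), and the example front with two block-columns uses made-up task names:

```python
from collections import deque

def dataflow_execute(tasks, deps, run):
    """Run a task DAG in data-flow style: a task becomes ready as soon as
    all of its dependencies are satisfied (cf. edges d1-d7 above).
    `deps` maps a task to the set of tasks it depends on."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, ds in remaining.items():
        for d in ds:
            successors[d].append(t)
    ready = deque(t for t in tasks if not remaining[t])
    order = []
    while ready:
        t = ready.popleft()           # in the real code: picked up by a thread
        run(t)
        order.append(t)
        for s in successors[t]:       # release the tasks that waited on t
            remaining[s].discard(t)
            if not remaining[s]:
                ready.append(s)
    if len(order) != len(tasks):
        raise ValueError("cyclic or unsatisfiable dependencies")
    return order

# A hypothetical front with two block-columns (illustrative names only):
tasks = ["activate", "panel1", "update2", "panel2", "assemble2", "clean"]
deps = {
    "panel1": {"activate"},              # d1
    "update2": {"activate", "panel1"},   # d1, d2
    "panel2": {"update2"},               # d3
    "assemble2": {"panel2"},             # d5
    "clean": {"assemble2"},
}
schedule = dataflow_execute(tasks, deps, lambda t: None)
```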
Parallelism: a new approach

   Tasks are scheduled dynamically and asynchronously




Results




Tasks scheduling
Target architecture
   Our target architecture has four hexa-core AMD Istanbul
   processors connected through HyperTransport links in a ring
   topology, with a memory module attached to each of them:




   The bandwidth depends on the number of hops
Proximity scheduling

    The scheduling can be guided by a proximity criterion: a task
    should be executed by the core which is closest to the concerned
    data. This can be implemented through a system of task queues,
    one per thread/core:

                            • When a thread activates a front, it becomes
                              the front's owner
                            • All the tasks related to a front are pushed
                              on the queue associated with its owner
                            • Work-stealing is used to feed threads
                              that run out of tasks:
                               ◦ a thread will first try to steal tasks from
                                 neighbor queues...
                               ◦ ...and then from any other queue

    No front-to-core mapping is done (yet)!
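The queue discipline just described can be sketched as follows. The function names are hypothetical, and the sketch is single-threaded for clarity (a real implementation needs per-queue locking or lock-free deques); `neighbors[me]` lists the other queues sorted by increasing NUMA distance, so "neighbors first, then any other queue" is one loop:

```python
from collections import deque

def push_front_tasks(queues, owner, tasks):
    """All tasks of a front go to the queue of the thread that activated
    the front and therefore owns it."""
    queues[owner].extend(tasks)

def next_task(queues, me, neighbors):
    """Proximity-aware task fetch: pop from your own queue first, then
    steal from the other queues in order of increasing distance."""
    if queues[me]:
        return queues[me].popleft()    # local work: data is close by
    for q in neighbors[me]:            # neighbors first, then the rest
        if queues[q]:
            return queues[q].pop()     # steal from the far end
    return None                        # nothing left anywhere
```

A thread would simply loop on `next_task` until it returns `None` and no more tasks can become ready.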
Proximity scheduling

   Implementing all this requires the ability to:
    • control the placement of threads: we have to bind each thread
      to a single core and prevent thread migration. This can be
      done in a number of ways, e.g. by means of tools such as
      hwloc, which allows thread pinning
    • control the placement of data: we have to make sure that each
      front physically resides on a specific NUMA module. This can
      be done with:
        ◦ the first-touch rule: the data is allocated close to the core that
          makes the first reference
        ◦ hwloc or libnuma, which provide NUMA-aware allocators
    • detect the architecture: we have to figure out the memory/core
      layout in order to guide the work stealing. This can be done
      with hwloc
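For the thread-placement and detection points, a minimal Linux-only sketch is possible with the standard library alone (an assumption: `os.sched_setaffinity`/`os.sched_getaffinity` exist only on Linux; hwloc, as used in the slides, is the portable route and additionally reports which NUMA node each core belongs to):

```python
import os

def pin_current_thread(core_id):
    """Bind the calling thread to a single core and prevent migration
    (Linux-only sketch; hwloc offers the portable equivalent)."""
    os.sched_setaffinity(0, {core_id})   # 0 = the calling thread

def detect_cores():
    """Cores this process may run on. hwloc would additionally expose
    the memory/core layout needed to guide the work stealing."""
    return sorted(os.sched_getaffinity(0))
```

NUMA-aware data placement (first touch, or the allocators in hwloc/libnuma) has no stdlib equivalent and is therefore not sketched here.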


Proximity scheduling: experiments
   Experimental results show that proximity scheduling is good:

              Matrix     Strat.    Time (s)   Dwords on HT (×10⁹)
              Rucci1     no loc.     155          2000
              Rucci1        loc.     144          1634
              ohne2      no loc.      43           504
              ohne2         loc.      39           375
              lp_nug20   no loc.      89           879
              lp_nug20      loc.      84           778

   Why? Fewer Dwords are transferred through the HyperTransport links,
   as shown by the number of occurrences of the PAPI
   HYPERTRANSPORT_LINKx:DATA_DWORD_SENT event.
   But this is not the whole story!
Target architecture

   What happens when more threads are active at the same time on
   the system?




   BW drops; why? memory conflicts!
Experiments on conflicts

   Should we do something to reduce conflicts?




   Memory interleaving should provide a more uniform distribution
   of the data, which (supposedly) reduces conflicts and increases the
   memory bandwidth
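The intuition can be made concrete with a toy page-placement model (all names and numbers below are made up for illustration): under a pure locality policy every page of a front lands on its owner's module, so that module's controller takes all the traffic for that front, while round-robin interleaving spreads consecutive pages evenly across all modules.

```python
from collections import Counter

def place_pages(n_pages, n_modules, policy, owner=0):
    """Module hosting each page of one front under two toy policies:
    'local'      -> every page on the owner's NUMA module
    'interleave' -> pages assigned round-robin across modules"""
    if policy == "local":
        return [owner] * n_pages
    return [p % n_modules for p in range(n_pages)]

# One front of 64 pages on a 4-module machine:
local_load = Counter(place_pages(64, 4, "local", owner=2))        # all on module 2
interleaved_load = Counter(place_pages(64, 4, "interleave"))      # 16 pages per module
```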
Experiments on conflicts

        Matrix     Strat.    Time (s)   Dwords on HT (×10⁹)   Conflicts on DCT (×10⁹)
        Rucci1     no loc.     155          2000                     23.8
        Rucci1        loc.     144          1634                     25.4
        Rucci1       r. r.     117          2306                     19.7
        ohne2      no loc.      43           504                     6.63
        ohne2         loc.      39           375                     6.71
        ohne2        r. r.      38           665                     4.54
        lp_nug20   no loc.      89           879                     11.4
        lp_nug20      loc.      84           778                     11.9
        lp_nug20     r. r.      66           985                     6.81
   ("r. r." = round-robin memory interleaving.)
   Conflicts are DRAM_ACCESSES_PAGE:DCTx_PAGE_CONFLICT events
   measured by PAPI.
   Analogous behavior was found in the HSL MA87 code.
   The answer is: yes, we should reduce conflicts (work in progress).
Conclusions


    • fine granularity, asynchronism and data-flow parallelism are
        tough (especially all together), but the effort largely pays off
    • a correct exploitation of the memory system is critical,
        especially on NUMA systems
    • reducing memory transfers gives some performance
        improvement but...
    • ...maybe it is not the most important thing: memory conflicts
        seem to be far more penalizing than data traffic. How do we
        solve this? Not clear for the moment. Maybe a careful
        front-to-memory mapping? Ideas and suggestions are very
        welcome



Thank you all!
Questions?

Graph Attention Networks.pptx
 
4 026
4 0264 026
4 026
 
Wiki 2
Wiki 2Wiki 2
Wiki 2
 
93 99
93 9993 99
93 99
 
Network topologies
Network topologiesNetwork topologies
Network topologies
 
Neural Architecture Search: Learning How to Learn
Neural Architecture Search: Learning How to LearnNeural Architecture Search: Learning How to Learn
Neural Architecture Search: Learning How to Learn
 
Architecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPUArchitecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPU
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methods
 
Types of network topology, hub, switch, router, repeater and brouter
Types of network topology, hub, switch, router, repeater and brouterTypes of network topology, hub, switch, router, repeater and brouter
Types of network topology, hub, switch, router, repeater and brouter
 
This is introduction to distributed systems for the revised curiculum
This is introduction to distributed systems for the revised curiculumThis is introduction to distributed systems for the revised curiculum
This is introduction to distributed systems for the revised curiculum
 

Recently uploaded

Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKUXDXConf
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty SecureFemke de Vroome
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024Stephanie Beckett
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 

Recently uploaded (20)

Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 

Multithreaded sparse QR

  • 1. On the memory behavior of a multifrontal QR software for multicore systems Alfredo Buttari, CNRS-IRIT Toulouse, September 6-7, 2011
  • 3. The Multifrontal QR for newbies
     The multifrontal QR factorization is guided by a graph called the elimination tree:
     ◦ at each node of the tree, k pivots are eliminated
     ◦ each node of the tree is associated with a relatively small dense matrix called a frontal matrix (or, simply, front), which contains the k columns related to the pivots and all the other coefficients concerned by their elimination
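The bottom-up traversal of the elimination tree can be sketched as a plain postorder walk; the tree layout, node names and the `postorder` helper below are illustrative, not part of the software described here.

```python
# Hypothetical illustration: an elimination tree stored as a child list,
# traversed in topological (bottom-up) order so that every child is
# visited before its parent -- the order in which fronts are processed.

def postorder(children, root):
    """Return a bottom-up (postorder) visiting order of the tree nodes."""
    order = []
    def visit(node):
        for c in children.get(node, []):
            visit(c)
        order.append(node)
    visit(root)
    return order

# Tiny example tree: fronts 1 and 2 are leaves, 3 is their parent, 4 the root.
children = {4: [3], 3: [1, 2]}
order = postorder(children, 4)
```

Any postorder satisfies the multifrontal constraint: a parent front is only reached once all its children have produced their contribution blocks.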
  • 4-6. The Multifrontal QR for newbies
     The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:
     ◦ assembly: a set of coefficients from the original matrix associated with the pivots and a number of contribution blocks produced by the treatment of the child nodes are assembled together to form the frontal matrix
     ◦ factorization: the k pivots are eliminated through a complete QR factorization of the frontal matrix. As a result we get:
        - k rows of the global R factor
        - a bunch of Householder vectors
        - a triangular contribution block that will be assembled into the parent's front
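The per-front work above can be sketched with NumPy on a toy dense front; the `process_front` helper, the front sizes and the use of `numpy.linalg.qr` are illustrative assumptions, not the actual kernel of the software.

```python
# A minimal sketch (NumPy, toy data) of the per-front work: factorize a
# dense frontal matrix with QR, keep the first k rows of R as finished
# rows of the global R factor, and pass the remaining triangular block
# on as the contribution block assembled into the parent front.
import numpy as np

def process_front(front, k):
    """QR-factorize a frontal matrix; split R into finished rows and CB."""
    _, r = np.linalg.qr(front, mode="reduced")
    r_rows = r[:k, :]          # k rows of the global R factor
    contribution = r[k:, k:]   # triangular block passed to the parent
    return r_rows, contribution

rng = np.random.default_rng(0)
front = rng.standard_normal((6, 4))   # a toy 6x4 front, k = 2 pivots
r_rows, cb = process_front(front, k=2)
```

The Householder vectors (the Q factor) are kept implicitly in a real code; here the reduced R is enough to show the split between finished rows and the contribution block.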
  • 8. The Multifrontal QR: parallelism
     Two sources of parallelism are available in any multifrontal method:
     ◦ tree parallelism: fronts associated with nodes in different branches are independent and can, thus, be factorized in parallel
     ◦ front parallelism: if the size of a front is big enough, multiple processes may be used to factorize it
  • 9. Parallelism: classical approach
     The classical approach (Puglisi, Matstoms, Davis):
     ◦ tree parallelism: a front assembly+factorization corresponds to a task; computational tasks are added to a task pool; threads fetch tasks from the pool repeatedly until all the fronts are done
     ◦ front parallelism: multithreaded BLAS for the front factorization
     What's wrong with this approach? A complete separation of the two levels of parallelism, which causes:
     ◦ potentially strong load unbalance
     ◦ heavy synchronizations due to the sequential nature of some operations (assembly)
     ◦ sub-optimal exploitation of the concurrency in the multifrontal method
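The classical task-pool scheme described above can be sketched as follows; the `factorize_all` helper, the shared `queue.Queue` pool and the worker loop are illustrative (a real code would only enqueue fronts whose children have been processed, rather than pre-filling the pool in topological order).

```python
# A sketch of the classical scheme: one task per front, a shared pool,
# and worker threads that repeatedly fetch tasks until all fronts are done.
import queue
import threading

def factorize_all(fronts, nthreads=4):
    pool = queue.Queue()
    for f in fronts:           # assumed already in topological order
        pool.put(f)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                f = pool.get_nowait()
            except queue.Empty:
                return
            # assembly + factorization of front f would happen here
            with lock:
                done.append(f)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

done = factorize_all(list(range(10)))
```

Note how the two levels of parallelism never mix: each front is a single task, and any parallelism inside a front would have to come from the BLAS.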
  • 10. Parallelism: a new approach
     A fine-grained, data-flow parallel approach:
     ◦ fine granularity: tasks are not defined as operations on fronts but as operations on portions of fronts defined by a 1-D partitioning
     ◦ data-flow parallelism: tasks are scheduled dynamically based on the dependencies between them
     Both node and tree parallelism are handled the same way at any level of the tree.
  • 12. Parallelism: a new approach
     Fine granularity is achieved through a 1-D block partitioning of fronts and the definition of five elementary operations:
     1. activate(front): the activation of a front corresponds to the full determination of its (staircase) structure and the allocation of the needed memory areas
     2. panel(bcol): QR factorization (Level-2 BLAS) of a block-column
     3. update(bcol): update of a block-column in the trailing submatrix with respect to a panel
     4. assemble(bcol): assembly of a block-column of the contribution block into the parent
     5. clean(front): cleanup of the front in order to release all the memory areas that are no longer needed
  • 13. Parallelism: a new approach
     If a task is defined as the execution of one elementary operation on a block-column or a front, then the entire multifrontal factorization can be represented as a Directed Acyclic Graph (DAG) where nodes represent tasks and edges the dependencies among them.
  • 14. Parallelism: a new approach
     ◦ d1: no other elementary operation can be executed on a front or on one of its block-columns until the front has been activated
     ◦ d2: a block-column can be updated with respect to a panel only if the corresponding panel factorization is completed
     ◦ d3: the panel operation can be executed on block-column i only if it is up-to-date with respect to panel i-1
     ◦ d4: a block-column can be updated with respect to a panel i in its front only if it is up-to-date with respect to the previous panel i-1 in the same front
     ◦ d5: a block-column can be assembled into the parent (if it exists) when it is up-to-date with respect to the last panel factorization to be performed on the front it belongs to (block-column i is assumed to be up-to-date with respect to panel i when the corresponding panel operation is executed)
     ◦ d6: no other elementary operation can be executed on a block-column until all the corresponding portions of the contribution blocks from the child nodes have been assembled into it, in which case the block-column is said to be assembled
     ◦ d7: since the structure of a frontal matrix depends on the structure of its children, a front can be activated only if all of its children are already active
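A few of these dependencies (d2, d3 and d4) can be turned into explicit DAG edges for a single front; the `front_dag` helper and its task naming are purely illustrative, not the representation used by the software.

```python
# A sketch of how dependencies d2, d3 and d4 translate into DAG edges for
# one front with n block-columns. Tasks are (operation, column) pairs and
# each edge (a, b) means: task b must wait for task a.
def front_dag(n):
    edges = set()
    for p in range(n):
        for j in range(p + 1, n):
            # d2: update of column j wrt panel p needs panel(p) done
            edges.add((("panel", p), ("update_wrt_%d" % p, j)))
            if p > 0:
                # d4: update wrt panel p needs the update wrt panel p-1
                edges.add((("update_wrt_%d" % (p - 1), j),
                           ("update_wrt_%d" % p, j)))
        if p > 0:
            # d3: panel(p) needs column p up-to-date wrt panel p-1
            edges.add((("update_wrt_%d" % (p - 1), p), ("panel", p)))
    return edges

dag = front_dag(3)
```

The inter-front dependencies (d1, d5, d6, d7) would add edges between the DAGs of a child and its parent, gluing the whole tree into one task graph.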
  • 15. Parallelism: a new approach
     Tasks are scheduled dynamically and asynchronously.
  • 16. Results
  • 18-19. Target architecture
     Our target architecture has four hexa-core AMD Istanbul processors connected through HyperTransport links in a ring topology, with a memory module attached to each of them. The bandwidth depends on the number of hops.
  • 20-25. Proximity scheduling
     The scheduling can be guided by a proximity criterion: a task should be executed by the core which is closest to the concerned data. This can be implemented through a system of task queues, one per thread/core:
     ◦ when a thread activates a front, it becomes its owner
     ◦ all the tasks related to a front are pushed on the queue associated with its owner
     ◦ work-stealing is used to feed threads that run out of tasks: a thread first tries to steal tasks from neighbor queues, and then from any other queue
     No front-to-core mapping is done (yet)!
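The owner queues and neighbor-first stealing can be sketched as follows; the `ProximityScheduler` class and the ring-distance metric are illustrative stand-ins for the actual NUMA topology and queue implementation.

```python
# A sketch of the proximity scheduling idea: one deque per core, tasks
# pushed to the queue of the front's owner, and stealing that tries the
# closest queues first. The ring distance stands in for NUMA distance.
from collections import deque

class ProximityScheduler:
    def __init__(self, ncores):
        self.queues = [deque() for _ in range(ncores)]

    def push(self, owner, task):
        self.queues[owner].append(task)

    def pop(self, core):
        # own queue first, then the others by increasing ring distance
        n = len(self.queues)
        order = sorted(range(n),
                       key=lambda q: min((q - core) % n, (core - q) % n))
        for q in order:
            if self.queues[q]:
                return self.queues[q].popleft()
        return None

s = ProximityScheduler(4)
s.push(0, "tA")   # a task owned by core 0
s.push(3, "tB")   # a task owned by core 3
```

Core 0 first drains its own queue, and only then steals "tB" from core 3, its nearest non-empty neighbor on the ring.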
  • 26-29. Proximity scheduling
     Implementing all this requires the ability to:
     ◦ control the placement of threads: we have to bind each thread to a single core and prevent thread migrations. This can be done in a number of ways, e.g., by means of tools such as hwloc, which allows thread pinning
     ◦ control the placement of data: we have to make sure that a front physically resides on a specific NUMA module. This can be done with the first-touch rule (the data is allocated close to the core that makes the first reference), or with hwloc or libnuma, which provide NUMA-aware allocators
     ◦ detect the architecture: we have to figure out the memory/cores layout in order to guide the work stealing. This can be done with hwloc
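The thread-binding requirement can be sketched with the standard library on Linux; the `pin_to_core` helper is hypothetical, and a production code would rather use hwloc for portable pinning and NUMA-aware allocation, as the slide notes.

```python
# A minimal Linux-only sketch of binding the calling thread to a single
# core with os.sched_setaffinity, which prevents migrations away from
# the NUMA node holding the thread's data.
import os

def pin_to_core(core):
    """Bind the calling thread to one core and return the new mask."""
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)

# pin to one of the cores we are currently allowed to run on
core = min(os.sched_getaffinity(0))
mask = pin_to_core(core)
```

Data placement is then obtained for free through the first-touch rule: memory that the pinned thread initializes is allocated on its local NUMA module.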
  • 30-32. Proximity scheduling: experiments
     Experimental results show that proximity scheduling pays off:

         Matrix     Strat.    Time (s)   Dwords on HT (×10⁹)
         Rucci1     no loc.   155        2000
                    loc.      144        1634
         ohne2      no loc.   43         504
                    loc.      39         375
         lp_nug20   no loc.   89         879
                    loc.      84         778

     Why? Fewer Dwords are transferred through the HyperTransport links, as shown by the number of occurrences of the PAPI HYPERTRANSPORT_LINKx:DATA_DWORD_SENT event. But this is not the whole story!
  • 33-35. Target architecture
     What happens when more threads are active at the same time on the system? The bandwidth drops. Why? Memory conflicts!
  • 36-37. Experiments on conflicts
     Should we do something to reduce conflicts? Memory interleaving should provide a more uniform distribution of data that (supposedly) reduces conflicts and increases the memory bandwidth.
  • 38-43. Experiments on conflicts

         Matrix     Strat.    Time (s)   Dwords on HT (×10⁹)   Conflicts on DCT (×10⁹)
         Rucci1     no loc.   155        2000                  23.8
                    loc.      144        1634                  25.4
                    r. r.     117        2306                  19.7
         ohne2      no loc.   43         504                   6.63
                    loc.      39         375                   6.71
                    r. r.     38         665                   4.54
         lp_nug20   no loc.   89         879                   11.4
                    loc.      84         778                   11.9
                    r. r.     66         985                   6.81

     (r. r. = round-robin memory interleaving.) Conflicts are DRAM_ACCESSES_PAGE:DCTx_PAGE_CONFLICT events measured by PAPI. Analogous behavior was found on the HSL MA87 code. The answer is: yes, we should reduce conflicts (work in progress).
  • 44. Conclusions
     ◦ fine granularity, asynchronism and data-flow parallelism are tough (especially all together), but the effort largely pays off
     ◦ a correct exploitation of the memory system is critical, especially on NUMA systems
     ◦ reducing memory transfers gives some performance improvement, but maybe it is not the most important thing: memory conflicts seem to be way more penalizing than data traffic
     How do we solve this? Not clear for the moment. Maybe a careful front-to-memory mapping? Ideas and suggestions are very welcome.
  • 45. ? Thank you all! Questions?