Computation by circuits / Naive sorting networks: BUBBLE-SORT(n): size n(n − 1)/2 = O(n²), depth 2n − 3 = O(n) ...
Computation by circuits / Naive sorting networks: INSERTION-SORT(n): size n(n − 1)/2 = O(n²) ...
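The naive networks can be written down explicitly. A minimal sketch (function names are mine, not from the slides): a comparison network is just a fixed list of comparator pairs, and applying it in order is an oblivious algorithm, since the comparison sequence never depends on the data.

```python
def bubble_sort_network(n):
    """Comparators of BUBBLE-SORT(n): pass k bubbles the maximum of
    the first n-k elements towards position n-k-1."""
    return [(i, i + 1) for k in range(n - 1) for i in range(n - 1 - k)]

def apply_network(comparators, xs):
    """Oblivious execution: the same comparator sequence for any input."""
    xs = list(xs)
    for i, j in comparators:
        if xs[i] > xs[j]:
            xs[i], xs[j] = xs[j], xs[i]
    return xs

net = bubble_sort_network(5)
assert len(net) == 5 * 4 // 2                 # size n(n-1)/2
print(apply_network(net, [3, 1, 4, 1, 5]))    # [1, 1, 3, 4, 5]
```

Executed sequentially this is exactly bubble sort; the network's depth, not its size, is what bounds the parallel running time.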
Computation by circuits / The zero-one principle: a comparison network is sorting, if and only if it sorts all input sequences of 0s and 1s ...
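The principle makes it cheap to verify a small candidate network exhaustively: instead of all n! input orders, it suffices to check the 2ⁿ zero-one inputs. A sketch (helper name `sorts_all_zero_one` is mine):

```python
from itertools import product

def sorts_all_zero_one(comparators, n):
    """Zero-one principle check: the network sorts every input iff it
    sorts all 2^n sequences of 0s and 1s (feasible for small n)."""
    for bits in product((0, 1), repeat=n):
        xs = list(bits)
        for i, j in comparators:
            if xs[i] > xs[j]:
                xs[i], xs[j] = xs[j], xs[i]
        if xs != sorted(xs):
            return False
    return True

# the bubble-sort network on 4 wires passes; dropping its last comparator breaks it
net = [(i, i + 1) for k in range(3) for i in range(3 - k)]
print(sorts_all_zero_one(net, 4))        # True
print(sorts_all_zero_one(net[:-1], 4))   # False
```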
Computation by circuits / The zero-one principle: the zero-one principle applies to sorting, merging and other comparison problems ...
Computation by circuits / Efficient merging and sorting networks: general merging: O(n) comparisons, non-oblivious. How fast can we ...
Computation by circuits / Efficient merging and sorting networks: OEM(n′, n″): size O(n log n) ...
Computation by circuits / Efficient merging and sorting networks: correctness proof of odd-even merging (sketch): by induction an...
Computation by circuits / Efficient merging and sorting networks: sorting an arbitrary input x1, . . . , xn: odd-even merge sorting ...
Computation by circuits / Efficient merging and sorting networks: OEM-SORT(n) ...
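For concreteness, a sketch of Batcher's odd-even merge sort in Python, assuming the input length is a power of two (the recursion mirrors the OEM and OEM-SORT networks, and the comparison pattern is data-independent):

```python
def oe_merge(xs):
    """Odd-even merge: xs is the concatenation of two sorted halves
    (len(xs) a power of two); returns the merged sorted sequence."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    if n == 2:
        return [min(xs), max(xs)]
    evens = oe_merge(xs[0::2])     # recursively merge even-indexed subsequences
    odds = oe_merge(xs[1::2])      # ... and odd-indexed subsequences
    xs = [None] * n
    xs[0::2], xs[1::2] = evens, odds
    for i in range(1, n - 1, 2):   # final round of adjacent comparators
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

def oem_sort(xs):
    """OEM-SORT: sort halves recursively, then odd-even merge them."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    return oe_merge(oem_sort(xs[:n // 2]) + oem_sort(xs[n // 2:]))

print(oem_sort([5, 3, 8, 1, 9, 2, 7, 4]))   # [1, 2, 3, 4, 5, 7, 8, 9]
```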
Computation by circuits / Efficient merging and sorting networks: a bitonic sequence: x1 ≥ · · · ≥ xm ≤ · · · ≤ xn. Bitonic merging ...
Computation by circuits / Efficient merging and sorting networks: BM(n): size O(n log n) ...
Computation by circuits / Efficient merging and sorting networks: bitonic merge sorting ...
Computation by circuits / Efficient merging and sorting networks: BM-SORT(n) ...
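Bitonic merge sorting can be sketched the same way (again assuming power-of-two length, names mine): sort the two halves in opposite directions, which yields a bitonic sequence, then merge it with the half-cleaner recursion of BM(n).

```python
def bitonic_merge(xs, ascending=True):
    """Sort a bitonic sequence (len a power of two) using the
    half-cleaner recursion of the bitonic merging network BM(n)."""
    n = len(xs)
    if n == 1:
        return list(xs)
    xs = list(xs)
    for i in range(n // 2):        # one round of comparators at distance n/2
        if (xs[i] > xs[i + n // 2]) == ascending:
            xs[i], xs[i + n // 2] = xs[i + n // 2], xs[i]
    return (bitonic_merge(xs[:n // 2], ascending) +
            bitonic_merge(xs[n // 2:], ascending))

def bitonic_sort(xs, ascending=True):
    """BM-SORT: sort the halves in opposite directions, producing a
    bitonic sequence, then apply the bitonic merger."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    first = bitonic_sort(xs[:n // 2], True)
    second = bitonic_sort(xs[n // 2:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([5, 3, 8, 1, 9, 2, 7, 4]))   # [1, 2, 3, 4, 5, 7, 8, 9]
```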
Computation by circuits / Efficient merging and sorting networks: both OEM-SORT and BM-SORT have size Θ(n (log n)²). Is it possible t...
1 Computation by circuits / 2 Parallel computation models / 3 Basic parallel algorithms / 4 Further parallel algorithms / 5 Parallel matrix algorithms / 6 Parallel graph algorithms
Parallel computation models / The PRAM model: Parallel Random Access Machine (PRAM) [Fortune, Wyllie: 1978] ...
Parallel computation models / The PRAM model: PRAM computation: sequence of parallel steps. Communication and synchronisation tak...
Parallel computation models / The BSP model: Bulk-Synchronous Parallel (BSP) computer [Valiant: 1990] ...
Parallel computation models / The BSP model: some elements of a BSP computer can be emulated by others, e.g. external memory ...
Parallel computation models / The BSP model: BSP computation: sequence of parallel supersteps [diagram: processors 0, 1, . . . , p − 1] ...
Parallel computation models / The BSP model: compositional cost model. For individual processor proc in superstep sstep: comp...
Parallel computation models / The BSP model: for the whole BSP computation with sync supersteps: comp = Σ over 0 ≤ sstep < sync of ...
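Under the standard BSP cost model the total cost accumulates comp + g·comm + l per superstep. A toy cost calculator (the data layout and the h = max(sent, received) convention are my illustrative assumptions, not taken from the slides):

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost: for each superstep take comp = max local work
    over processors, h = the superstep's h-relation, and add one sync.
    `supersteps` lists, per superstep, a (work, sent, received) triple
    for every processor."""
    total = 0
    for step in supersteps:
        comp = max(w for w, _, _ in step)
        h = max(max(s, r) for _, s, r in step)   # h-relation of the superstep
        total += comp + g * h + l
    return total

# two supersteps on p = 4 processors, BSP parameters g = 2, l = 50
steps = [
    [(100, 3, 0), (80, 0, 1), (90, 0, 1), (70, 0, 1)],   # broadcast-like step
    [(60, 1, 1)] * 4,                                     # a 1-relation
]
print(bsp_cost(steps, 2, 50))   # (100 + 2*3 + 50) + (60 + 2*1 + 50) = 268
```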
Parallel computation models / The BSP model: BSP computation: scalable, portable, predictable. BSP algorithm design: minimising c...
Parallel computation models / Network routing: BSP network model: complete graph, uniformly accessible (access efficiency describe...
Parallel computation models / Network routing: 2D array network [diagram]
Parallel computation models / Network routing: butterfly network [diagram]
Parallel computation models / Network routing:
Network    Degree    Diameter
1D array   2         1/2 · p
2D array   ...       ...
Parallel computation models / Network routing: h-relation (h-superstep): every processor sends and receives ≤ h packets. Sufficient...
Parallel computation models / Network routing: routing based on sorting networks: each processor corresponds to a wire, each link c...
Parallel computation models / Network routing: two-phase randomised routing [Valiant: 1980] ...
Parallel computation models / Network routing: BSP implementation: processes placed at random, communication delayed until end of superstep ...
1 Computation by circuits / 2 Parallel computation models / 3 Basic parallel algorithms / 4 Further parallel algorithms / 5 Parallel matrix algorithms / 6 Parallel graph algorithms
Basic parallel algorithms / Broadcast/combine: the broadcasting problem: initially, one designated processor holds a value...
Basic parallel algorithms / Broadcast/combine: direct broadcast: designated processor makes p − 1 copies of a and sends them ...
Basic parallel algorithms / Broadcast/combine: binary tree broadcast: initially, only designated processor is awake; p...
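A sketch of the resulting communication schedule, using the doubling trick as my illustrative implementation: every awake processor wakes one more per round, so the awake set doubles and ceil(log₂ p) supersteps suffice, each a 1-relation.

```python
def tree_broadcast_schedule(p):
    """Rounds of a binary tree broadcast from processor 0: in each
    round, every awake processor src sends to src + (current number
    of awake processors), doubling the awake set."""
    rounds, awake = [], {0}
    while len(awake) < p:
        width = len(awake)
        sends = [(src, src + width) for src in sorted(awake) if src + width < p]
        awake |= {dst for _, dst in sends}
        rounds.append(sends)
    return rounds

for r, sends in enumerate(tree_broadcast_schedule(8)):
    print(r, sends)
# 0 [(0, 1)]
# 1 [(0, 2), (1, 3)]
# 2 [(0, 4), (1, 5), (2, 6), (3, 7)]
```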
Basic parallel algorithms / Broadcast/combine: the array broadcasting/combining problem: broadcast/combine an array of size n ≥ ...
Basic parallel algorithms / Broadcast/combine: two-phase array broadcast: partition array into p blocks of size n/p; s...
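A simulation sketch of the two phases (scatter the blocks, then total exchange so every processor assembles the full array); function and variable names are mine:

```python
def two_phase_broadcast(array, p):
    """Simulate two-phase array broadcast: phase 1 scatters p blocks
    of size n/p from the root; phase 2 is a total exchange in which
    every processor sends its block to all others."""
    n = len(array)
    assert n % p == 0, "assume p divides n for simplicity"
    b = n // p
    blocks = [array[i * b:(i + 1) * b] for i in range(p)]   # phase 1: scatter
    # phase 2: every processor receives all p blocks and concatenates them
    return [sum((blk for blk in blocks), []) for _ in range(p)]

copies = two_phase_broadcast(list(range(12)), 4)
assert all(c == list(range(12)) for c in copies)
```

The point of the two phases: each processor now sends and receives O(n) packets per phase, whereas a direct broadcast makes the root send (p − 1) · n packets in a single superstep.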
Basic parallel algorithms / Balanced tree and prefix sums: the balanced binary tree dag: a generalisation of broadcasting/combining ...
Basic parallel algorithms / Balanced tree and prefix sums: parallel balanced tree computation. From now on, we always assume that ...
Basic parallel algorithms / Balanced tree and prefix sums: parallel balanced tree computation (contd.): a designated processor ...
Basic parallel algorithms / Balanced tree and prefix sums: the described parallel balanced tree algorithm is fully optimal: ...
Basic parallel algorithms / Balanced tree and prefix sums: let • be an associative operator, computable in time O(1): a • (b • c) = (a • b) • c ...
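Associativity is exactly what lets all prefixes a₁, a₁ • a₂, ..., a₁ • ⋯ • aₙ be computed by a balanced-tree circuit. A recursive sketch (my own formulation of the standard scheme) using O(n) applications of • arranged in O(log n) levels:

```python
def prefix(xs, op):
    """All prefixes of a nonempty sequence under an associative op:
    pair up neighbours (the "up" sweep), recurse on the pair results,
    then fill in the odd positions (the "down" sweep)."""
    n = len(xs)
    if n == 1:
        return list(xs)
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    sub = prefix(pairs, op)          # prefixes of the pairwise combinations
    out = [xs[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])                  # already a full prefix
        else:
            out.append(op(sub[i // 2 - 1], xs[i]))   # one extra combination
    return out

print(prefix([1, 2, 3, 4, 5], lambda a, b: a + b))   # [1, 3, 6, 10, 15]
```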
Basic parallel algorithms / Balanced tree and prefix sums: the prefix circuit [diagram: prefix(n) on inputs a0, a1, . . .] ...
Basic parallel algorithms / Balanced tree and prefix sums: parallel prefix computation. The dag prefix(n) consists of a dag sim...
Basic parallel algorithms / Balanced tree and prefix sums: application: binary addition via Boolean logic: x + y = z. Let x = x_{n−1}, ...
Basic parallel algorithms / Balanced tree and prefix sums: x + y = z. Let u_i = x_i ∧ y_i, v_i = x_i ⊕ y_i, 0 ≤ i < n. Arrays u = u...
Basic parallel algorithms / Balanced tree and prefix sums: we have c_i = u_i ∨ (v_i ∧ c_{i−1}). Let F_{u,v}(c) = u ∨ (v ∧ c); c_i ...
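The composition of two such functions is again of the same form, (u₂, v₂) ∘ (u₁, v₁) = (u₂ ∨ (v₂ ∧ u₁), v₂ ∧ v₁), and composition is associative, so the carries are a prefix computation over (u, v) pairs. A bit-level sketch (little-endian bit lists; the sequential loop below stands in for any prefix circuit, which would compute the same pairs in O(log n) depth):

```python
def add_binary(x, y):
    """Carry-lookahead addition of two equal-length little-endian bit
    lists, via prefix composition of the functions F_{u,v}(c) = u | (v & c).
    Returns n + 1 result bits."""
    n = len(x)
    u = [xi & yi for xi, yi in zip(x, y)]   # carry generate
    v = [xi ^ yi for xi, yi in zip(x, y)]   # carry propagate
    carries = []
    cu, cv = 0, 1                            # identity function F(c) = c
    for ui, vi in zip(u, v):
        # compose F_{u_i,v_i} with the prefix composition so far
        cu, cv = ui | (vi & cu), vi & cv
        carries.append(cu)                   # c_i = (prefix F)(0)
    z = [v[0]] + [v[i] ^ carries[i - 1] for i in range(1, n)]
    z.append(carries[-1])                    # final carry-out
    return z

print(add_binary([1, 0, 1], [1, 1, 0]))   # 5 + 3 = 8 -> [0, 0, 0, 1]
```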
Basic parallel algorithms / Fast Fourier Transform and the butterfly dag: a complex number ω is called a primitive root of unity...
Basic parallel algorithms / Fast Fourier Transform and the butterfly dag: the Fast Fourier Transform (FFT) algorithm (“four-step” version) ...
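The slides' details are truncated here; for orientation, a textbook radix-2 FFT (not necessarily the "four-step" variant the slide names) whose data flow is exactly the butterfly dag:

```python
import cmath

def fft(a):
    """Radix-2 FFT sketch: evaluate the polynomial with coefficient
    list a at the n-th roots of unity (n a power of two) in O(n log n)
    operations; the recursion's data flow is the butterfly dag."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # power of a primitive root
        out[k] = even[k] + w * odd[k]           # butterfly
        out[k + n // 2] = even[k] - w * odd[k]
    return out

print(fft([1, 1, 1, 1]))   # ~[(4+0j), 0j, 0j, 0j]
```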
20121021 bspapproach tiskin
  1. Efficient Parallel Algorithms. Alexander Tiskin, Department of Computer Science, University of Warwick. http://go.warwick.ac.uk/alextiskin — Alexander Tiskin (Warwick), Efficient Parallel Algorithms, 1 / 174
  2. 1 Computation by circuits / 2 Parallel computation models / 3 Basic parallel algorithms / 4 Further parallel algorithms / 5 Parallel matrix algorithms / 6 Parallel graph algorithms
  7. Computation by circuits / Computation models and algorithms.
     Model: abstraction of reality allowing qualitative and quantitative reasoning.
     E.g. atom, galaxy, biological cell, Newton’s universe, Einstein’s universe. . .
     Computation model: abstract computing device to reason about computations and algorithms.
     E.g. scales+weights, Turing machine, von Neumann machine (“ordinary computer”), JVM, quantum computer. . .
     An algorithm in a specific model: input → (computation steps) → output. Input/output encoding must be specified.
     Algorithm complexity (worst-case): T(n) = max over inputs of size n of the number of computation steps.
  11. Computation by circuits / Computation models and algorithms.
      Algorithm complexity depends on the model.
      E.g. sorting n items: Ω(n log n) in the comparison model; O(n) in the arithmetic model (by radix sort).
      E.g. factoring large numbers: hard in a von Neumann-type (standard) model; not so hard on a quantum computer.
      E.g. deciding if a program halts on a given input: impossible in a standard (or even quantum) model; can be added to the standard model as an oracle, to create a more powerful model.
  13. Computation by circuits / The circuit model.
      Basic special-purpose parallel model: a circuit. [diagram: example circuits computing a² + 2ab + b² and a² − b² from a, b; and x², 2xy, y², x + y + z, x − y from x, y, z]
      Directed acyclic graph (dag). Fixed number of inputs/outputs.
      Oblivious computation: control sequence independent of the input.
  16. Computation by circuits / The circuit model.
      Bounded or unbounded fan-in/fan-out.
      Elementary operations: arithmetic/Boolean/comparison, each (usually) constant time.
      size = number of nodes. depth = max path length from input to output.
      Timed circuits with feedback: systolic arrays.
  20. 20. Computation by circuitsThe comparison network modelA comparison network is a circuit of comparator nodes x y x y = min denotes = max x y x y x y x yThe input and output sequences have the same lengthExamples: n=4 n=4 size 5 size 6 depth 3 depth 3 Alexander Tiskin (Warwick) Efficient Parallel Algorithms 8 / 174
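A comparison network can be modelled as a list of wire pairs applied in order; the following is a minimal Python sketch (the function and network names are illustrative, not from the course). The five-comparator network below is the size-5, depth-3 example for n = 4:

```python
def apply_network(net, xs):
    """Apply a comparison network, given as a list of (i, j) wire pairs
    with i < j: each comparator puts min on wire i and max on wire j."""
    xs = list(xs)
    for i, j in net:
        if xs[i] > xs[j]:
            xs[i], xs[j] = xs[j], xs[i]
    return xs

# A sorting network for n = 4: size 5, depth 3
# (comparators (0,1),(2,3) form stage 1, then (0,2),(1,3), then (1,2))
NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

print(apply_network(NET4, [3, 1, 4, 2]))  # → [1, 2, 3, 4]
```

Size is the number of comparators (5); depth is 3, since comparators at the same stage touch disjoint wires and can fire in parallel.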
A merging network is a comparison network that takes two sorted input sequences of lengths n′ and n″, and produces a sorted output sequence of length n = n′ + n″.
A sorting network is a comparison network that takes an arbitrary input sequence, and produces a sorted output sequence.
A sorting (or merging) network is equivalent to an oblivious sorting (or merging) algorithm; the network's size/depth determine the algorithm's sequential/parallel complexity.
General merging: O(n) comparisons, non-oblivious.
General sorting: O(n log n) comparisons by mergesort, non-oblivious.
What is the complexity of oblivious sorting?
Computation by circuits / Naive sorting networks

BUBBLE-SORT(n)
- size n(n − 1)/2 = O(n²)
- depth 2n − 3 = O(n)
[Figure: BUBBLE-SORT(n) as one comparator pass followed by BUBBLE-SORT(n − 1)]
Example: BUBBLE-SORT(8) has size 28 and depth 13.
INSERTION-SORT(n)
- size n(n − 1)/2 = O(n²)
- depth 2n − 3 = O(n)
[Figure: INSERTION-SORT(n) as INSERTION-SORT(n − 1) followed by one insertion pass]
Example: INSERTION-SORT(8) has size 28 and depth 13.
As a network, identical to BUBBLE-SORT!
Computation by circuits / The zero-one principle

Zero-one principle: a comparison network is a sorting network, if and only if it sorts all input sequences of 0s and 1s.

Proof. "Only if": trivial. "If": by contradiction.
Assume a given network does not sort some input x = x_1, …, x_n:
  x_1, …, x_n → y_1, …, y_n, with y_k > y_l for some k < l
Let X_i = 0 if x_i < y_k, and X_i = 1 if x_i ≥ y_k; run the network on input X = X_1, …, X_n.
For all i, j we have x_i ≤ x_j ⟹ X_i ≤ X_j (the map is monotone), therefore each X_i follows the same path through the network as x_i:
  X_1, …, X_n → Y_1, …, Y_n, with Y_k = 1 > 0 = Y_l
We have k < l but Y_k > Y_l, so the network does not sort 0s and 1s. ∎
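The zero-one principle yields a practical correctness test: a network on n wires sorts everything if and only if it sorts all 2^n zero-one inputs. A minimal sketch (helper names are illustrative; the compare-exchange helper is repeated so the sketch is self-contained):

```python
from itertools import product

def apply_network(net, xs):
    """Apply a comparison network given as (i, j) wire pairs, i < j."""
    xs = list(xs)
    for i, j in net:
        if xs[i] > xs[j]:
            xs[i], xs[j] = xs[j], xs[i]
    return xs

def is_sorting_network(net, n):
    # By the zero-one principle, checking the 2**n zero-one inputs suffices
    return all(apply_network(net, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
print(is_sorting_network(NET4, 4))       # → True
print(is_sorting_network(NET4[:-1], 4))  # → False: the last comparator is essential
```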
The zero-one principle applies to sorting, merging and other comparison problems (e.g. selection).
It allows one to test:
- a sorting network by checking only 2^n input sequences, instead of a much larger number n! ≈ (n/e)^n
- a merging network by checking only (n′ + 1) · (n″ + 1) pairs of input sequences, instead of an exponentially larger number of input pairs
Computation by circuits / Efficient merging and sorting networks

General merging: O(n) comparisons, non-oblivious. How fast can we merge obliviously?
  x_1 ≤ … ≤ x_{n′}, y_1 ≤ … ≤ y_{n″} → z_1 ≤ … ≤ z_n

Odd-even merging
When n′ = n″ = 1, compare (x_1, y_1); otherwise
- merge x_1, x_3, … and y_1, y_3, … → u_1 ≤ u_2 ≤ … ≤ u_{⌈n′/2⌉+⌈n″/2⌉}
- merge x_2, x_4, … and y_2, y_4, … → v_1 ≤ v_2 ≤ … ≤ v_{⌊n′/2⌋+⌊n″/2⌋}
- compare pairwise: (u_2, v_1), (u_3, v_2), …

size_OEM(n′, n″) ≤ 2 · size_OEM(n′/2, n″/2) + O(n) = O(n log n)
depth_OEM(n′, n″) ≤ depth_OEM(n′/2, n″/2) + 1 = O(log n)
OEM(n′, n″): size O(n log n), depth O(log n); assume n′ ≤ n″.
[Figure: OEM(n′, n″) built from OEM(⌈n′/2⌉, ⌈n″/2⌉) and OEM(⌊n′/2⌋, ⌊n″/2⌋), followed by a stage of pairwise comparators]
Example: OEM(4, 4) has size 9 and depth 3.
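The odd-even merging recursion can be sketched directly in Python (function names are illustrative; the network structure is implicit in the compare-exchanges):

```python
def odd_even_merge(xs, ys):
    """Merge two sorted sequences by Batcher's odd-even scheme."""
    if not xs:
        return list(ys)
    if not ys:
        return list(xs)
    if len(xs) == 1 and len(ys) == 1:
        return [min(xs[0], ys[0]), max(xs[0], ys[0])]
    u = odd_even_merge(xs[::2], ys[::2])    # odd-indexed elements x1,x3,... / y1,y3,...
    v = odd_even_merge(xs[1::2], ys[1::2])  # even-indexed elements x2,x4,... / y2,y4,...
    # interleave u and v: u1, v1, u2, v2, ...
    z = []
    for k in range(max(len(u), len(v))):
        if k < len(u): z.append(u[k])
        if k < len(v): z.append(v[k])
    # final stage: compare pairwise (u2,v1), (u3,v2), ...
    for k in range(1, len(z) - 1, 2):
        if z[k] > z[k + 1]:
            z[k], z[k + 1] = z[k + 1], z[k]
    return z

print(odd_even_merge([1, 3, 5], [2, 4]))  # → [1, 2, 3, 4, 5]
```

By the zero-one principle, checking the sketch on all sorted zero-one input pairs is enough to confirm the recursion.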
Correctness proof of odd-even merging (sketch): by induction and the zero-one principle.
Induction base: trivial (2 inputs, 1 comparator).
Inductive step: by the inductive hypothesis, we have for all k, l:
  0^⌈k/2⌉ 1 1 …, 0^⌈l/2⌉ 1 1 … → 0^(⌈k/2⌉+⌈l/2⌉) 1 1 …
  0^⌊k/2⌋ 1 1 …, 0^⌊l/2⌋ 1 1 … → 0^(⌊k/2⌋+⌊l/2⌋) 1 1 …
We need: 0^k 1 1 …, 0^l 1 1 … → 0^(k+l) 1 1 …
Since (⌈k/2⌉ + ⌈l/2⌉) − (⌊k/2⌋ + ⌊l/2⌋) is
- 0 or 1: the result is already sorted: 0^(k+l) 1 1 …
- 2: a single pair is wrong: 0^(k+l−1) 1 0 1 1 …
The final stage of comparators corrects the wrong pair. ∎
Sorting an arbitrary input x_1, …, x_n
Odd-even merge sorting [Batcher: 1968]
When n = 1 we are done; otherwise
- sort x_1, …, x_⌈n/2⌉ recursively
- sort x_(⌈n/2⌉+1), …, x_n recursively
- merge the results by OEM(⌈n/2⌉, ⌊n/2⌋)

size_OEM-SORT(n) ≤ 2 · size_OEM-SORT(n/2) + size_OEM(n) = 2 · size_OEM-SORT(n/2) + O(n log n) = O(n (log n)²)
depth_OEM-SORT(n) ≤ depth_OEM-SORT(n/2) + depth_OEM(n) = depth_OEM-SORT(n/2) + O(log n) = O((log n)²)
OEM-SORT(n): size O(n (log n)²), depth O((log n)²)
[Figure: OEM-SORT(n) built from OEM-SORT(⌈n/2⌉) and OEM-SORT(⌊n/2⌋), followed by OEM(⌈n/2⌉, ⌊n/2⌋)]
Example: OEM-SORT(8) has size 19 and depth 6.
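The exact sizes and depths quoted on these slides can be checked against the recurrences; a small sketch (function names are illustrative; the final-stage comparator count ⌊(n−1)/2⌋ follows from the pairwise compares (u_2, v_1), (u_3, v_2), …):

```python
def size_oem(a, b):
    if a == 0 or b == 0: return 0
    if a == 1 and b == 1: return 1
    return (size_oem((a + 1) // 2, (b + 1) // 2)  # merge of odd-indexed elements
            + size_oem(a // 2, b // 2)            # merge of even-indexed elements
            + (a + b - 1) // 2)                   # final compare-exchange stage

def depth_oem(a, b):
    if a == 0 or b == 0: return 0
    if a == 1 and b == 1: return 1
    return max(depth_oem((a + 1) // 2, (b + 1) // 2),
               depth_oem(a // 2, b // 2)) + 1     # sub-merges in parallel, +1 final stage

def size_oem_sort(n):
    if n <= 1: return 0
    lo, hi = n // 2, (n + 1) // 2
    return size_oem_sort(hi) + size_oem_sort(lo) + size_oem(hi, lo)

def depth_oem_sort(n):
    if n <= 1: return 0
    lo, hi = n // 2, (n + 1) // 2
    return max(depth_oem_sort(hi), depth_oem_sort(lo)) + depth_oem(hi, lo)

print(size_oem(4, 4), depth_oem(4, 4))      # → 9 3, as on the OEM(4,4) slide
print(size_oem_sort(8), depth_oem_sort(8))  # → 19 6, as on the OEM-SORT(8) slide
```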
A bitonic sequence: x_1 ≥ … ≥ x_m ≤ … ≤ x_n
Bitonic merging: sorting a bitonic sequence
When n = 1 we are done; otherwise
- sort bitonic x_1, x_3, … recursively
- sort bitonic x_2, x_4, … recursively
- compare pairwise: (x_1, x_2), (x_3, x_4), …
Correctness proof: by the zero-one principle (exercise).
(Note: ≥ and ≤ cannot be exchanged in the definition of bitonic!)
Bitonic merging is more flexible than odd-even merging, since a single circuit applies to all values of m.
size_BM(n) = O(n log n), depth_BM(n) = O(log n)
BM(n): size O(n log n), depth O(log n)
[Figure: BM(n) built from BM(⌈n/2⌉) and BM(⌊n/2⌋), followed by a stage of pairwise comparators]
Example: BM(8) has size 12 and depth 3.
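The bitonic merging recursion admits the same kind of sketch (illustrative names; its correctness over all bitonic inputs is exactly the zero-one exercise from the slide):

```python
def bitonic_merge(xs):
    """Sort a bitonic sequence x1 >= ... >= xm <= ... <= xn."""
    if len(xs) <= 1:
        return list(xs)
    u = bitonic_merge(xs[::2])   # odd-indexed elements: also bitonic
    v = bitonic_merge(xs[1::2])  # even-indexed elements: also bitonic
    # interleave back, then compare pairwise (x1,x2), (x3,x4), ...
    z = []
    for k in range(max(len(u), len(v))):
        if k < len(u): z.append(u[k])
        if k < len(v): z.append(v[k])
    for k in range(0, len(z) - 1, 2):
        if z[k] > z[k + 1]:
            z[k], z[k + 1] = z[k + 1], z[k]
    return z

print(bitonic_merge([5, 3, 1, 2, 4]))  # → [1, 2, 3, 4, 5]
```

In zero-one terms a bitonic input has the form 1^a 0^b 1^c, so the sketch can be checked exhaustively on those.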
Bitonic merge sorting [Batcher: 1968]
When n = 1 we are done; otherwise
- sort x_1, …, x_⌈n/2⌉ in reverse, recursively → y_1 ≥ … ≥ y_⌈n/2⌉
- sort x_(⌈n/2⌉+1), …, x_n recursively → y_(⌈n/2⌉+1) ≤ … ≤ y_n
- sort the bitonic sequence y_1 ≥ … ≥ y_m ≤ … ≤ y_n, where m = ⌈n/2⌉ or ⌈n/2⌉ + 1
Sorting in reverse seems to require "inverted comparators"; however
- comparators are actually nodes in a circuit, which can always be drawn using "standard comparators"
- a network drawn with "inverted comparators" can be converted into one with only "standard comparators" by a top-down rearrangement
size_BM-SORT(n) = O(n (log n)²), depth_BM-SORT(n) = O((log n)²)
BM-SORT(n): size O(n (log n)²), depth O((log n)²)
[Figure: BM-SORT(n) built from a reversed BM-SORT(⌈n/2⌉) and a BM-SORT(⌊n/2⌋), followed by BM(n)]
Example: BM-SORT(8) has size 24 and depth 6.
Both OEM-SORT and BM-SORT have size Θ(n (log n)²).
Is it possible to sort obliviously in size o(n (log n)²)? In O(n log n)?
AKS sorting [Ajtai, Komlós, Szemerédi: 1983]
A sorting network of size O(n log n) and depth O(log n).
Uses sophisticated graph theory (expanders).
Asymptotically optimal, but with huge constant factors.
Outline:
1. Computation by circuits
2. Parallel computation models
3. Basic parallel algorithms
4. Further parallel algorithms
5. Parallel matrix algorithms
6. Parallel graph algorithms
Parallel computation models / The PRAM model

Parallel Random Access Machine (PRAM) [Fortune, Wyllie: 1978]
A simple, idealised general-purpose parallel model.
[Figure: processors P0, P1, P2, … connected to a shared MEMORY]
Contains an unlimited number of
- processors (1 time unit per operation)
- global shared memory cells (1 time unit per access)
Operates in full synchrony.
PRAM computation: a sequence of parallel steps.
Communication and synchronisation are taken for granted. Not scalable in practice!
PRAM variants, by memory access discipline:
- concurrent/exclusive read
- concurrent/exclusive write
giving the CRCW, CREW, EREW, (ERCW) PRAM.
E.g. a linear system solver: O((log n)²) steps using n⁴ processors :-O
PRAM algorithm design: minimising the number of steps, sometimes also the number of processors.
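As a toy illustration of PRAM-style step counting (not an example from the course), summing n values takes ⌈log₂ n⌉ parallel steps using n/2 processors; a sequential simulation:

```python
def pram_sum(values):
    """Simulate an EREW PRAM tree reduction: in each parallel step,
    processor i adds a[i + stride] into a[i]; on a real PRAM all
    additions within one step would run simultaneously."""
    a = list(values)
    n, stride, steps = len(a), 1, 0
    while stride < n:
        for i in range(0, n - stride, 2 * stride):  # one parallel step
            a[i] += a[i + stride]
        stride *= 2
        steps += 1
    return a[0], steps

total, steps = pram_sum(range(8))
print(total, steps)  # → 28 3, i.e. log2(8) = 3 parallel steps
```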
Parallel computation models / The BSP model

Bulk-Synchronous Parallel (BSP) computer [Valiant: 1990]
A simple, realistic general-purpose parallel model.
[Figure: processors P0, …, P(p−1), each with local memory M, connected by a communication environment with parameters (g, l)]
Contains
- p processors, each with local memory (1 time unit per operation)
- a communication environment, including a network and an external memory (g time units per data unit communicated)
- a barrier synchronisation mechanism (l time units per synchronisation)
Some elements of a BSP computer can be emulated by others, e.g.
- the external memory by local memory + communication
- the barrier synchronisation mechanism by the network
Parameter g corresponds to the network's communication gap (inverse bandwidth): the time for a data unit to enter/exit the network.
Parameter l corresponds to the network's latency: the worst-case time for a data unit to get across the network.
Every parallel computer can be (approximately) described by the parameters p, g, l.
E.g. for a Cray T3E: p = 64, g ≈ 78, l ≈ 1825.
[Figure and table: measured and least-squares-fitted time of an h-relation on a 64-processor Cray T3E, from which the benchmarked BSP parameters g and l are obtained]
BSP computation: a sequence of parallel supersteps.
[Figure: processors 0, …, p−1 alternating local computation/communication phases, separated by barrier synchronisations]
Asynchronous computation/communication within supersteps (includes data exchange with the external memory).
Synchronisation before/after each superstep.
Cf. CSP: a parallel collection of sequential processes.
Compositional cost model
For an individual processor proc in superstep sstep:
- comp(sstep, proc): the amount of local computation and local memory operations by processor proc in superstep sstep
- comm(sstep, proc): the amount of data sent and received by processor proc in superstep sstep
For the whole BSP computer in one superstep sstep:
- comp(sstep) = max_{0 ≤ proc < p} comp(sstep, proc)
- comm(sstep) = max_{0 ≤ proc < p} comm(sstep, proc)
- cost(sstep) = comp(sstep) + comm(sstep) · g + l
For the whole BSP computation with sync supersteps:
- comp = Σ_{0 ≤ sstep < sync} comp(sstep)
- comm = Σ_{0 ≤ sstep < sync} comm(sstep)
- cost = Σ_{0 ≤ sstep < sync} cost(sstep) = comp + comm · g + sync · l
The input/output data are stored in the external memory; the cost of input/output is included in comm.
E.g. for a particular linear system solver with an n × n matrix:
  comp = O(n³/p), comm = O(n²/p^(1/2)), sync = O(p^(1/2))
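The cost formulas translate directly into a few lines of Python; a sketch with made-up per-superstep figures (g, l and the work amounts are illustrative):

```python
def bsp_cost(supersteps, g, l):
    """supersteps: list of (comp_by_proc, comm_by_proc) pairs, where each
    list gives the per-processor work/communication in that superstep."""
    comp = sum(max(c) for c, _ in supersteps)  # comp(sstep) = max over processors
    comm = sum(max(m) for _, m in supersteps)  # comm(sstep) = max over processors
    sync = len(supersteps)
    return comp + comm * g + sync * l

# Two supersteps on p = 2 processors, with g = 2 and l = 10:
ssteps = [([5, 3], [2, 4]),   # comp(sstep 0) = 5, comm(sstep 0) = 4
          ([1, 6], [3, 1])]   # comp(sstep 1) = 6, comm(sstep 1) = 3
print(bsp_cost(ssteps, g=2, l=10))  # → (5+6) + (4+3)*2 + 2*10 = 45
```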
BSP computation: scalable, portable, predictable.
BSP algorithm design: minimising comp, comm, sync.
Main principles:
- load balancing minimises comp
- data locality minimises comm
- coarse granularity minimises sync
Data locality is exploited, network locality is ignored!
Typically, problem size n ≫ p (slackness).
Parallel computation models / Network routing

BSP network model: a complete graph, uniformly accessible (access efficiency described by the parameters g, l).
It has to be implemented on concrete networks.
Parameters of a network topology (i.e. the underlying graph):
- degree: the number of links per node
- diameter: the maximum distance between nodes
Low degree: easier to implement. Low diameter: more efficient.
2D array network
[Figure: q × q grid of processors]
p = q² processors, degree 4, diameter p^(1/2) = q
3D array network
[Figure: q × q × q grid of processors]
p = q³ processors, degree 6, diameter 3/2 · p^(1/3) = 3/2 · q
Butterfly network
[Figure: butterfly network on q log q nodes]
p = q log q processors, degree 4, diameter ≈ log p ≈ log q
Hypercube network
[Figure: q-dimensional hypercube]
p = 2^q processors, degree log p = q, diameter log p = q
Network     Degree   Diameter
1D array    2        1/2 · p
2D array    4        p^(1/2)
3D array    6        3/2 · p^(1/3)
Butterfly   4        log p
Hypercube   log p    log p
…           …        …

BSP parameters g, l depend on the degree, the diameter and the routing strategy.
Assume store-and-forward routing (alternative: wormhole).
Assume distributed routing: no global control.
Oblivious routing: the path is determined only by the source and destination.
E.g. greedy routing: a packet always takes a shortest path.
h-relation (h-superstep): every processor sends and receives ≤ h packets.
It is sufficient to consider permutations (1-relations): once we can route any permutation in k steps, we can route any h-relation in hk steps.
Any routing method may be forced to make Ω(diameter) steps.
Any oblivious routing method may be forced to make Ω(p^(1/2)/degree) steps.
Many practical patterns force such "hot spots" on traditional networks.
Routing based on sorting networks:
- each processor corresponds to a wire
- each link corresponds to (possibly several) comparators
- routing corresponds to sorting by destination address
- each stage of routing corresponds to a stage of sorting
Such routing is non-oblivious (for individual packets)!

Network               Degree        Diameter
OEM-SORT/BM-SORT      O((log p)²)   O((log p)²)
AKS                   O(log p)      O(log p)

No "hot spots": a permutation can always be routed in O(diameter) steps.
Requires a specialised network; too messy and impractical.
Parallel computation models / Network routing

Two-phase randomised routing [Valiant: 1980]:
  send every packet to a random intermediate destination
  forward every packet to its final destination

Both phases are oblivious (e.g. greedy), but the method is non-oblivious overall due to randomness

Hot spots very unlikely: on a 2D array, butterfly, hypercube, a permutation can be routed in O(diameter) steps with high probability

On a hypercube, the same holds even for a log p-relation
Hence constant g, l in the BSP model
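As an illustration (not from the slides), here is a minimal Python sketch of the two-phase scheme on a hypercube, where each phase uses greedy oblivious "bit-fixing" routing; the function names are made up for this sketch.

```python
import random

def bit_fixing_path(src, dst, d):
    """Greedy (oblivious) route on a d-dimensional hypercube:
    correct the differing address bits from lowest to highest."""
    path, cur = [src], src
    for i in range(d):
        if (cur ^ dst) >> i & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def two_phase_route(src, dst, d, rng=random):
    """Valiant's trick: route obliviously to a random intermediate
    node, then obliviously from there to the true destination."""
    mid = rng.randrange(1 << d)
    return bit_fixing_path(src, mid, d) + bit_fixing_path(mid, dst, d)[1:]

# Every packet of a random permutation still reaches its destination
d = 4
perm = list(range(1 << d))
random.shuffle(perm)
for s, t in enumerate(perm):
    p = two_phase_route(s, t, d)
    assert p[0] == s and p[-1] == t
```

The random intermediate destination is what destroys the adversarial "hot spot" patterns that a purely oblivious method can be forced into.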
Parallel computation models / Network routing

BSP implementation: processes placed at random, communication delayed until the end of the superstep
All packets with the same source and destination are sent together, hence the message overhead is absorbed in l

Network     g           l
1D array    O(p)        O(p)
2D array    O(p^{1/2})  O(p^{1/2})
3D array    O(p^{1/3})  O(p^{1/3})
Butterfly   O(log p)    O(log p)
Hypercube   O(1)        O(log p)
···         ···         ···

Actual values of g, l are obtained by running benchmarks
1 Computation by circuits
2 Parallel computation models
3 Basic parallel algorithms
4 Further parallel algorithms
5 Parallel matrix algorithms
6 Parallel graph algorithms
Basic parallel algorithms / Broadcast/combine

The broadcasting problem:
  initially, one designated processor holds a value a
  at the end, every processor must hold a copy of a

The combining problem (complementary to broadcasting):
  initially, every processor holds a value a_i, 0 ≤ i < p
  at the end, one designated processor must hold a_0 • ··· • a_{p−1} for a given associative operator • (e.g. +)

By symmetry, we only need to consider broadcasting
Basic parallel algorithms / Broadcast/combine

Direct broadcast: the designated processor makes p − 1 copies of a and sends them directly to the destinations

comp O(p)   comm O(p)   sync O(1)

(from now on, cost components will be shaded when they are optimal, i.e. cannot be improved under reasonable assumptions)
Basic parallel algorithms / Broadcast/combine

Binary tree broadcast:
  initially, only the designated processor is awake
  processors are woken up in log p rounds
  in every round, every awake processor makes a copy of a and sends it to a sleeping processor, waking it up

In round k = 0, …, log p − 1, the number of awake processors is 2^k

comp O(log p)   comm O(log p)   sync O(log p)
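The wake-up rounds can be simulated directly; the following Python sketch (illustrative, not from the slides) checks that the number of rounds is indeed ⌈log p⌉.

```python
import math

def tree_broadcast(p, value, root=0):
    """Simulate binary tree broadcast: in round k, each of the 2^k
    awake processors sends one copy to a distinct sleeping processor."""
    mem = {root: value}            # awake processors and their copy of a
    rounds = 0
    while len(mem) < p:
        awake = list(mem)
        sleeping = [i for i in range(p) if i not in mem]
        for src, dst in zip(awake, sleeping):   # one send per awake proc
            mem[dst] = mem[src]
        rounds += 1
    return mem, rounds

mem, rounds = tree_broadcast(8, "a")
assert all(mem[i] == "a" for i in range(8))
assert rounds == math.ceil(math.log2(8))        # 3 rounds for p = 8
```

Each round is one superstep, which is where the O(log p) sync cost comes from.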
Basic parallel algorithms / Broadcast/combine

The array broadcasting/combining problem: broadcast/combine an array of size n ≥ p elementwise
(effectively, n independent instances of broadcasting/combining)
Two-phase array broadcast:
  partition the array into p blocks of size n/p
  scatter the blocks, then total-exchange the blocks

comp O(n)   comm O(n)   sync O(1)
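A minimal Python sketch of the two phases (illustrative, not from the slides): in the scatter, the source sends block i to processor i, costing O(n) communication at the source; in the total exchange, every processor sends its n/p-sized block to all p processors, again O(n) per processor.

```python
def two_phase_broadcast(array, p):
    """Phase 1 (scatter): source sends block i to processor i.
    Phase 2 (total exchange): every processor sends its block to all.
    Returns the local copy held by each processor at the end."""
    n = len(array)
    assert n % p == 0
    blocks = [array[i * n // p:(i + 1) * n // p] for i in range(p)]
    # After the total exchange, every processor holds all p blocks:
    return [sum((blocks[j] for j in range(p)), []) for i in range(p)]

copies = two_phase_broadcast(list(range(8)), 4)
assert all(c == list(range(8)) for c in copies)
```

The point is that neither phase makes the source a bottleneck beyond O(n), unlike p direct sends of the whole array, which would cost O(np) at the source.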
Basic parallel algorithms / Balanced tree and prefix sums

The balanced binary tree dag tree(n):
  a generalisation of broadcasting/combining
  can be defined top-down (root the input, leaves the outputs) or bottom-up

tree(n) has 1 input, n outputs, size n − 1, depth log n

Sequential work O(n)
Parallel balanced tree computation

From now on, we always assume that a problem's input/output is stored in the external memory

Partition tree(n) into:
  one top block, isomorphic to tree(p)
  a bottom layer of p blocks, each isomorphic to tree(n/p)
Parallel balanced tree computation (contd.):
  a designated processor is assigned the top block; the processor reads the input from external memory, computes the block, and writes the p outputs back to external memory
  every processor is assigned a different bottom block; a processor reads the input from external memory, computes the block, and writes the n/p outputs back to external memory

For bottom-up computation, reverse the steps

n ≥ p^2   comp O(n/p)   comm O(n/p)   sync O(1)
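For the bottom-up (combining) direction, the two-level block structure can be sketched in a few lines of Python (illustrative, not from the slides): the p bottom blocks each combine n/p elements, and the top block combines the p partial results.

```python
from functools import reduce

def block_combine(a, p, op):
    """Bottom-up balanced tree in two levels: each of the p processors
    combines its own block of n/p elements (a bottom block), then a
    designated processor combines the p partial results (the top block)."""
    n = len(a)
    partial = [reduce(op, a[i * n // p:(i + 1) * n // p]) for i in range(p)]
    return reduce(op, partial)

assert block_combine(list(range(16)), 4, lambda x, y: x + y) == sum(range(16))
```

Per processor this is O(n/p) work for a bottom block plus O(p) for the top block, which is O(n/p) overall once n ≥ p^2.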
The described parallel balanced tree algorithm is fully optimal:
  comp O(n/p) = O(sequential work/p): optimal
  comm O(n/p) = O(input/output size/p): optimal
  sync O(1): optimal

For other problems, we may not be so lucky. However, we are typically interested in algorithms that are optimal in comp (under reasonable assumptions). Optimality in comm and sync is considered relative to that.

For example, we are not allowed to run the whole computation in a single processor, sacrificing comp and comm to guarantee optimal sync O(1)!
Basic parallel algorithms / Balanced tree and prefix sums

Let • be an associative operator, computable in time O(1): a • (b • c) = (a • b) • c
E.g. numerical +, ·, min, …

The prefix sums problem:

  (a_0, a_1, a_2, …, a_{n−1}) → (a_0, a_0 • a_1, a_0 • a_1 • a_2, …, a_0 • a_1 • ··· • a_{n−1})

Sequential work O(n)
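The sequential baseline is a single left-to-right sweep with n − 1 applications of •; a Python sketch (not from the slides):

```python
def prefix_sums(a, op):
    """Sequential prefix computation: out[i] = a[0] • a[1] • ... • a[i],
    using n - 1 applications of the associative operator op."""
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

assert prefix_sums([1, 2, 3, 4], lambda x, y: x + y) == [1, 3, 6, 10]
assert prefix_sums([3, 1, 2], min) == [3, 1, 1]
```

Note that only associativity is used: the same code works for +, ·, min, or any other associative op.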
The prefix circuit [Ladner, Fischer: 1980]

prefix(n) is built recursively: one level of • combines neighbouring pairs into a_{0:1}, a_{2:3}, …, a_{n−2:n−1}; the circuit prefix(n/2) on these pair sums yields a_{0:1}, a_{0:3}, …, a_{0:n−1}; one more level of • fills in the even-indexed prefixes a_0, a_{0:2}, a_{0:4}, …
Here a_{k:l} = a_k • a_{k+1} • ··· • a_l, and "∗" is a dummy value

The underlying dag is called the prefix dag

The prefix circuit (contd.): prefix(n) has n inputs, n outputs, size 2n − 2, depth 2 log n
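The recursion can be written out directly; a Python sketch of the Ladner–Fischer structure for n a power of two (illustrative, not from the slides):

```python
def lf_prefix(a, op):
    """Ladner-Fischer prefix: combine neighbouring pairs, recurse on
    the n/2 pair sums (giving the odd-indexed prefixes a_{0:1}, a_{0:3},
    ...), then fill in the even-indexed prefixes with one more op level."""
    n = len(a)
    if n == 1:
        return a[:]
    pairs = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    half = lf_prefix(pairs, op)          # a_{0:1}, a_{0:3}, a_{0:5}, ...
    out = []
    for i in range(n):
        if i == 0:
            out.append(a[0])             # a_0 passes straight through
        elif i % 2 == 1:
            out.append(half[i // 2])     # odd prefixes come from recursion
        else:
            out.append(op(half[i // 2 - 1], a[i]))  # one extra op each
    return out

assert lf_prefix([1, 2, 3, 4], lambda x, y: x + y) == [1, 3, 6, 10]
```

The size recurrence S(n) = S(n/2) + n − 1 + n/2 − 1 solves to O(n), matching the stated size 2n − 2 and depth 2 log n up to the exact constants.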
Parallel prefix computation

The dag prefix(n) consists of:
  a dag similar to bottom-up tree(n), but with an extra output per node (total n inputs, n outputs)
  a dag similar to top-down tree(n), but with an extra input per node (total n inputs, n outputs)

Both trees can be computed by the previous algorithm. The extra inputs/outputs are absorbed into the O(n/p) communication cost.

n ≥ p^2   comp O(n/p)   comm O(n/p)   sync O(1)
Application: binary addition via Boolean logic

x + y = z
Let x = x_{n−1}, …, x_0; y = y_{n−1}, …, y_0; z = z_n, z_{n−1}, …, z_0 be the binary representations of x, y, z
The problem: given x_i, y_i, compute z_i using bitwise ∧ ("and"), ∨ ("or"), ⊕ ("xor")

Let c = c_{n−1}, …, c_0, where c_i is the i-th carry bit
We have: x_i + y_i + c_{i−1} = z_i + 2c_i, 0 ≤ i < n
x + y = z
Let u_i = x_i ∧ y_i, v_i = x_i ⊕ y_i, 0 ≤ i < n
Arrays u = u_{n−1}, …, u_0 and v = v_{n−1}, …, v_0 can be computed in size O(n) and depth O(1)

z_0 = v_0                      c_0 = u_0
z_1 = v_1 ⊕ c_0                c_1 = u_1 ∨ (v_1 ∧ c_0)
···                            ···
z_{n−1} = v_{n−1} ⊕ c_{n−2}    c_{n−1} = u_{n−1} ∨ (v_{n−1} ∧ c_{n−2})
z_n = c_{n−1}

Resulting circuit has size and depth O(n). Can we do better?
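Evaluated naively, these equations are just ripple-carry addition; a Python sketch (illustrative, not from the slides; bits are little-endian, i.e. index 0 is the least significant bit):

```python
def add_binary(x_bits, y_bits):
    """Direct evaluation of the carry recurrence: size O(n) but depth
    O(n), since each c_i depends on c_{i-1}. Bits are little-endian."""
    n = len(x_bits)
    u = [x & y for x, y in zip(x_bits, y_bits)]   # u_i = x_i AND y_i
    v = [x ^ y for x, y in zip(x_bits, y_bits)]   # v_i = x_i XOR y_i
    z, c = [], 0
    for i in range(n):
        z.append(v[i] ^ c)        # z_i = v_i XOR c_{i-1}
        c = u[i] | (v[i] & c)     # c_i = u_i OR (v_i AND c_{i-1})
    z.append(c)                   # z_n = c_{n-1}
    return z

to_bits = lambda x, n: [(x >> i) & 1 for i in range(n)]
from_bits = lambda b: sum(bit << i for i, bit in enumerate(b))
assert from_bits(add_binary(to_bits(13, 4), to_bits(11, 4))) == 24
```

The sequential carry chain is exactly the O(n)-depth bottleneck that the prefix formulation on the next slides removes.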
We have c_i = u_i ∨ (v_i ∧ c_{i−1})
Let F_{u,v}(c) = u ∨ (v ∧ c), so that c_i = F_{u_i,v_i}(c_{i−1})
We have c_i = F_{u_i,v_i}(… F_{u_0,v_0}(0)) = F_{u_0,v_0} ◦ ··· ◦ F_{u_i,v_i}(0)

Function composition ◦ is associative:
F_{u′,v′} ◦ F_{u,v}(c) = F_{u,v}(F_{u′,v′}(c)) = u ∨ (v ∧ (u′ ∨ (v′ ∧ c))) = u ∨ (v ∧ u′) ∨ (v ∧ v′ ∧ c) = F_{u∨(v∧u′), v∧v′}(c)
Hence F_{u′,v′} ◦ F_{u,v} = F_{u∨(v∧u′), v∧v′} is computable from u, v, u′, v′ in time O(1)

We compute F_{u_0,v_0} ◦ ··· ◦ F_{u_i,v_i} for all i by prefix(n)
Then compute c_i, z_i in size O(n) and depth O(1)
The resulting circuit has size O(n) and depth O(log n)
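The key step, representing each F_{u,v} by its pair (u, v) and composing pairs associatively, can be checked with a small Python sketch (illustrative, not from the slides); here the prefix of compositions is computed by a plain sequential sweep, but any prefix algorithm, including the O(log n)-depth circuit, would do.

```python
def carry_lookahead_add(x_bits, y_bits):
    """Carry computation as a prefix problem: represent F_{u,v} by the
    pair (u, v); composing F_{U,V} (applied first) with F_{u,v} gives
    the pair (u | (v & U), v & V), which is associative, so all carries
    come from one prefix pass over the (u_i, v_i) pairs."""
    n = len(x_bits)
    pairs = [(x & y, x ^ y) for x, y in zip(x_bits, y_bits)]
    comp = [pairs[0]]                 # prefix of composed functions
    for u2, v2 in pairs[1:]:
        U, V = comp[-1]
        comp.append((u2 | (v2 & U), v2 & V))
    carries = [U for U, V in comp]    # c_i = composed F applied to 0
    z = [pairs[0][1]]                 # z_0 = v_0
    z += [pairs[i][1] ^ carries[i - 1] for i in range(1, n)]
    z.append(carries[-1])             # z_n = c_{n-1}
    return z

to_bits = lambda x, n: [(x >> i) & 1 for i in range(n)]
from_bits = lambda b: sum(bit << i for i, bit in enumerate(b))
assert from_bits(carry_lookahead_add(to_bits(13, 4), to_bits(11, 4))) == 24
```

Since the pair composition is associative and O(1), replacing the sweep by prefix(n) yields the stated size-O(n), depth-O(log n) adder.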
Basic parallel algorithms / Fast Fourier Transform and the butterfly dag

A complex number ω is called a primitive root of unity of degree n if ω, ω^2, …, ω^{n−1} ≠ 1 and ω^n = 1

The Discrete Fourier Transform problem: F_{n,ω}(a) = F_{n,ω} · a = b, where F_{n,ω} = (ω^{ij})_{i,j=0}^{n−1}:

  [ 1  1        1        ···  1        ]   [ a_0     ]   [ b_0     ]
  [ 1  ω        ω^2      ···  ω^{n−1}  ]   [ a_1     ]   [ b_1     ]
  [ 1  ω^2      ω^4      ···  ω^{n−2}  ] · [ a_2     ] = [ b_2     ]
  [ ·  ·        ·        ···  ·        ]   [ ·       ]   [ ·       ]
  [ 1  ω^{n−1}  ω^{n−2}  ···  ω        ]   [ a_{n−1} ]   [ b_{n−1} ]

i.e. Σ_j ω^{ij} a_j = b_i for i, j = 0, …, n − 1

Sequential work O(n^2) by matrix-vector multiplication
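The O(n^2) baseline is direct evaluation of the sum Σ_j ω^{ij} a_j with ω = e^{2πi/n}, a primitive root of unity of degree n; a Python sketch (illustrative, not from the slides):

```python
import cmath

def dft(a):
    """Direct O(n^2) evaluation of b_i = sum_j w^(ij) a_j,
    with w = e^(2*pi*i/n) a primitive n-th root of unity."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (i * j) * a[j] for j in range(n)) for i in range(n)]

b = dft([1, 0, 0, 0])
assert all(abs(x - 1) < 1e-9 for x in b)   # DFT of a unit impulse is all ones
```

The FFT on the following slides reduces this to O(n log n) work by exploiting the structure of F_{n,ω}.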
The Fast Fourier Transform (FFT) algorithm ("four-step" version)

Assume n = 2^{2r}; let m = n^{1/2} = 2^r
Let A_{u,v} = a_{mu+v}, B_{s,t} = b_{ms+t}, for s, t, u, v = 0, …, m − 1
Matrices A, B are the vectors a, b written out as m × m matrices